
CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
A COMPARISON OF FOUR ORAL
LANGUAGE TESTS
A thesis submitted in partial satisfaction of the
requirements for the degree of Master of Arts in
Elementary Education,
Bilingual Bicultural Education
by
Susan Colman Acosta
January, 1981
The Thesis of Susan Colman Acosta is approved:
Dr. Augusto Britton
Dr. Ed Labinowicz
Dr. C. Ray Graham
California State University, Northridge
ACKNOWLEDGEMENTS
This has been a long and hard process which has been
smoothed by the help and support of many people.
First
of all, my family has endured much and I thank them.
Next, it would have been impossible without the help of
my Committee Chairman Dr. C. Ray Graham.
Many others contributed:
Ethel Cullom, who loaned me research
material; Dr. Augusto Britton and Dr. Ed Labinowicz of
my Committee; Dr. Robert Effler, Principal of Miramonte
Elementary School and Ms. Irene Curtis of Area 2, LAUSD,
who gave me permission to use the children of Miramonte
as the subjects in this study.
The study could not have taken place without the
consent and cooperation of the following teachers at
Miramonte:
First grade - M. Merino, P. Reding, M. Block,
J. Miller, J. Lawrence, B. Earhardt, S. Sutton, C. Korten;
Third grade - T. Dooley, D. Martinez, J. Mickelberry,
P. York, K. Bentson, G. Smoot, S. Orange, and T. Lopez.
A special thanks goes to my three examiners who
helped me make it happen:
Elena Romero, Barbara Gerlicke,
and Elizabeth Najarian.
Lastly, I must also express my appreciation for the
loving encouragement over the years that my parents, Edward
and Mary Colman have expressed.
I only wish my mother
were still alive to enjoy the success of her daughter.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS

ABSTRACT

Chapter
   I. INTRODUCTION
         Background
         Statement of the Problem
         Definition of Terms
  II. REVIEW OF THE LITERATURE
         Purpose of Tests
         What Is Proficiency
         How Is Proficiency To Be Measured
         The Four Tests
         Practicality
 III. DESIGN AND PROCEDURE OF THE STUDY
         Tests
         Subjects
         Design
         Procedure
         Statistical Analysis
  IV. FINDINGS OF THE STUDY
   V. SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
         Recommendations

BIBLIOGRAPHY

APPENDICES
ABSTRACT
A COMPARISON OF FOUR ORAL
LANGUAGE TESTS
by
Susan Colman Acosta
Master of Arts in Education
A study was made comparing four oral language tests
designated by the State of California for use in classifying the English language fluency of children whose home
language is not English.
The four tests were:
Basic
Inventory of Natural Language (BINL), Bilingual Syntax
Measure (BSM), Language Assessment Battery (LAB), and
Language Assessment Scales (LAS).
None of the tests was
fully adequate to fulfill the testing and classification
requirements of LAU versus NICHOLS.
The BSM and the LAS
appeared to be the better of the four tests based on the
correlations made, the percentage of agreement on placement, and practical considerations.
More research into
the evaluation of child language acquisition is indicated.
CHAPTER I
INTRODUCTION
Since the advent of Spanish/English bilingual
education programs in Dade County, Florida, in the early
1960's, there has been a growing need for instruments to
measure oral language proficiency in a variety of
languages.
As a result of the LAU versus NICHOLS Supreme
Court decision in 1974, state and federal laws require
that students be instructed in a language they can
understand.
Moreover, national policy as established in
the Bilingual Education Act of 1974 calls for a national
assessment to identify limited- and non-English-speaking
children for purposes of carrying out bilingual education
programs in the United States.
There is a consequent
need to be able to assess students' language proficiency/
dominance in order to place them in appropriate programs.
In response to this need, there has been a proliferation
of tests for measuring every aspect of bilingual children's performance, especially in oral language.
A review
of bibliographies of current assessment instruments for
bilingual programs reveals more than 100 tests for the
measurement of language proficiency/dominance alone,
either published or in development (Center for Bilingual
Education, 1978, and Fletcher, Locks, Reynolds and Sisson,
1978).
Background
In 1978, the State of California Department of
Education designated four English language proficiency
tests to be used in the assessment process required by
LAU versus NICHOLS.
The four tests are:
Basic Inventory of Natural Language (BINL), 1977;
Bilingual Syntax Measure (BSM), 1975;
Language Assessment Battery (LAB), 1976;
Language Assessment Scales (LAS), 1975, 1977.
Each of these tests has been examined and discussed
individually by one or more of the following researchers: Britton
(1975), Center for Bilingual Education (1978), DeAvila
and Duncan (1977), Gil (1976), Helmer (1977), Politzer
and McKay (1974), Fletcher, Locks, Reynolds and Sisson
(1978), and Randle (1975), but there is a need to compare
all four to determine if they measure the same skills.
School districts in California are currently faced
with the selection of one of these four tests to satisfy
federal and state requirements.
If they do not choose
one of the four, they must justify the choice of a
different test.
There is a need for more information
based upon a comparison of all four tests to help in the
selection process.
Statement of the Problem
This study addresses itself to this need.
Specifically, it seeks to answer the following questions:
Does
each of the instruments render comparable English
proficiency scores for children learning English as a
Second Language?
Is there a difference between them that
would make one test more reliable or valid than another?
This thesis compares, through correlational statistical
analysis, all four tests, using a sample of primary grade
children to whom all four tests were administered.
The resulting information has potential value to
those faced with the task of choosing one of these tests.
By considering the specific requirements of a particular
district or school, the test that will be most appropriate
can be chosen.
Definition of Terms
Proficiency is a complex concept and has not been
well defined in the literature.
In the broadest sense,
proficiency refers to the ability to use language in a
multitude of contexts, both productively and receptively,
in oral as well as written skills.
In this thesis, the
use of the term is limited to oral proficiency, i.e.,
the individual's ability to understand spoken language
and to speak fluently.
In practice, proficiency has
been divided into the following categories for state and
federal programs in California:
Non-English Speaker (NES)
Limited-English Speaker (LES)
Functional-English Speaker (FES)
Proficient-English Speaker (PES).
Dominance commonly refers to an individual's
preferred language, or to relative proficiency in two
or more languages.
In the present discussion, dominance
will refer to the comparison of speaking and listening
skills in two or more languages.
The exact definitions
of these terms, as will be seen in a later discussion,
vary tremendously and are determined by the individual
instrument under consideration.
Validity refers to the extent to which a test
measures what it purports to measure.
There are several
kinds of validity, among which are:
Face validity, or how the test appears on the
surface to the examiner and subject.
Content validity, or how well the test covers the
subject area being tested.
Construct validity refers to the extent to which
a test measures a theoretical construct or trait.
Concurrent or criteria-related validity has to do
with the extent to which a subject's performance on
the test correlates with some external criterion
such as his/her observed performance on another
test which purports to measure the same traits.
Reliability pertains to the dependability of the
scores which a test yields.
In other words, how stable
and consistent are the scores realized from the test.
Practicality refers to such varied aspects as the
skill and time required to administer and score the test,
the cost of administering and scoring it, and the need
for special equipment.
CHAPTER II
REVIEW OF THE LITERATURE
Purpose of Tests
One of the first things that must be decided in
choosing a test is what is to be tested and why.
There
have been many attempts to develop assessment instruments
designed to measure various aspects of language acquisition in children.
Robinson (1970) and Aitken (1975) have identified
several purposes for such tests:
1. survey to gather information about second language competence and evaluation of whole programs;

2. research into effectiveness of different teaching methods, manuals, and audio-visual aids;

3. research into psychology dealing with an individual;

4. research into sociology dealing with groups;

5. evaluation of particular progress concerning:
   a. aptitude
   b. diagnosis
   c. prediction
   d. achievement
   e. classification
   f. proficiency.
Bilingual programs implemented under state and
federal laws might legitimately need tests for a number
of the purposes identified above, but the primary need
is in the area of assessment of proficiency or dominance
for purposes of program placement.
This thesis focusses
on specific instruments designed to assess
English proficiency in children.
What is Proficiency
Structural linguists have viewed language as capable
of being analyzed and divided into sub-categories such
as semantics, syntax, and phonology. Macnamara (1967)
developed a matrix for the language arts utilizing this
type of approach (Fig. 1).
Listening     Speaking      Reading       Writing
semantics     semantics     semantics     semantics
syntax        syntax        syntax        syntax
lexicon       lexicon       lexicon       lexicon
phonemics     phonemics     graphemics    graphemics

Figure 1. Language Arts Matrix
This view of language has led many test makers to
design tests which arrive at a subject's overall proficiency in a given language by determining his proficiency
in each of the sub-areas (i.e., phonology, syntax, etc.)
and then weighting each sub-area to come up with a
numerical proficiency rating.
Within each sub-area,
skills are often measured by isolating a particular
feature (e.g. formation of the past tense, pronunciation
of the [r], etc.) within each test item.
This sort of
test has been called a discrete point test.
Aitken (1976)
connects the audio-lingual method of language instruction
with discrete point approaches and asserts it is based
on two erroneous assumptions:
1. The surface structure of a language can be
   systematically described and its elements
   listed and compared with any other language,
   similarly described.

2. The mastery of a language may be divided
   into the mastery of a number of separate
   skills: listening, speaking, reading,
   writing. These skills in turn may be
   divided into a number of distinct items.
   It assumes that to have developed a
   criterion level of mastery of the skills
   and items listed for that language is to
   know that language. (pp. 7, 8)
The problem with assumption (1) is that it is impossible
to make a list of all the items in a language. This
would lead to long and unwieldy tests.
Assumption (2)
does not take into account the role of previous personal
experience in the assignment of meaning in language
usage.
There is, according to Carroll (1961), a redun-
dancy factor to natural language that allows one to
predict missing elements from the context (i.e., the
cloze procedure) making it difficult to say that any
given language item is essential to communicate or to
establish the functional load to any system.
In other
words, a test of many isolated and separate points of
grammar or lexicon is not a real test of language
(Oller, 1973).
Levine (1976) has pointed out that often discrete
item tests are used because they are technically objective and simple to administer and score.
They do,
however, provide certain types of "information about the
state of a learner's knowledge about a language."
The
caveat is that they do not necessarily provide information
about the ability to understand and use language appropriately in context.
She contends that a kind of vicious
cycle has grown up around discrete point testing.
If a
test is discrete, then the instruction should be similarly
discrete.
If mastery is shown by paper and pencil, then
this is the way it is learned.
Therefore, if this is how
language is taught, then this is what "knowing" a
language must be.
There are a number of other viewpoints against
discrete type testing that generally agree with the
arguments already presented: Jony (1975), Bordie (1970),
Cazden (1975b), and Spolsky, Murphy, Holm and Ferrell
(1972).
Sociolinguists, however, have been reluctant to
attempt to separate a subject's "knowledge" of a particular feature of his language (i.e., competence) from
his use of that feature in normal communication (i.e.,
performance).
In particular they have pointed out that
an adequate measurement of a person's linguistic
repertoire would include, among other things, considerations of what language is being used with whom in what
context when discussing what particular content material.
Communicative ability, according to Groot (1973),
consists of "linguistic and non (para, extra) linguistic
components."
He points out that the descriptive
linguistic components of tests are often considered to
be valid representation of a student's overall language
proficiency.
However,
most of the evidence indicates that communicative ability is more than the sum-total of
these linguistic components.
Or,
overall language proficiency is more than
the knowledge of vocabulary, syntax and
phonology. (p. 138)
This view of language argues for the other type of
language test called an integrative test.
If a discrete
form of test focusses on isolated linguistic units, then
an integrative type has the subject perform tasks similar
to those that occur in real life.
In distinguishing between discrete point tests and
integrative tests, Aitken (1975) stated,
Discrete point tests are based on the
assumption that there are a given number
of specific structure points, the mastery
of which constitutes 'knowing' a language
... an integrative test is one based on
the premise that 'knowing' a language
must be expressed in some functional
statement ... (p. 7)
that taps communicative competence factors.
Valette
(1977) chose to call the integrative type of test a
global language test because it measures the student's
ability to understand and use language in context.
Valette (1977) developed a matrix contrasting the
two types of tests:
(Fig. 2)

Test type              Item type             Competence         Performance
DISCRETE point tests   multiple choice/      linguistic         formal performance
                       short answer items    competence         objectives
GLOBAL tests           communication items   communicative      open-ended performance
                                             competence         objectives

Figure 2. DISCRETE point and GLOBAL tests
Carroll (1961) and Spolsky (1968) propose that:

1. discrete point tests should be used for controlling instruction, deciding what is to be taught, and how well something has been learned;

2. integrative tests should be used for proficiency purposes.
If one accepts the sociolinguist's view of language,
and there is good evidence that elicited speech differs
radically from spontaneous speech in children learning
English as a Second Language (Wong Fillmore, 1976), then
a child's total language proficiency would include his/her
ability to understand and express himself/herself in
every conceivable communicative situation. This would
include using language for:
1) performing different
functions (e.g., giving directions, asking questions,
describing things, narrating experiences, telling stories,
etc.);
2) communicating with different people;
3) communicating appropriately in different social
contexts (e.g., formally, informally, intimately, etc.);
and 4) communicating about different topics (e.g., home,
school, neighborhood, animals, plants, etc.).
Obviously
to sample a child's performance in all of those areas
would be totally impractical if not impossible.
How Is Proficiency To Be Measured
As can be seen from the above discussion, language
proficiency is an extremely complex phenomenon whose
measurement requires sampling of behaviours along a number
of dimensions.
The task of the test designer is to
develop a procedure which will insure that the language
samples collected will be representative of the child's
entire repertoire of important linguistic abilities.
This does not necessarily mean that the instrument must
sample each language behaviour, but in order to be a valid
measure of language proficiency, the skills measured must
bear an identifiable relationship to all of the abilities
necessary to the subject's overall language proficiency.
If proficiency is to be adequately measured, instruments
must meet certain criteria of validity, reliability and
practicality.
There are two kinds of approaches to validity:
logical analysis in which one tries to judge precisely
what the test is measuring, such as content validity; and
empirical analysis in which one tries to relate the test
to a criterion known or assumed to measure some characteristic, such as concurrent and predictive validity.
Construct validity uses both logical and empirical
analysis.
Oller (1977) warned that it is possible that a test
labelled as a proficiency test might really be testing
and classifying I.Q. or some other educational objective,
such as reading.
Hillman (1972) sees validity used as
a predictor of certain vocal-verbal features conditioned
by circumstances surrounding those vocal-verbal features.
There is research by McDavid (1966) which indicated
that stress, intonation, pitch and associated paralanguage gestures are
more indicative of language ability than any
of the other usual language characteristics
normally thought significant for measurement,
such as syntax, vocabulary, grammar, and so
on. (p. 53)
He indicated that although tests are reliable instruments
which can measure accurately and consistently, they may
not be measuring the right thing.
In fact, many language
tests really only measure written English with little
oral production, as in the Language Assessment Battery.
Scharf (1972) was concerned about the variability
in the rate of language growth, which makes age a poor
basis for comparing children, since they do not develop
consistently.
Children may
make reversals as a part of their normal acquisition of
language.
This concern is relevant for this study
because all the tests under consideration in the study
had either different forms of the test given according
to age, or used age as a factor in assigning a linguistic
category according to a score, as occurred with the BINL.
Another factor which must be taken into account is
the interaction between examiner and child.
Various
researchers have found that the type of stimulus materials,
among other things, can have an effect upon the child
(Phillips, 1966; Cohen, 1975; Condon, 1975).
Similarly, age, sex, and socio-economic status of the
child can have an effect (Cowan, Weber, Hoddincott, and
Klein, 1967).
According to Swain (1976), certain aspects of
language are easier to measure:
What rarely are measured are those aspects
of language which are difficult to measure ...
because they are not well enough understood
to develop a relevant test, or because the
collection and analysis of the data are simply
too time consuming. (pp. 13, 14)
In other words, usually language tests measure skill
aspects of language rather than communicative, creative,
or aesthetic aspects.
Often, too, "tests tend to be the
only accepted means of obtaining performance data."
Reliability is important also. To be dependable,
one needs:
multiple samples of the skill being tested;
standard tasks, so that all subjects are required to
perform the same task; standard conditions of administration; and standard scoring or interscorer reliability.
Practicality is the other factor that is germane
to the selection of a test.
Considerations of time,
expense, skills needed for administration and scoring,
and ease of administration are all things that can not
be ignored.
A test may be valid and reliable, but if it
costs too much, it will not be used. Similarly, if it
is awkward to administer or difficult to score, it will
not be used.
To summarize, an instrument needs to be valid in
what it tests and how it tests for that objective; it
needs to be reliable so that it can be used consistently;
it needs to be practical in terms of cost, time, effort
and skill needed to administer and score it.
The Four Tests
In view of the previous discussion in this chapter,
it is appropriate to examine what has been written about
the tests in the study.
There were few, if any, published
references that dealt with the four tests.
Most often
the tests were mentioned as existing, but with no
critiques, usually only factual descriptions (see
Appendix A).
The test information that is available is
primarily limited to the accompanying technical manuals
written by the developers.
A matrix was developed to
compare the four tests used in this study.
(See Figure 3).
PHONOLOGY
   Listening  - LAS: 12.5% of score, 30 items, sound discrimination
                (minimal pairs)
   Production - LAS: 12.5% of score, 35 items, phonemes

LEXICON
   Listening  - LAB I: 50% of 20 items, identifying parts of body
                in pictures
                LAS: 12.5% of score, 20 items, identify pictures
   Production - LAB I: 35% of 20 items, 7 responses to questions
                pointing to picture of body part

SYNTAX
   Listening  - LAS: 12.5% of score, 10 items, identify correct
                picture in response to taped statement; taped story
                with 4 guide pictures (50%, see Production)
   Production - BINL: 100%, 10 spontaneous samples
                BSM: 100%, 26 responses to questions
                LAB I: 15% of 20 items, 3 responses to questions
                LAB II: 100% of 14 items, elicited response
                LAS: 50% of score, retell story heard on tape

Figure 3. Comparison of the Four Tests
BINL
The BINL, Herbert (1977), as shown in Figure 3
measures oral production skills.
The discrete point
approach to language testing is rejected by the author,
Herbert.
His orientation is towards what he terms
natural language.
Therefore no particular structures
or elements are sought and questioning is not allowed
during the language sampling except as it might come up
in a natural conversation.
The act of describing or
telling a story about a picture is considered to be a
natural situation.
Ideally the test is meant to be tape
recorded as a conversation between peers and the role of
the adult examiner is supposed to be minimized. This has
not been true in actuality.
Herbert stated that
Evaluating language production involves two
major aspects of language ability. 1. We
must consider the language competence of
the child; the language the child has at his
command to express his thought.
2. We must
measure the language production of the child;
that which he says.
(p. 6)
Validity
The manual indicates three validity studies that
were done.
The first study dealt with the scoring system.
The contention was that there was a positive correlation
between average sentence length and level of complexity.
The sample size was not large (182 English
Dominant Speakers and 160 Spanish Dominant Speakers) and
no theoretical basis was given for making that type of
correlation.
The second study relied on a single cor-
relation done by Fresno Unified School District about
the BINL's validity for determining Dominance and
Proficiency.
The particular value for this study was not indicated.
The third study compared the BINL Oral
Language Complexity score with the Gilmore Oral Reading
test, which consists of graded paragraphs of increasing
difficulty which are read aloud, and found a relationship
between the two tests.
No reason for the use of that
particular test for correlational purposes was given.
A Spearman-Brown split half coefficient was also
computed, which indicated that there were
consistent levels in oral language complexity
across the 10 sentences spoken
for those students dominant in either language.
Nothing
was indicated about those who were LES.
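To make the split half procedure concrete, here is a minimal sketch in Python, added for illustration; the half-test correlation used is a hypothetical value, not one reported in the BINL manual. The Spearman-Brown formula steps a correlation between two halves of a test up to an estimate for the full-length test:

    def spearman_brown(r_half):
        """Step a half-test correlation up to full test length."""
        return 2 * r_half / (1 + r_half)

    r_half = 0.75                             # hypothetical odd/even half correlation
    print(round(spearman_brown(r_half), 3))   # 0.857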
The claim is made that syntactic complexity parallels
language development in children and there is a direct
correlation between the two. However, Labov (1973),
Pope (1974) and Goldberg (1972) found that complexity of
utterances can be influenced by the task, i.e., the
particular topic being discussed.
The semantic content is not measured.
It does not
matter what a child says as long as it is grammatically
correct, i.e., a child might say, "I see a dog" when it
is an elephant in the picture.
The only task is one of storytelling.
There is no
indication whether this task correlates with other ones
such as giving directions or asking questions.
Reliability
The testing situation lends itself to many interpretations. Cazden (1975) and Phillips (1975) had
doubts about the reliability of tests using this procedure because the testing situation, being contrived,
invites distortion due to anxiety and hypercorrection.
There are other sociolinguistic variables also,
such as the race of the examiner, that have not been
considered.
The manual mentions that a split half correlation
was done on an experimental group to determine the
reliability of the instrument.
While this is useful and
provides some reliability information for the, tests, it
does not deal with the weightier problem of variability
in testing conditions which might be revealed in a test-retest study.
The fact that the instrument has been in
use for several years and no such test has been conducted
raises suspicions regarding the reliability of the test.
BSM
The BSM was developed by Burt, Dulay and Hernandez
(1975).
It has a very complete theoretical framework
that
is derived from the assumption that children
acquire a second language by a process of
creative construction.
(p. 11)
They used many sources on child acquisition of languages,
both L1 and L2, as a basis for the design of the BSM.
There seems to be, according to certain studies, a common
order of acquisition of certain English grammatical
morphemes by children acquiring English as a second
language in the United States.
After considering and
rejecting pronunciation, vocabulary and functional use
of language as possible indicators of proficiency, the
authors chose a discrete point approach using syntax as
the sole measure.
The test was constructed so as to
elicit naturally a range of structures in varying phases
of acquisition using a structured conversation technique.
Questions were asked referring to one or more of a set
of pictures.
Validity
The authors of the BSM chose to use construct
validity as the most appropriate way of approaching the
question of validity.
According to Burt, Dulay
and Hernandez, the construct validity of the BSM is
supported by:
1. the body of psycholinguistic theory and
   research supporting the concept of natural
   sequences of acquisition;

2. the statistical procedures and research
   which have resulted in the production of
   the acquisition hierarchy and of the
   scoring system which is based on this
   hierarchy; and

3. evidence that the BSM classifications reflect
   the relationships expected to be found among
   bilingual children. (p. 32)
The only linguistic aspect tested is syntax.
Several
studies cited in a paper by Krashen (1978) showed that
children acquiring English either as a first or second
language showed a similar order of acquisition of
grammatical morphemes.
This "natural order" of language acquisition
theory is not without its critics (Rosansky, 1976), some
of whom achieved different results using the BSM than
those indicated by the authors.
Reliability
The two major reliability studies undertaken dealt
with test-retest reliability and interscorer reliability.
On the test-retest study, one out of three students
scoring on levels 4 and 5, which are the most crucial
levels for discriminating between subjects for program
placement, was placed at least one level higher or lower
on the post test than on the pre-test.
This means that
one out of three students could be misplaced by a single
administration of the BSM.
However, it was ascertained
that the interscorer reliability was higher for the
English test than for the Spanish test, due to
disagreements over the proper forms in the
many dialects of Spanish spoken.
LAB
The LAB was developed by the Board of Education of
New York City in response to a Consent Decree of 1974.
It is a discrete point test with a great emphasis upon
receptive skills, since in each of the three versions
there are far more items in listening than in any other
oral language skill. There is also less emphasis placed
upon production.
Since the LAB was intended to be given
only as a battery, it was not possible to use only the
oral production sections separately and we were unable
to use the results from this test in the study.
Therefore,
no further discussion of its validity or reliability
will be attempted.
LAS
The LAS was developed by DeAvila and Duncan (1975).
It measures a student's performance across four linguistic
subsystems:
phonemic, referential, syntactical and
pragmatic.
There is a theoretical basis given for the
inclusion of each subsystem.
It provides a profile of
the linguistic problems of the individual child, which
is one of its strong points.
In order to avoid problems
of interscorer reliability, a tape cassette is used for
almost all of the test so that the children tested hear
the same thing in the same manner.
It is somewhat complex
to administer since the examiner must handle a tape
recorder, a scoring sheet and a test booklet with faint
pictures, and also at one point must write down a story
told by the child.
Validity
The LAS is discrete point for the first four parts
of the test and then changes in the last section with
the re-told story.
The two halves do not correlate very well. In the
study for internal reliability, the syntax
production is excluded.
The totals for subsystems were: .89 for phonemes,
.87 for minimal pairs, .72 for lexical,
and .68 for comprehension. The comprehension coefficient
appears to be lower than the others.
There does not seem
to be very much connection between the two halves of the
test.
For instance, the pronunciation section is con-
trasted with the storytelling where accent is irrelevant.
Yet the storytelling section is worth half the total
score.
The same problems occurred in the production
section as with the BINL, where content was not particularly important.
The validity studies done by the authors dealt
primarily with the syntactic section.
An attempt is
being made to show criterion validity, i.e., the
classification received on the LAS correlates with
academic achievement.
Those studies are in progress.
Reliability
As mentioned above, the LAS is the
only one of the four tests studied whose authors
controlled for interscorer reliability.
This was done by using a
cassette to administer the majority of the test items.
The only problem was with the children in the study who
were unused to a tape recorder.
They seemed overcome
by it at times.
Practicality
All oral language tests must be administered
individually which has a direct effect upon considerations
of practicality.
Practical experience in the administra-
tion and scoring of all four tests in the study has led
to the following conclusions.
In terms of time and
equipment necessary to administer the tests, LAS required
the most equipment and the most time, since there was no
cut-off mechanism when a child was obviously NES.
A tape
recorder and pre-recorded tape, scoring sheet and pencil,
and test booklet with pictures were required.
The LAB
required a scoring sheet, instruction/picture booklet,
student answer sheet and pencils.
Since the written part
was not administered for this study, it is not as familiar
to the researcher.
However, it is the only test that
could be administered to a small group.
The BSM used a
picture booklet, a scoring sheet with the questions on it
and pencils.
There was a cut-off mechanism to use if the
child could not answer certain questions.
The BINL
required pictures and a tape recorder to administer.
The
taped response was later transcribed and scored.
As far as the skill and training necessary to
administer the four tests, the LAB was administered
according to the manual without any problems.
The BSM
was also explicit in what the examiner was to do and say.
The LAS, since it used a cassette, avoided problems in
this area.
For the re-told story part, questions were
indicated on the scoring sheet.
The BINL was the only
one where the training and the manual appeared to be
inadequate.
Even though the Los Angeles School District
provided the training at several inservices, there was
a great deal of variety in the actual administration of
the test.
Ease of scoring is very important and, since none
of the tests except the LAB is totally discrete, there
is an element of judgment
involved.
The BSM is easier
to score because the desired structure is indicated.
The
first part of the LAS is very simple, since only incorrect
responses are indicated on the score sheet.
However, the
oral production section involves a great deal of judgment
to score.
To summarize, a certain amount of time is involved
in the administration of the tests, more for the LAS than
the others.
A certain amount of skill and training is
necessary to administer the four tests, with the BINL
requiring more intensive training.
The scoring is simpler
for the LAB and the BSM and more difficult for the LAS
and the BINL.
The transcription of the BINL can be very tedious.
It is also an inhibitor to the examiner and
possibly to the subject when answers must be dictated
and written down.
All of the four tests had weaknesses to one degree
or another.
In a paper presented at the National Associa-
tion of Bilingual Educators conference in Seattle in 1979,
Dieterich, Freeman and Crandall discussed some findings
of a recent study of proficiency tests.
Their conclusions
were that the tests that exist today are appropriate for
discriminating between the extremes of proficiency but
are unable to distinguish more exactly the range of
proficiency between these two extremes.
Almost every test
fails to measure what it says it is measuring.
For
instance, the LAS purports to measure the understanding
of a passive sentence where only one picture shows the
subject of the sentence and therefore it is a measure of
vocabulary.
Similarly, the LAB uses a sentence
intended to test grammar, but again only one picture of
the subject is shown and so it is vocabulary that is
tested.
The BINL type of test where a child is asked to
tell a story or describe a picture tends to penalize
the child who is conservative in the language learning
style.
There is a problem, too, with those tests that
require a complete sentence for an answer since normal
speech is somewhat elliptical.
The assignment of
complexity level to utterances is debatable due to
assumptions made about the complexity of certain
structures.
The result of this study, according to Dieterich,
Freeman and Crandall, is that
those tests which adhere to the evidence
about English language acquisition are
certainly more useful, but they are not
enough. Until we have more evidence of
what linguistic, social, and other skills
children actually need to enable them to
function in an English-speaking classroom
(including cognitive strategies), the
people in schools who are probably in the
best position to know ... are the teachers
who deal with the students on a daily
basis.
(p. 20)
They also feel that more than one measure is needed to
supplement the teacher's judgment.
The only problem with teacher judgment is that
there are some teachers who are not truly perceptive and
feel if a child can carry on a simple conversation or
obey classroom commands, then that child is fluent.
In
that situation the test then is necessary to give a
somewhat more accurate assessment of the child's
proficiency.
Thus it becomes evident that there is little
agreement on type of test, what is to be tested, and
whether or not it should be standardized or criterion-referenced, which leads us back to the idea that it may
not be possible to measure the language of children with
anything more than superficiality.
In Chapter III the design and procedure of the
study will be discussed.
CHAPTER III
DESIGN AND PROCEDURE OF THE STUDY
The purpose of this study was to administer four
oral English language proficiency tests to a target
population to determine whether the tests were measuring
the same factors, and if they were assigning the children
to the same linguistic categories.
Tests
The four tests and the range of administration are:
1. Basic Inventory of Natural Language (BINL): K - 12

2. Bilingual Syntax Measure (BSM)
   Level I:  K - 2
   Level II: 3 - 12

3. Language Assessment Battery (LAB)
   Level I:   K - 2
   Level II:  3 - 6
   Level III: 7 - 12

4. Language Assessment Scales (LAS)
   Level I:  K - 5
   Level II: 6 - 12
Subjects
The target population for the study was identified
as 150 first graders in 8 classes and 145 third graders
in 8 classes, identified as speaking other than English
at home by the Home Language Survey required by the
Los Angeles Unified School District to satisfy LAU
requirements.
The survey form is completed by the
child's parents, indicating what language(s) are used at
home and what language the child hears or speaks.
The majority of the children were from Mexico. The
school is an inner-city school.
Random samples of 61 first
graders and 64 third graders were selected using class
lists for names.
The combined sample totalled 125
children.
Design
To minimize testing order and examiner effect, a
counterbalanced design was used.
The subjects for each
grade level were randomly assigned to each of the four
examiners, for whom the order of testing was varied.
The tests were administered in the following order:
Examiner I   - BINL, LAB, BSM, LAS
Examiner II  - LAS, BSM, LAB, BINL
Examiner III - BSM, LAS, BINL, LAB
Examiner IV  - LAB, BINL, LAS, BSM
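To illustrate the counterbalancing, the following is a minimal sketch in Python, added for illustration and not part of the original study; Examiner II's order is the one reconstructed above, and the subject identifiers are hypothetical. Each test occupies each position exactly once across the four orders, so order effects are balanced:

    import random

    # The four counterbalanced (Latin square) test orders, one per examiner.
    ORDERS = {
        "Examiner I":   ["BINL", "LAB", "BSM", "LAS"],
        "Examiner II":  ["LAS", "BSM", "LAB", "BINL"],
        "Examiner III": ["BSM", "LAS", "BINL", "LAB"],
        "Examiner IV":  ["LAB", "BINL", "LAS", "BSM"],
    }

    def assign_subjects(subjects):
        """Randomly spread subjects across the four examiners."""
        pool = list(subjects)
        random.shuffle(pool)
        examiners = list(ORDERS)
        return {s: examiners[i % len(examiners)] for i, s in enumerate(pool)}

    print(assign_subjects(["S1", "S2", "S3", "S4", "S5"]))   # hypothetical subjects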
Procedure
The examiners were all native speakers of English
who were bilingual in the children's first language.
They were trained in the administration of the instruments
through five hours of inservicing by the author of this
study, except for the BINL, which was conducted by the
Los Angeles Unified School District.
All teachers
received BINL inservice as part of a district-wide
program.
Test manuals were available during the training.
One week of testing was planned per grade, with one
test administered to a child per day.
All first graders
were tested in the morning since their school day ended
shortly after lunch.
The third graders were tested
during the morning and afternoon.
All testing was
completed in four weeks, including testing those children
absent on the regular testing day.
There was strict
adherence to the order of testing as detailed above.
Each child was tested individually with only one
test given per day.
In order to gain the cooperation of
the teachers whose classes were involved in the study,
an agreement to do their group BINL testing was made.
BINL testing was conducted in the classroom in accordance
with the school district's policy.
The other tests were
given outside the classroom where possible so as not to
disturb the teachers and lose their cooperation.
Only one form of test was used for the BINL and the
LAS tests.
Levels I and II were used for the BSM and the LAB.
To include as many levels of the tests as possible,
first and third grades were selected as the target
population.
The BINL transcriptions, made by the examiners,
were sent to the Bilingual/ESL section of the Los Angeles
Unified School District to be machine scored with the
rest of the school's tests, since the results were to
be used by the district as well as this study.
The
remainder of the tests were handscored by the researcher.
In order to have as parallel a testing situation as
possible, only the Speaking and Listening, Level I, and
the Speaking, Level II sections of the LAB were used.
The written sections were not used.
However, since the
LAB was intended to be given as a battery of tests, the
scoring manual did not give any separate test equivalents,
and the results could not be subjected to the statistical
analysis along with the other tests.
For the other three examiners' reactions to the
tests and their administration of them see Appendix B.
Statistical Analysis
With the exception of the BINL, which was machine-scored by the school district, the tests were hand
scored by the researcher.
Both the BSM and the LAS were
complicated to score, since there were some judgments
that had to be made about the complexity of responses.
The LAS was more difficult to score because it required
scoring of the dictated re-told story, using samples in
the manual which were scored according to the age of the
subject.
After administering the tests and scoring them,
there were four scores for each child.
These were first
arranged by grade level and examiner (See Appendix C).
After keypunching these scores, two data decks were
obtained:
one with the raw scores, and one with the
converted, or z-scores.
It was necessary to use z-scores for comparison of
different tests.
The z-score is a standard score that
is defined as the distance of a score from the means as
measured by standard deviation units (Ary, Jacobs,
Razavieh, 1972).
The formula for finding the z-score is:

    z = x / σ

where

    X = raw score
    X̄ = the mean of the distribution
    σ = the standard deviation of the distribution
    x = deviation score (X - X̄)
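As a concrete illustration of the conversion, here is a minimal sketch in Python, added for illustration; the raw scores are hypothetical:

    def z_scores(raw_scores):
        """Convert raw scores to z-scores: z = (X - mean) / sd."""
        n = len(raw_scores)
        mean = sum(raw_scores) / n
        # Standard deviation of the distribution (population form).
        sd = (sum((x - mean) ** 2 for x in raw_scores) / n) ** 0.5
        return [(x - mean) / sd for x in raw_scores]

    raw = [12, 15, 9, 20, 14]                   # hypothetical raw scores
    print([round(z, 2) for z in z_scores(raw)])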
The size of the sample, N=125 (61 first graders and
64 third graders), was sufficient to use several correlational measures.
Ary, Jacobs, and Razavieh (1972) define
correlation as
The statistical technique used for measuring
the degree of relationship between two
variables
Correlations show us the
extent to which values in one variable are
linked or related to values in another
variable ... The measure of correlation
between two variables results in a value
that ranges from -1 to +1 ... A coefficient of correlation near unity, -1 or
+1, indicates a high degree of relationship
Correlation coefficients in educational
and psychological measures, because of the
complexity of these phenomena, seldom reach
the maximum points of +1 or -1. For these
measures, any coefficient that is more than
.90 is usually considered to be very high.
(pp. 115, 116)
Three correlations were considered:
the Pearson product
moment coefficient (r), the Spearman rho, and Kendall's
tau.
The Spearman rho was not used since there were a
lot of tie scores in the study, which can affect rank
ordering, a part of the process used in obtaining the
Spearman rho.
The Pearson product moment coefficient is obtained
from the mean of the z-score products, i.e., each
individual z-score on one variable is multiplied by
his/her z-score on the other variable.
These paired
z-scores are added and this sum is divided by the number
of pairs.
The Pearson r is an interval scale and is
related to the mean.
The formula for the Pearson r is:

    r = Σ(zx zy) / N
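A minimal sketch of this computation in Python, added for illustration; the paired scores are hypothetical, not data from the study:

    def pearson_r(xs, ys):
        """Pearson r as the mean of the paired z-score products."""
        n = len(xs)

        def z(values):
            mean = sum(values) / n
            sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
            return [(v - mean) / sd for v in values]

        return sum(a * b for a, b in zip(z(xs), z(ys))) / n

    # Hypothetical paired scores on two of the tests:
    print(round(pearson_r([12, 15, 9, 20, 14], [30, 41, 22, 44, 35]), 4))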
Since there were four variables, more than one coefficient
had to be figured.
As mentioned earlier, the LAB scores
could not be used in many of the calculations because
only part of the battery was used and there was no provision made in the scoring manual for this.
In addition Kendall's tau was applied to the scores.
This, like Spearman's rho, is non-parametric and does
not depend upon a normal distribution.
It is related to
the phi coefficient, which requires a nominal scale (a
genuine dichotomy, e.g., male/female) for both
variables.
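The pair-counting behind Kendall's tau can be sketched as follows, added for illustration; the thesis does not state which tie-handling variant was used, so this simple tau-a form, which ignores ties, is an assumption, and the rankings are hypothetical:

    from itertools import combinations

    def kendall_tau_a(xs, ys):
        """Tau-a: (concordant - discordant pairs) / total pairs; ties ignored."""
        concordant = discordant = 0
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
            product = (x1 - x2) * (y1 - y2)
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
        pairs = len(xs) * (len(xs) - 1) / 2
        return (concordant - discordant) / pairs

    print(kendall_tau_a([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))   # 0.6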
In order to translate the linear relationship into
more practical terms the following question was asked:
were the four scores assigning the same child to the
same linguistic category (Non-English Speaking, Limited-English Speaking, Functional-English Speaking, or
Proficient-English Speaking) each time?
To answer the
question, a cross tabulation of the individual converted,
or z-scores, was run on the computer as well as a
scattergram which gave a graphic representation of the
distribution of the scores.
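A minimal sketch of such a crosstabulation and the resulting percentage of agreement, added for illustration; the category assignments below are hypothetical, not the study's data:

    from collections import Counter

    CATEGORIES = ["NES", "LES", "FES", "PES"]

    def crosstab_agreement(test_a, test_b):
        """Tabulate paired placements and the percent placed identically."""
        table = Counter(zip(test_a, test_b))
        agreed = sum(table[(c, c)] for c in CATEGORIES)
        return table, 100.0 * agreed / len(test_a)

    a = ["NES", "NES", "LES", "FES", "PES", "NES"]
    b = ["NES", "LES", "LES", "FES", "NES", "NES"]
    table, percent = crosstab_agreement(a, b)
    print(f"overall agreement: {percent:.1f}%")   # 66.7%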
Chapter IV will discuss the findings of this study.
Statistical correlations were made for each of the tests
and have been analyzed in the following chapter.
CHAPTER IV
FINDINGS OF THE STUDY
One of the problems with this study was that while
dominance and proficiency measures were the goal of these
tests, i.e., the same test or its equivalent was given
in two languages - English and the child's native one - the term dominance itself is ambiguous.
Silverman (1977)
feels that language dominance as a term needs to be well
defined before testing for it.
Thus the use of tests to
measure both proficiency and dominance is thrown into
doubt.
As indicated in the preceding chapter, the statistical techniques applied to the results obtained in this
study were correlation coefficients.
They were used to:
determine if each of the instruments rendered comparable
proficiency scores for the subjects; and ascertain a
difference between the tests that would indicate that
one was more reliable or valid than another.
The Pearson r correlation is shown in Table 1,
using all four tests.
Table 1. The Pearson r correlation

          BINL      LAB      LAS      BSM
BINL     1.0000    .5206    .7874    .6706
LAB       .5206   1.0000    .6754    .5908
LAS       .7874    .6754   1.0000    .7556
BSM       .6706    .5908    .7556   1.0000

N=125
No correlation was over .7874.
According to Ary, Jacobs,
and Razavieh (1972), this is not a particularly high
correlation.
The variance is .62, which shows less than
two-thirds of the variance is common to the tests.
The other correlation used was Kendall's tau (Table
2) which makes no assumption about the distribution of
the cases.
It is used to determine if the two rankings
of the same cases are in the same order.
Table 2. Kendall's tau correlation

          BINL      LAB      LAS      BSM
BINL     1.0000    .3956    .6316    .6177
LAB       .3956   1.0000    .4893    .5364
LAS       .6316    .4893   1.0000    .6915
BSM       .6177    .5364    .6915   1.0000

N=125
Here the highest correlation is .6915 between the LAS and
the BSM.
When the coefficients are squared, however, the
.4781 result shows that less than one half of the variance
is common to the two tests.
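The variance-in-common figures quoted here and in the previous paragraph are simply the squared coefficients, which can be checked directly:

    # Squared correlations (variance in common) quoted in the text:
    print(round(0.7874 ** 2, 2))   # 0.62, the highest Pearson r squared
    print(round(0.6915 ** 2, 3))   # 0.478, the highest Kendall tau squared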
The statistical procedure that showed the relationship most clearly
between the three tests for which converted scores could
be obtained (excluding the LAB) was the crosstabulation
of the individual test's converted scores (see Tables
3, 4, and 5).
The crosstabulations compared pairs of
tests on their placement of children in the four
linguistic categories defined earlier:
NES, LES, FES,
and PES (see Appendix C for cut-off scores for assigning
subjects to these categories).
Since pairs of tests
were compared, it was necessary to run three different
crosstabulations.
In each table, the categorizations
that agreed, i.e., where both tests placed the same child
in the same category, are shown in brackets.
CROSSTABULATION OF PLACEMENT BASED ON INDIVIDUAL TEST'S CONVERTED SCORES

Table 3. Crosstabulation of LAS and BINL scores

                        LAS
BINL       NES    LES    FES    PES   TOTAL
NES       [56]     1      2      1      60
LES        26     [6]     6      1      39
FES         8      3     [6]     1      18
PES         0      2      2     [4]      8
TOTAL      90     12     16      7     125

Table 4. Crosstabulation of LAS and BSM scores

                        LAS
BSM        NES    LES    FES    PES   TOTAL
NES       [86]     7      3      2      98
LES         3     [0]     1      0       4
FES         0      2     [7]     1      10
PES         1      3      5     [4]     13
TOTAL      90     12     16      7     125

Table 5. Crosstabulation of BINL and BSM scores

                        BSM
BINL       NES    LES    FES    PES   TOTAL
NES       [58]     0      1      1      60
LES        29     [3]     2      5      39
FES         9      1     [5]     3      18
PES         2      0      2     [4]      8
TOTAL      98      4     10     13     125
If the crosstabulation of placement for the BINL and
the LAS is examined (Table 3), it is seen that of the
90 children identified as NES by the LAS and the 60
identified by the BINL, the two tests agreed on 56 as
being NES.
There were 26 children that the LAS categorized as
NES that were categorized as LES by the BINL, and 8
children that the BINL considered FES as opposed to NES
for the LAS.
For the LES category, there were 39 identi-
fied by the BINL and 12 by the LAS, with agreement on
only 6 of the children.
The total for the FES category
was closer, 16 for the LAS and 18 for the BINL, but with
agreement on only 6 scores.
In the PES category there
were 7 scores for the LAS and 8 for the BINL with agreement on 4 scores.
There was an overall agreement for
this pair of tests of 57.6% (72 out of 125 scores).
Looking at the LAS and the BSM crosstabulation
(Table 4), it is seen that there is more agreement between
these two tests at the NES category, with the LAS assigning 90 to NES and the BSM assigning 98 with agreement in
86 cases.
However, in the LES column it is evident that
there was little agreement between the 12 LES scores of
the LAS and the 4 LES of the BSM, since the BSM assigned
7 out of 12 LES scores of the LAS to NES.
In the FES
column, the LAS assigned 16 scores to FES and the BSM
only 10 with agreement on 7 of the scores.
In the PES
43
column, LAS placed 7 of the children, while the BSM
placed 13 with agreement on 4 scores, the same as with
the BINL and the LAS.
The overall agreement between
the LAS and the BSM was 77.6% (97 out of 125 scores).
The crosstabulation for the BSM and the BINL
(Table 5) was different in that the BSM assigned more
children to the NES category, 98, than the BINL, 60,
with agreement in 58 of the cases.
There was also a
discrepancy in scores in the LES category, with the BSM
assigning only 4 scores while the BINL assigned 39
scores to the LES category, with agreement on 3 scores.
The BSM
placed 10 scores in the FES category and the BINL put
18 children in that category, with agreement on 5
scores.
In the PES category, the BSM assigned 13
children and the BINL assigned 8, with agreement on 4
children.
The BSM placed more children in the PES
category than did the other two tests.
There was overall
agreement of 56% (70 out of 125 scores).
Looking at the scores another way, the Summary
table (Table 6) allows a comparison of the three tests
in their placement of children in linguistic categories.
Table 6. Summary of LAS, BINL, and BSM category placements

         NES    LES    FES    PES   TOTAL
BINL      60     39     18      8     125
BSM       98      4     10     13     125
LAS       90     12     16      7     125
It appears that, with the exception of the NES category,
there is not very much agreement between the tests in
placing children at linguistic levels.
Another aspect of the crosstabulation is that there
were children placed in opposite categories by the tests.
For instance, one child who was placed at the PES level
by the BINL was put at the NES level by the LAS.
Similarly, two children classified as NES by the BSM
were placed at the PES level by the LAS and vice-versa
for another child.
A like situation occurred with the
BINL and the BSM (Table 5).
It appears that a child may
receive a different linguistic classification depending
upon what test is administered.
Reviewing the tables, it would appear as though the
BINL produced more LES scores and fewer NES scores than
the other two tests.
It is difficult to determine
exactly why this is so.
Does this mean that the BSM and
the LAS tests are "harder"?
Were the results skewed by
the fact that the first grade produced far more NES
children?
One explanation might be different language
acquisition patterns of children at different ages.
Another explanation might be differences in the length
of exposure to the English language, with the first
graders not having had as much exposure as the third
graders.
Another reason for the discrepancies is that the
instruments are measuring different attributes.
What one
test measures in the name of language proficiency is not
what the other measures.
As mentioned above, there are
serious reliability problems with the instruments.
There-
fore it is not surprising to get different results from
the tests even if they were measuring the same attributes.
It was evident that pupil error existed.
The population being tested consisted, among others, of many
children directly from Mexico.
They were unaccustomed
to tape recorders, which figured in the BINL and the LAS
tests.
They were unfamiliar with the testing situation
itself, especially the first graders, and were culturally
reluctant to speak at all, especially in a language which
they did not control well or at all.
Even though the LAS used a pre-recorded tape, the
very first part of the test bothered a number of the
first graders because it dealt with a concept that many
had not mastered in any language - same or different.
Several children broke down and cried when faced with
the task and one child could not continue.
In Chapter V a summary will be presented, conclusions drawn, and some recommendations made.
CHAPTER V
SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
It is evident from this correlational study that
the four tests, BINL, BSM, LAB, LAS, do not test the
same linguistic areas.
Three of them, BSM, LAB, and
LAS tested syntax, while the BINL did not do this
directly.
If a correlation coefficient of .90 is considered
to be high (Ary, Jacobs, and Razavieh, 1972),
and in this study there were coefficients ranging from
.787 to .521 for the Pearson r and from .396 to .692 for
the Kendall's tau, then it would follow that not only
is there a low correlation, but also there is a lack
of agreement in placement.
As discussed in Chapter IV,
between the three tests that were crosstabulated (Tables
3, 4, and 5) there were instances where the LAS and the BSM
placed children in opposite categories with 77.6%
overall agreement for all the scores; and also where
the BINL and the BSM did the same with an overall
agreement of only 56% for all the scores.
The greatest
variance obtained using the Pearson r (Table 1) was
.62 and for the Kendall's tau (Table 2) was .48.
Therefore they do not seem to measure the same thing.
As discussed in Chapter II, there are different
theoretical bases for the tests, and quite possibly
other variables, such as inter-examiner reliability,
may have entered into the results.
A summary of the negative and positive aspects of
the tests based on the literature review and personal
experience follows:
1. BINL
   a. Lacks inter-examiner reliability - the administration
      can vary widely.
   b. There is not enough validity shown - construct
      validity is lacking.
      i.  The scoring values are arbitrarily assigned.
      ii. The technical manual is too ambiguous regarding
          transcription.
   c. It doesn't show the full range of a child's control
      of language structures.

2. LAB
   a. It is a discrete point test.
      i.   Younger children often don't read or write.
      ii.  Not enough justification is given for the items
           on the test.
      iii. It was only normed in New York City with Puerto
           Rican children.

3. BSM
   a. Having to take down dictated responses can be
      difficult.
      i.  It leads to inter-examiner reliability problems.
      ii. It can inhibit a child's production, i.e., if the
          examiner doesn't hear the response the first time.
   b. The theory, although sound, has been challenged by
      Scharf (1972).
   c. Since it is an interval item test, it may not have
      enough items between categories.
   d. There is concern about the content of the story in
      English, i.e., drinking pink ink (children can be
      suggestible).

4. LAS
   a. It tests most linguistic aspects.
      i. It would be better used as a diagnostic test.
   b. The concept of "same or different" on the minimal
      pairs part of the test may not have been acquired by
      younger children.
   c. It is difficult to administer.
      i.   Dictation is hard to record.
      ii.  The juggling of materials is hard on the examiner.
      iii. The pictures in the test booklet are too faint to
           be seen easily.
   d. The tape is intimidating to some children but it does
      deal with the problem of inter-examiner reliability.
   e. It appears to have been developed with more research
      than the BSM, LAB, or BINL tests.
Another problem with the LAS test was the amount of time
necessary for administration.
The BINL and the BSM tests
had cut-off mechanisms for the NES child where the
testing would stop more quickly.
With the LAS, the whole tape had to be played.
This alone would make it
very difficult and time-consuming to use in situations
where large numbers of children are to be tested and
where there are time constraints.
It appears that none of these four tests is fully adequate to fulfill the testing and classification requirements of LAU versus NICHOLS. The better of the four seem to this researcher to be the LAS and the BSM, based on the correlations, the percentage of agreement on placement, and practical considerations. At present, however, it seems as if it may actually be impossible to ascertain exactly how much language a child controls in any but a superficial way.
Recommendations
With the use of tests having questionable validity, there is a real chance that a child might be placed in the wrong program. The cost in wasted human potential is alarming to consider.
However, with the continued influx of children
speaking languages other than English into the school
systems of California, and the mandates of LAU versus
NICHOLS, there must be linguistic classification of these
children for program purposes.
As a stopgap measure,
since they must be placed, a combination of tests plus
teacher observation can be used.
This combination, in
turn, has the limitation of subjectivity of observation
and must be applied with extreme care.
It is imperative that the whole process be improved
since it is children who are being classified and whose
educational success is at stake.
A test is needed that:
1. will show proficiency in English;
2. is relatively easy to administer, especially to large groups of children;
3. is relatively easy to score;
4. is reliable and valid.
Any replication of this study should, among other things, increase control for inter-examiner reliability. The LAB cannot be used successfully except in its entirety and therefore should not be included in the study.
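One way a replication might tighten that control is to quantify examiner agreement directly with a chance-corrected index such as Cohen's kappa, the statistic reported for the BSM in Appendix A. The sketch below is an editorial illustration with hypothetical ratings, not a procedure from this study:

    # Hypothetical placements of the same eight children by two examiners.
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        # Observed agreement: proportion of identical placements.
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Expected chance agreement from each rater's marginal proportions.
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
        return (observed - expected) / (1 - expected)

    examiner_1 = ["NES", "LES", "LES", "FES", "PES", "LES", "FES", "NES"]
    examiner_2 = ["NES", "LES", "FES", "FES", "PES", "LES", "LES", "NES"]
    print(f"kappa = {cohens_kappa(examiner_1, examiner_2):.2f}")

A kappa near zero means the examiners agree no better than chance; by the conventional reading, values above roughly .6 indicate substantial agreement, so the .40 reported for the BSM in Appendix A would be considered low.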
BIBLIOGRAPHY
Aitken, Kenneth G., "Discrete Structure-Point Testing: Problems and Alternatives", TESL Reporter, Vol. 9, No. 4, 1976, pp. 7-9, 18-20.
Anderson, Scarvia B., "Verbal Development in Young Children: Strategies for Research and Measurement", Paper at International Congress of Psychology, August, 1972.
Assessment Instruments in Bilingual Education, Center for Bilingual Education, Northwest Regional Educational Laboratory, California State University, Los Angeles, Ca., 1978, pp. 10-11, 26-27.
Blatchford, Charles H., "A Theoretical Contribution to
ESL Diagnostic Test Construction", Paper at Fifth
Annual TESOL Conference, March, 1971.
Bordie, John G., "Language Tests and Linguistically Different Learners: The Sad State of the Art", Elementary English, 47.5, October, 1970, pp. 814-828.
Briere, Eugene, "Current Trends in Second Language
Testing", Papers on Language Testing 1967-1974, ed.
Palmer and Spolsky, TESOL, Wash. D.C., 1975,
pp. 220-228.
Britton, Augusto, "A Brief Review of Assessment Instruments for the Bilingual Student", CABE Evaluation Task Force, Office of the Los Angeles County Superintendent of Schools, Division of Curriculum and Instruction, Los Angeles, Ca., May, 1975.
Carroll, John B., "Fundamental Considerations in Testing
English Proficiency of Foreign Students", Testing,
Center for Applied Linguistics, Wash. D.C., 1961,
pp. 31-40.
, "Foreign Language Testing: Will the Persistent
Problems Persist", Paper at ATESOL Conference,
June, 1973.
----::=:----.
54
Cartier, Francis A., "Criterion-Referenced Testing of
Language Skills", Papers on Language Testing
1967-1974, ed. Palmer and Spolsky, TESOL, Wash.,
D.C., 1975, pp. 19-24.
Cazden, Courtney B., "Concentrated vs Contrived Encounters: Suggestions for Language Assessment", Urban Review, 8.1, Spring, 1975 (a), pp. 28-34.

———, "Hypercorrection in Test Responses", Theory Into Practice, 14.5, December, 1975 (b), pp. 343-346.

———, and Others, "Language Assessment: Where, What and How", Anthropology and Education Quarterly, 8.2, May, 1977, pp. 83-91.
Cohen, Andrew, "The Sociolinguistic Assessment of
Speaking Skills in a Bilingual Education Program",
Papers on Language Testing 1967-1974, ed. Palmer
and Spolsky, TESOL, Wash., D.C., 1975, pp. 172-183.
Condon, Eliane, "The Cultural Content of Language Testing",
Papers on Language Testing 1967-1974, ed. Palmer and
Spolsky, TESOL, Wash., D.C., 1975, pp. 204-217.
Cowan, Weber, Hoddincott, and Klein, "Mean length of
spoken response as a function of stimulus,
experimenter, and subject", Child Development, 38,
1967, pp. 191-203.
DeAvila, Edward A. and Duncan, Sharon E., LAS Language Arts Supplement, Spanish, Revised Edition, Linguametrics, Corte Madera, Ca., 1977.

———, Cervantes, and Duncan, "Bilingual Programs Exit Criteria", CABE Research Journal, Vol. 1, No. 2, September, 1978, pp. 21-39.
De George, George P., "Guidelines for Selecting Tests
for Use in Bilingual/Bicultural Education Programs",
Paper at MATSOL Spring Conference, 1975.
Doyle, Vincent, "A Critique of the Northwest Regional
Laboratory's Review of the MAT-SEA-CAL Oral
Proficiency Tests", Paper, October, 1976.
Ehrlich, Alan, Tests in Spanish and Other Languages and
Non-Verbal Tests for Children in Bilingual Programs:
An Annotated BEARD Bibliography, RIE, August, 1973.
Fahey, Virginia K. and Others, "Heritability in Syntactic
Development: A Critique of Munsinger and Douglass",
Child Development, 49.1, March, 1978, pp. 253-257.
Fitch, Michael J., "Verbal and Performance Test Scores in Bilingual Children", Thesis, Ed.D., University of Northern Colorado, 1966.
Pletcher, B., Locks, N., Reynolds, D., and Sisson, B., A Guide to Assessment Instruments For Limited English Students, Santillana Publishing Company, New York, N.Y., 1978.
Foster, R., Giddon, J., and Stark, J., Manual for the
Assessment of Children's Language Comprehension,
Consulting Psychologists Press, Palo Alto, Ca., 1972.
Garcia-Zamor, M., and Birdsong, D., Testing in ESL: An
Annotated Bibliography, Cal-Eric/CLL Series on
Languages and Linguistics, #40, January, 1977.
Gil, Sylvia, "BSM Assesses Linguistic Proficiency in
English and Spanish", Paper, 1976.
Gonzalez, Josue and Fernandez, Ricardo, "Toward the Development of Minimal Specifications for LAU-Related Language Assessments", Bilingual Resources, National Dissemination and Assessment Center, Los Angeles, Ca., Vol. 2, No. 1, Fall, 1978, pp. 2-7.
Groot, Peter, "Validation of Language Tests", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 137-143.
Helmer, S., "Demonstration of Assessment of Language Dominance of Spanish Speaking Bilingual Children", Occasional Papers on Linguistics, Paper at International Conference on Frontiers in Language Proficiency and Dominance Testing, April, 1977.
Hillman, R. E., A Correlational Study of Selected Vocal-Verbal Behaviors and the Test of ESL (TOEFL), Thesis, Ph.D., Pennsylvania State University, 1972.
Hinofotis, F.A.B., An Investigation of the Concurrent Validity of Cloze Testing as a Measure of Overall Proficiency in ESL, Thesis, Ph.D., Southern Illinois University, 1977.
Jony, Jon G., "Can't Language Testing Interface with
Language Acquisition?", Paper at TESOL Conference,
1975.
Lado, R., Language Testing: The Construction and Use
of Foreign Language Tests. A Teacher's Book, RIE,
1970.
Language Assessment Battery, Test Review in Bilingual
Resources, National Dissemination Center, Los Angeles,
Ca., Vol. 2, No. 2, Winter, 1979, pp. 40-41.
Language Assessment Scales, Publisher's Test Service,
CTB/McGraw-Hill Publishers, Monterey, Ca., 1978, p. 2.
Levine, J., "An Outline Proposal for Testing Communicative
Competence", English Language Teaching Journal, 30.2
January, 1976, pp. 128-134.
Luft, Max and Others, "Development of a Test Instrument to
Determine Language Dominance of Primary Students:
Test of Language Dominance (TOLD)", Paper at Annual
Meeting of American Educational Research Association,
April, 1977.
Matluck, J., and Mace-Matluck, B., "The Multilingual Test Development Project: Oral Language Assessment in a Multicultural Community", Paper at National TESOL Conference, March, 1975.
McDavid, R. I., and William, A., "Communicative Barriers
to the Culturally Deprived", USOE Report, University
of Chicago, Chicago, Ill., 1966.
Oller, John E., "Discrete-point tests vs tests of integrative skills", Focus on the Learner, ed. Oller and Richards, Newbury House, Rowley, Mass., 1973.

———, "How Important is Language Proficiency to I.Q. and Other Educational Tests?", Occasional Papers on Linguistics #1, Paper at International Conference on Frontiers in Language Proficiency and Dominance Testing, April, 1977.
Phillips, Judith, "The Effects of the Examiner and the Testing Situation Upon the Performance of Culturally Deprived Children. Phase I - Intelligence and language ability test scores as a function of the race of the examiner. Final Report", October, 1966.
Politzer, R. and McKay, M., "A Pilot Study Concerning the Development of a Spanish/English Oral Proficiency Test", Research Development Memorandum #120, 1974.
Puthoff, F. T., The Development of Norms for Bilingual First Grade and Third Grade Children's Responses to the Hand Test and Peabody Picture Vocabulary Test, Thesis, Ed.D., University of Oklahoma, 1972.
Randle, Janice A. W., A Bilingual Oral Language Test for
Mexican-American Children, Thesis, Ph.D., University
of Texas at Austin, 1975.
Robinson, Gail, "Linguistic Ability:
Some Myths and some
Evidence", Paper, Australia, April, 1975.
Robinson, Pete, "Basic Factors in the Choice, Composition and Adaptation of Second Language Tests", Paper at TESOL Conference, March, 1969.

———, "The Composition, Adaptation, and Choice of Second Language Tests", English Language Teaching, 25.1, October, 1970, pp. 60-68.
Rosansky, Ellen J., "Methods and Morphemes in Second
Language Acquisition Research", Language Learning,
26.2, December, 1976, pp. 409-425.
Rose, S. and Others, "The Development of a Measure to Evaluate Language Communication Skills of Young Children", Paper at Annual Meeting of American Educational Research Association, February, 1973.

Scharf, Donald, "Some Relationships between Measures of Early Language Development", Journal of Speech and Hearing Disorders, 37.1, February, 1972, pp. 64-74.
Silverman, H. and Russell, R., "The Relationships Among Three Measures of Bilingualism and Their Relationship to Achievement Test Scores", Paper at Annual Meeting of American Educational Research Association, April, 1977.

———, Noa, J., and Russell, R., Oral Language Tests for Bilingual Students: An Evaluation of Language Dominance and Proficiency Instruments, Northwest Regional Educational Laboratory, Portland, Ore., July, 1976.
Spolsky, Bernard, "Language Testing - The Problems of
Validation", Papers on Language Testing 1967-1974,
ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975,
pp. 146-153.
———, Murphy, P., Holm, W., Ferrel, A., "Three Functional Tests of Oral Proficiency", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 75-87.
Sponseller, D. B., "Measuring Language Comprehension in Young Children: Does the Structure of the Testing Condition Affect Results", Paper at Annual Meeting of American Educational Research Association, April, 1977.
Swain, M., "Evaluation of Bilingual Education Programs:
Some Problems and Some Solutions", Paper at
Conference of Comparative International Education
Society, February, 1976.
Toronto Board of Education, "Testing Some English Language Skills: Rationale, Development, and Description", Paper, March, 1969.
Upshur, J., "Objective Evaluation of Oral Proficiency in the ESOL Classroom", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 52-65.
Valette, Rebecca, Modern Language Testing, Second
Edition, Harcourt, Brace, Jovanovich Inc., New York,
N.Y., 1977.
Wong-Fillmore, Lily, The Second Time Around: Cognitive and Social Strategies in Second Language Acquisition, Thesis, Ph.D., Stanford University, 1976, Chapters 5, 6.
Wright, S.M., The Effect of Speaker Visibility on the Listening Comprehension Test Scores of Intermediate Level Students of ESL, Thesis, Ph.D., Georgetown University, 1971.
APPENDICES
APPENDIX A
DESCRIPTION OF TESTS FROM A GUIDE TO ASSESSMENT INSTRUMENTS
BY PLETCHER, LOCKS, REYNOLDS, SISSON
(1978)
1977 - BASIC INVENTORY OF NATURAL LANGUAGE (BINL)

Descriptive Information:

Purpose: To assess a student's language dominance and proficiency in Spanish and English.

Score Interpretation: This instrument yields raw scores in English or Spanish which represent a student's fluency and level of language complexity. Models are provided for the development of local norms.

Grade Range: K-12 (reviewed for grades K-6)

Target Ethnic Group: General (reviewed for Cuban, Mexican-American, and Puerto Rican; see also Comprehensive Index for other language versions under development).

Administration Time: From 10-15 minutes; not timed.

Administrator Requirements: The administrator should be proficient in the languages in which the instrument is administered and should have a knowledge of simple grammar. CHECpoint Systems also sponsors a one-day training workshop for test administrators.
Author: Charles H. Herbert

Source: CHECpoint Systems, 1558 N. Waterman Avenue, Suite C, San Bernardino, California 92404

Cost: A kit with 80 talk tiles, 40 story starter pictures, 1 spirit duplicating masters book, 1 instruction manual, 400 oral score sheets, 100 profile sheets, and 2 class profile cards costs $85.00. A tape recorder is also required.

This criterion-referenced instrument makes use of story sequence pictures and talk tiles to elicit natural speech samples. Instructions are given orally in English or Spanish. Students respond by telling a story. Taped answers are hand or machine scored. Individual administration is required.
Technical Information:

Although sample normative results have been reported for 300-400 students in the lower elementary grades in California, reviewers felt that this instrument is basically a criterion-referenced test. Reliability measures are not yet available, and the only validity measure reported indicated that the language complexity subscores tended to rise from grades K-2 in both first- and second-language growth. Reviewers suggested that this instrument be used as a diagnostic rather than an achievement test until more meaningful norms are established. The administrator's manual contains much information, although it provides no specific oral cues for the administrator in English or in Spanish. The hand scoring procedures are fully described, but reviewers found them to be quite complex.
Cultural and Linguistic Information:
Reviewers found the item content and vocabulary to be
appropriate, with minor revisions, for use with Hispanic
students in grades K-6.
Reviewers commented that the
talk tiles contained no stimulus objects which reflected
Hispanic culture, and that the sequence stories had no
stimulus words in Spanish.
The illustrations, although
quite interesting, seemed to depict minority cultural
groups as country folk while showing "Anglos" in urban
settings.
Nevertheless, the format and procedures were
found to be highly acceptable for use with Hispanic
children.
1975 - BILINGUAL SYNTAX MEASURE (BSM) MEDIDA DE SINTAXIS BILINGUE
Descriptive Information:

Purpose: To measure syntactic proficiency by eliciting natural speech samples in Spanish and English.

Score Interpretation: This instrument yields hierarchical scores which place a child in 1 of 5 proficiency levels in English and Spanish. Instructional suggestions are provided for each proficiency level.

Grade Range: K-2

Target Ethnic Group: General (reviewed for Cuban, Mexican-American, and Puerto Rican; see also Italian and Tagalog entries)

Administration Time: Approximately 15 minutes; not timed.

Administrator Requirements: The administrator should be proficient in Spanish if the Hierarchical Scoring is used, and proficient in Spanish and English if the Syntax Acquisition Method is used.

Authors: Marina K. Burt, Heidi C. Dulay, and Eduardo Hernandez Chavez

Source: The Psychological Corporation, 757 Third Avenue, New York, New York 10017.

Cost: A Spanish-English kit containing a Picture Booklet, 2 manuals, 70 Child Response Booklets, 2 Class Record Sheets, and a Technical Handbook costs $50.00.
Similar, but not parallel, Spanish and English forms of this instrument measure a student's command of basic English grammatical structures regardless of his pronunciation or general knowledge. If both English and Spanish versions are administered, the test may serve as a language dominance measure. The administrator asks the student 25 questions relating to 7 pictorial stimuli. Students respond orally; responses are recorded by the examiner and are hand scored. Individual administration is required.
Technical Information:

The authors have collected data on the use of the English version of this instrument with 1371 students, and on the use of the Spanish version with 1146 students, all of whom were in grades K-2 in 4 geographic regions of the United States. These data are illustrative and are not norms. The authors provided reliability data based on 150 Spanish-speaking students. Reviewers felt that the Kappa coefficients of .40 were low and questioned the use of factor analysis to demonstrate the validity of the BSM. The manual states that English proficiency scores on this instrument improve as a function of the students' time in the United States. However, the tabular data presented show some inversions as a function of time in the United States. There does seem to be a clear difference in performance between those students here 3 years or more and those who have been here a shorter time, but reviewers felt that the BSM could be used to measure gross differences in language proficiency. They also noted that the administrator's manual was somewhat repetitive and commented that administrators should be required to attend training sessions.
Cultural and Linguistic Information:
Hispanic:
Reviewers found the directions, item content, vocabulary,
format, and procedures to be culturally and linguistically
appropriate for Cuban, Mexican-American, and Puerto Rican
students in grades K-2.
They felt that the illustrations
were excellent.
1975 - LANGUAGE ASSESSMENT BATTERY (LAB) Levels I-III

Descriptive Information:

Purpose: To assess a student's reading, listening comprehension, and speaking skills in English and Spanish in order to determine language dominance.

Score Interpretation: This instrument yields stanine scores and percentile ranks by grade.

Grade Range: K-12. Level I, grades K-2; Level II, grades 3-6; Level III, grades 7-12 (reviewed for grades K-6).

Target Ethnic Group: Hispanic (reviewed for Puerto Rican)

Administration Time: Level I, from 5-10 minutes; Level II, approximately 41 minutes; timed.

Administrator Requirements: The administrator should be proficient in the language in which the test is administered and should be thoroughly familiar with the examiner's manual.

Author: Office of Educational Evaluation of the Board of Education of the City of New York.

Source: Houghton Mifflin Company, Test Department, P.O. Box 1970, Iowa City, Iowa 52240.

Cost: Each test booklet and examiner's manual costs $.34 per copy for Level I and $.49 for Level II. The technical report costs $3.25.
This norm-referenced instrument is composed of parallel English and Spanish versions. Level I contains 40 items; Level II contains 92 items. The instrument is first administered in English. The administrator then uses the Spanish version to test students who scored below a designated cutoff point. Students respond orally, by pointing, by writing in the test booklet, and by marking answer sheets (on Level II only). Individual administration is required for Level I and for part of Level II.
Technical Information:
The developers have established separate norms for the
English and Spanish versions of the LAB.
The English
norming sample consisted of 12,532 monolingual students
and the Spanish sample 6,271 Spanish-speaking students,
all enrolled in grades K-12 in New York City.
Reviewers
felt the norms were well developed and well reported.
The developers report reliability coefficients and
standard error of measurement for all levels and subtests
of both versions of this instrument.
After studying the
learning objectives provided for each subtest, Guide
reviewers stated that the instrument had good face
validity.
Cultural and Linguistic Information:

Reviewers found the vocabulary, format, and procedures for Level I to be culturally and linguistically appropriate for Puerto Rican students in grades K-2.
However,
they felt that the item content in the reading section
was inappropriate for students in grades K-2 because
many items required abstract reasoning as well as reading
ability.
Reviewers commented that the speaking section
did not adequately test speaking since it only required
the child to produce one-word responses.
Reviewers felt that because of the computerized answer sheets, the timed nature of the reading tests, and the fine auditory discrimination required for the listening tests, Level II might not be appropriate for Puerto Rican students recently arrived in the United States.
1975 - LANGUAGE ASSESSMENT SCALES (LAS) English Version - Level I

Descriptive Information:

Purpose: To assess a student's listening and speaking skills in English.

Score Interpretation: This instrument yields a total converted score which is tied to 1 of 5 proficiency levels.

Grade Range: K-5

Target Ethnic Group: General (reviewed for Cuban, Mexican-American, and Puerto Rican)

Administration Time: Approximately 20 minutes; the prerecorded tape is timed.

Administrator Requirements: The administrator should have native proficiency in English.

Authors: Edward A. De Avila and Sharon E. Duncan

Source: Linguametrics Group, P.O. Box 454, Corte Madera, California 94925

Cost: This instrument is sold with a Spanish-language version, also reviewed in this Guide. A LAS examiner's kit which includes administration and scoring instructions, pictorial stimuli, a Spanish-English audio cassette, and 100 English and 100 Spanish score sheets costs $48.00. Additional score sheets are available at $5.00 per 100.
These diagnostic instruments contain 100 items designed to assess phonemic production and discrimination, lexical production, sentence comprehension, oral production skills, and a student's ability to use language to attain specific goals. Instructions are given orally, and item stimuli are either taped or pictured in the test booklet. Students respond orally or by pointing. Answers are hand scored. Individual administration is required. A Language Arts Supplement containing follow-up learning activities and language games related to each test item is available from the publisher.
Technical Information:

The authors standardized this instrument with 308 5- to 12-year-old English-speaking students and report moderately high inter-rater reliability coefficients ranging from .84 to .94. Since this instrument measures language proficiency, scores on this instrument show a low correlation with age. The reviewers gave excellent ratings to the administrator's manual because of the clarity of instructions and the ease with which directions could be delivered during a testing session.
Cultural and Linguistic Information:
Reviewers felt that the directions, item content, format,
and procedures were highly appropriate for use with
English-speaking Cuban, Mexican-American, and Puerto Rican
students in grades K-5.
They felt that some illustrations
were too crowded and that others were too simplistic for
fifth-grade students.
APPENDIX B
EXAMINERS' PERSONAL REACTIONS TO AND COMMENTS
UPON THE TESTS AND TESTING PROCEDURE:
Examiner I:
(She is a bilingual language arts
specialist.)
Her comments:
"In an overall evaluation all four
inclusive instruments were moderately valuable in
structure and content.
The BINL was easily administered and afforded each individual pupil an opportunity to respond according to their particular capabilities and experiences. The pictures were motivating and useful.

The LAB was the least complicated and required little motivation or stimulus. It was explicit and concise.
On the contrary, the LAS, I felt, was too
general, covered too many areas, and was exhaustive
to both examiner and student.
It created a high
frustration level for most of the participants.
The students became very hesitant and inhibited.
Also the test was too long for one sitting.
The BSM was easy to administer and the content
was more motivating than any of the other instruments
in comparison.
The material was relevant, inspiring
and well presented.
Of the four tests I felt this to be the most valuable in all aspects."
- Elizabeth Najarian
Examiner II:
(She is a bilingual language arts
specialist with the California Bilingual/
Crosscultural Specialist Credential.)
Her comments: "BINL: this test was easy to give to NES pupils - but if they couldn't generate any English responses, they might have felt incompetent and/or mystified as to why they had to take this kind of test in the first place.
Those pupils who could speak English to varying degrees were able to generate English responses, but frequently all that they said were repetitions of '(a/the) boy, girl' or, at a higher level, 'I see a ___,' over and over again, using different nouns to fill in the blank.
Fluent English speakers, of course, could
usually find things to describe with ease.
(In
other testing, the BINL, given in both English and
Spanish, was quite helpful in determining the
language dominance of several puzzling cases of
seemingly 'equally bilingual' pupils.)
As for the testing conditions and scoring, the BINL test was difficult to record in regular classrooms, as well as in a small room where a few other pupils were present (and served as distractors).
The best and much less
time-consuming testing was done in a small room with
only the tester and the pupil present.
Recording
the test was time-consuming in the cases of poor
and good English speakers alike; selected responses
often had to be replayed again and again to hear
the specific words which the pupils had utilized in
their responses.
Scoring procedures were technical
but gradually I felt more at ease in determining
which words to count, etc.
Most of the pupils
enjoyed hearing themselves on a tape recording.
BSM:
This test had colorful pictures which attracted
the attention of many pupils.
The test design showed
that someone tried to be more humane than usual
towards the test-taker; since all responses were
not recorded, just selected ones, the pupils got the
feeling that the tester was interested in their
responses, not just in 'recording the answers' to
their test.
Unfortunately, if a pupil was truly NES, he/she
couldn't benefit much from the test since he/she
had to generate answers related to specific 'given'
vocabulary.
At least NES pupils could stop taking
the test after a very few questions had been posed.
More fluent speakers of ESL had more success with this test at times - and most pupils liked this test the best - or second best (vs the excitement of being tape-recorded for the BINL). This test does measure a student's ability to give specific responses and sentence patterns - but limited ESL speakers might not be able to demonstrate their true ability to speak (understand) English based solely on this test. Recording students' responses was not much of a problem with this test.
LAS: This test has its good points which will help to diagnose and prescribe remediation for pupils' needs (i.e., phoneme recognition and pronunciation; vocabulary recognition; generating vocabulary - naming items; and comprehension of detailed phrases). It would be an excellent test to give to a FES or PES level language speaker. But for NES and LES pupils, this is a difficult, time-consuming test which, to get any reasonably 'true' assessment, must be done in an isolated area, i.e., the tester and the pupil need to be alone together in a quiet room where there is no distraction from people or from noise - hardly what one might achieve in normal classrooms or even small group testing area environments. The test is thorough, ending with the pupils having to listen to a rather strange story and, while looking at somewhat related pictures, retell the story . . . an impossibility for all the NES and many LES pupils.
The testing procedure is awkward, switching
from a tape to a booklet to a tape to the booklet
with spiral binding, with pages that have to be
quickly flipped, then turned over, during a taped
portion of the test . . . at the same time during
which the tester is to be marking down the pupil's
responses . . . it's a mess!
And the recording
procedure is also misleading should a pupil see
his/her score sheet:
only incorrect items are
marked, thus a pupil may get positive feedback from
the incorrect answers . . . !
Ugh!
Don't choose
this test for any large scale testing to determine
pupil dominance.
LAB:
This test, especially LAB I, was easy to give and seemed to have some items which many pupils could answer, be they NES, LES, or FES.
The vocab-
ulary and test content was much more 'school-related'
and closer to the vocabulary utilized with beginning
ESL pupils, than any of the other tests which were
used in this study.
This possibly gave the pupils
more confidence - they certainly seemed to do better
on this test than on the others.
The testing procedure and the order of the test items was a bit awkward at times (the procedures change after 3-5 questions per page) but I liked the fact that all responses were marked either 'yes' or 'no', so that any response was counted (as was no response).
The computer scoring sheet had the test
questions numbered in a way which was confusing to
me at times - but this could easily be remedied.
I'd choose the LAB for determining language dominance
if it were paired with a BINL-like test of a pupil's
ability to freely generate phrases in English and
in the other language (Spanish)."
- B. Rheingold Gerlicke
Examiner III:
(She is a student in bilingual education
at California State University, Northridge
and recommended by a member of my
committee.)
She comments:
"When I first began with each child
I first spoke to them about some activity or other
unrelated to the test.
up to me.
This helped the child warm
I also would explain that it wasn't really
a test in the sense that he would pass or fail.
That it would just show where we, as teachers, would
77
find out where we could help them a little more.
This helped ease their curiosity or any fear they
might have in failing.
Because of this, I felt
that each child did his best.
In administering the four tests all at once
I was able to see which I felt was more effective
in determining the fluency of a child in the English
language.
I was also able to see the advantages and disadvantages of each.
The BSM test was the one I
felt less comfortable with because I felt the
children were uncomfortable with it.
They enjoy the
colorful pictures but it is very limited and does
not measure the child's fluency very accurately.
A child may be very fluent but make grammatical
errors.
Also each question asked for a specific
answer.
The children were more interested in what
I was writing.
The LAB was a short, quick test but also very limited. Many children could not answer the questions. If they had been rephrased, I'm sure there would have been more of a response from the child. I really felt the most comfortable with the BINL. I could feel the children more at ease.
I think they felt this way because I wasn't writing
away.
They soon forgot that we were using a tape recorder.
The tape recorder made a big difference
to me as the administrator of the test.
I felt like
I was having more of a conversation than a test with
each child.
I think the children felt this also.
One big advantage to this test is that there are no
limitations:
the child uses the words he knows, and
can also use his imagination if he wants to tell a
story.
For fluency I feel this is the best test.
But having worked in a classroom, I found I liked
the LAS very much.
I feel this isn't an appropriate
test for determining fluency, but more of a test on
phonics, discrimination, auditory and memory skills
or abilities.
I would use this as a teaching aid.
Each item of each section is numbered so the teacher
can give supplemental work wherever the child may
need it.
The only disadvantage I found in using
this test was that frequently the child could not
understand the voice on the tape.
In giving these tests, I found the BINL test and the LAS test to be the most effective.
I would use both together;
one to measure fluency and the other to find out in
exactly what areas the child needs help, and use the
worksheets as follow-up or teaching aids.
I also
feel I should add that I strongly feel that the
person administering the test makes a big difference
in the outcome of each test."
- Elena Romero
APPENDIX C
Inasmuch as each of these tests was designed independently, and the scoring systems were not necessarily intended to assign students to the four proficiency categories established by the Los Angeles Unified School District under the Lau Plan, i.e., Non-English Speaking (NES), Limited-English Speaking (LES), Functional-English Speaking (FES), and Proficient-English Speaking (PES), it was necessary for the purposes of this study to interpret the raw scores according to these categories. Even though there were cut-off points established by the test designers which logically divided the proficiency scores into 4 or 5 points along a continuum, the test designers did not, in all instances, interpret these categories as NES, LES, FES, or PES. We have taken the liberty of doing so for purposes of comparison.
BINL SCORE RANGES

Grades    NES     LES         FES          PES
K-2       0-24    24.1-52     52.1-78      78.1-200
3-6       0-24    24.1-78     78.1-101     101.1-200
7-8       0-24    24.1-78     78.1-101     101.1-200
9-12      0-24    24.1-101    101.1-130    130.1-200
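The interpretation step described at the head of this appendix amounts to looking a raw score up against the grade-dependent cut-offs tabled above. As a sketch (the function and its name are this editor's illustration, not part of the study's materials), the BINL mapping could be written:

    def binl_category(score, grade):
        # Upper bounds of the NES, LES, and FES ranges per grade band;
        # any score above the FES bound is PES (maximum raw score: 200).
        cutoffs = {
            "K-2":  (24, 52, 78),
            "3-6":  (24, 78, 101),
            "7-8":  (24, 78, 101),
            "9-12": (24, 101, 130),
        }
        nes, les, fes = cutoffs[grade]
        if score <= nes:
            return "NES"
        if score <= les:
            return "LES"
        if score <= fes:
            return "FES"
        return "PES"

    print(binl_category(60, "K-2"))   # FES: 52.1-78 for grades K-2
    print(binl_category(60, "3-6"))   # LES: 24.1-78 for grades 3-6

Note that the same raw score can fall in different categories at different grade levels, which is the point of the grade-specific ranges.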
BSM SCORING RANGE - Syntax Acquisition Index (SAI)

Level 1    Monolingual Spanish    No response
Level 2    Spanish Dominant       Responds in Spanish
Level 3    Survival level         SAI: 46-84
Level 4    Intermediate           SAI: 85-94
Level 5    Proficient             SAI: 95-100
LAS SCORING RANGE

Level 1    Minimal Production         54 and below
Level 2    Fragmented Production      55-64
Level 3    Labored Production         65-74
Level 4    Near Perfect Production    75-84
Level 5    Perfect Production         85-100
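The BSM and LAS ranges above can be treated the same way: each numeric scale is an ordered list of upper bounds, and a score is classified by the first bound it does not exceed. A brief sketch, again with editor-chosen helper names (BSM levels 1 and 2 are assigned qualitatively rather than from the SAI score):

    import bisect

    BSM_SAI_BOUNDS = [84, 94, 100]       # upper bounds of levels 3-5
    LAS_BOUNDS = [54, 64, 74, 84, 100]   # upper bounds of levels 1-5

    def bsm_level(sai):
        # Valid for SAI scores of 46-100 (levels 3-5).
        return 3 + bisect.bisect_left(BSM_SAI_BOUNDS, sai)

    def las_level(score):
        # LAS total converted scores run from 0 to 100.
        return 1 + bisect.bisect_left(LAS_BOUNDS, score)

    print(bsm_level(90))    # 4: Intermediate, SAI 85-94
    print(las_level(60))    # 2: Fragmented Production, 55-64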