CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

A COMPARISON OF FOUR ORAL LANGUAGE TESTS

A thesis submitted in partial satisfaction of the requirements for the degree of Master of Arts in Elementary Education, Bilingual Bicultural Education

by Susan Colman Acosta

January, 1981

The Thesis of Susan Colman Acosta is approved:

Dr. Augusto Britton
Dr. Ed Labinowicz
Dr. C. Ray Graham

California State University, Northridge

ACKNOWLEDGEMENTS

This has been a long and hard process which has been smoothed by the help and support of many people. First of all, my family has endured much and I thank them. Next, it would have been impossible without the help of my Committee Chairman, Dr. C. Ray Graham. Many others contributed: Ethel Cullom, who loaned me research material; Dr. Augusto Britton and Dr. Ed Labinowicz of my Committee; and Dr. Robert Effler, Principal of Miramonte Elementary School, and Ms. Irene Curtis of Area 2, LAUSD, who gave me permission to use the children of Miramonte as the subjects in this study.

The study could not have taken place without the consent and cooperation of the following teachers at Miramonte: First grade - M. Merino, P. Reding, M. Block, J. Miller, J. Lawrence, B. Earhardt, S. Sutton, C. Korten; Third grade - T. Dooley, D. Martinez, J. Mickelberry, P. York, K. Bentson, G. Smoot, S. Orange, and T. Lopez. A special thanks goes to my three examiners who helped me make it happen: Elena Romero, Barbara Gerlicke, and Elizabeth Najarian.

Lastly, I must also express my appreciation for the loving encouragement over the years that my parents, Edward and Mary Colman, have expressed. I only wish my mother were still alive to enjoy the success of her daughter.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
Chapter
  I. INTRODUCTION
       Background
       Statement of the Problem
       Definition of Terms
  II. REVIEW OF THE LITERATURE
       Purpose of Tests
       What Is Proficiency
       How Is Proficiency To Be Measured
       The Four Tests
       Practicality
  III. DESIGN AND PROCEDURE OF THE STUDY
       Tests
       Subjects
       Design
       Procedure
       Statistical Analysis
  IV. FINDINGS OF THE STUDY
  V. SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
       Recommendations
BIBLIOGRAPHY
APPENDICES

ABSTRACT

A COMPARISON OF FOUR ORAL LANGUAGE TESTS

by Susan Colman Acosta

Master of Arts in Education

A study was made comparing four oral language tests designated by the State of California for use in classifying the English language fluency of children whose home language is not English. The four tests were: Basic Inventory of Natural Language (BINL), Bilingual Syntax Measure (BSM), Language Assessment Battery (LAB), and Language Assessment Scales (LAS). None of the tests was fully adequate to fulfill the testing and classification requirements of LAU versus NICHOLS. The BSM and the LAS appeared to be the better of the four tests based on the correlations made, the percentage of agreement on placement, and practical considerations. More research into the evaluation of child language acquisition is indicated.

CHAPTER I

INTRODUCTION

Since the advent of Spanish/English bilingual education programs in Dade County, Florida, in the early 1960's, there has been a growing need for instruments to measure oral language proficiency in a variety of languages. As a result of the LAU versus NICHOLS Supreme Court decision in 1974, state and federal laws require that students be instructed in a language they can understand.
Moreover, national policy as established in the Bilingual Education Act of 1974 calls for a national assessment to identify limited- and non-English-speaking children for purposes of carrying out bilingual education programs in the United States. There is a consequent need to be able to assess students' language proficiency/dominance in order to place them in appropriate programs. In response to this need, there has been a proliferation of tests for measuring every aspect of bilingual children's performance, especially in oral language. A review of bibliographies of current assessment instruments for bilingual programs reveals more than 100 tests for the measurement of language proficiency/dominance alone, either published or in development (Center for Bilingual Education, 1978, and Fletcher, Locks, Reynolds and Sisson, 1978).

Background

In 1978, the State of California Department of Education designated four English language proficiency tests to be used in the assessment process required by LAU versus NICHOLS. The four tests are:

Basic Inventory of Natural Language (BINL) 1977
Bilingual Syntax Measure (BSM) 1975
Language Assessment Battery (LAB) 1976
Language Assessment Scales (LAS) 1975, 1977

Each of these tests has been examined and discussed individually by one or more of these researchers: Britton (1975), Center for Bilingual Education (1978), DeAvila and Duncan (1977), Gil (1976), Helmer (1977), Politzer and McKay (1974), Fletcher, Locks, Reynolds and Sisson (1978), and Randle (1975). But there is a need to compare all four to determine if they measure the same skills. School districts in California are currently faced with the selection of one of these four tests to satisfy federal and state requirements. If they do not choose one of the four, they must justify the choice of a different test. There is a need for more information based upon a comparison of all four tests to help in the selection process.

Statement of the Problem

This study addresses itself to this need. Specifically, it seeks to answer the following questions: Does each of the instruments render comparable English proficiency scores for children learning English as a Second Language? Is there a difference between them that would make one test more reliable or valid than another?

This thesis compares, through correlational statistical analysis, all four tests, using a sample of primary grade children to whom all four tests were administered. The resulting information has potential value to those faced with the task of choosing one of these tests. By considering the specific requirements of a particular district or school, the most appropriate test can be chosen.

Definition of Terms

Proficiency is a complex concept and has not been well defined in the literature. In the broadest sense, proficiency refers to the ability to use language in a multitude of contexts, both productively and receptively, in oral as well as written skills. In this thesis, the use of the term is limited to oral proficiency, i.e., the individual's ability to understand spoken language and to speak fluently. In practice, proficiency has been divided into the following categories for state and federal programs in California:

Non-English Speaker (NES)
Limited-English Speaker (LES)
Functional-English Speaker (FES)
Proficient-English Speaker (PES)

Dominance commonly refers to an individual's preferred language, or to relative proficiency in two or more languages.
In the present discussion, dominance will refer to the comparison of speaking and listening skills in two or more languages. The exact definitions of these terms, as will be seen in a later discussion, vary tremendously and are determined by the individual instrument under consideration.

Validity refers to the extent to which a test measures what it purports to measure. There are several kinds of validity, among which are:

Face validity, or how the test appears on the surface to the examiner and subject.

Content validity, or how well the test covers the subject area being tested.

Construct validity, or the extent to which a test measures a theoretical construct or trait.

Concurrent or criterion-related validity, which has to do with the extent to which a subject's performance on the test correlates with some external criterion, such as his/her observed performance on another test which purports to measure the same traits.

Reliability pertains to the dependability of the scores which a test yields; in other words, how stable and consistent the scores realized from the test are.

Practicality refers to such varied aspects as the skill and time required to administer and score the test, the cost of administering and scoring it, and the need for special equipment.

CHAPTER II

REVIEW OF THE LITERATURE

Purpose of Tests

One of the first things that must be decided in choosing a test is what is to be tested and why. There have been many attempts to develop assessment instruments designed to measure various aspects of language acquisition in children. Robinson (1970) and Aitken (1975) have identified several purposes for such tests:

1. survey to gather information about second language competence and evaluation of whole programs;
2. research into effectiveness of different teaching methods, manuals, and audio-visual aids;
3. research into psychology dealing with an individual;
4. research into sociology dealing with groups;
5. evaluation of particular progress concerning
   a. aptitude
   b. diagnosis
   c. prediction
   d. achievement
   e. classification
   f. proficiency.

Bilingual programs implemented under state and federal laws might legitimately need tests for a number of the purposes identified above, but the primary need is in the area of assessment of proficiency or dominance for purposes of program placement. This thesis focuses on specific instruments designed to assess English proficiency in children.

What Is Proficiency

Structural linguists have viewed language as something that can be analyzed and divided into sub-categories such as semantics, syntax, and phonology. Macnamara (1967) developed a matrix for the language arts utilizing this type of approach (Fig. 1):

            Listening    Speaking     Reading      Writing
            semantics    semantics    semantics    semantics
            syntax       syntax       syntax       syntax
            lexicon      lexicon      lexicon      lexicon
            phonemics    phonemics    graphemics   graphemics

Figure 1. Language Arts Matrix

This view of language has led many test makers to design tests which arrive at a subject's overall proficiency in a given language by determining his proficiency in each of the sub-areas (i.e., phonology, syntax, etc.) and then weighting each sub-area to come up with a numerical proficiency rating. Within each sub-area, skills are often measured by isolating a particular feature (e.g., formation of the past tense, pronunciation of [r], etc.) within each test item. This sort of test has been called a discrete point test.
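To make the weighting scheme concrete, here is a minimal sketch of how a discrete point battery might combine sub-area scores into a single rating. The sub-areas, weights, and item counts are hypothetical illustrations, not figures taken from any of the four tests discussed later.

```python
# Hypothetical discrete point scoring: each sub-area is tested
# separately, scored as a proportion correct, and weighted into
# a single numerical proficiency rating.
SUB_AREA_WEIGHTS = {"phonology": 0.25, "lexicon": 0.25, "syntax": 0.50}

def proficiency_rating(correct, total):
    """correct/total: dicts of items answered correctly / items given,
    keyed by sub-area. Returns a weighted score between 0 and 1."""
    rating = 0.0
    for area, weight in SUB_AREA_WEIGHTS.items():
        rating += weight * (correct[area] / total[area])
    return rating

# Example: a child who answers 30 of 35 phonology items, 15 of 20
# lexicon items, and 10 of 26 syntax items.
print(proficiency_rating({"phonology": 30, "lexicon": 15, "syntax": 10},
                         {"phonology": 35, "lexicon": 20, "syntax": 26}))
```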
Aitken (1976) connects the audio-lingual method of language instruction with discrete point approaches and asserts it is based on two erroneous assumptions:

    1. The surface structure of a language can be systematically described and its elements listed and compared with any other language, similarly described.

    2. The mastery of a language may be divided into the mastery of a number of separate skills: listening, speaking, reading, writing. These skills in turn may be divided into a number of distinct items. It assumes that to have developed a criterion level of mastery of the skills and items listed for a language is to know that language. (pp. 7, 8)

The problem with assumption (1) is that it is impossible to make a list of all the items in a language; attempting to do so would lead to long and unwieldy tests. Assumption (2) does not take into account the role of previous personal experience in the assignment of meaning in language usage. There is, according to Carroll (1961), a redundancy factor in natural language that allows one to predict missing elements from the context (i.e., the cloze procedure), making it difficult to say that any given language item is essential to communication or to establish the functional load of any item in the system. In other words, a test of many isolated and separate points of grammar or lexicon is not a real test of language (Oller, 1973).

Levine (1976) has pointed out that often discrete item tests are used because they are technically objective and simple to administer and score. They do, however, provide certain types of "information about the state of a learner's knowledge about a language." The caveat is that they do not necessarily provide information about the ability to understand and use language appropriately in context. She contends that a kind of vicious cycle has grown up around discrete point testing. If a test is discrete, then the instruction should be similarly discrete. If mastery is shown by paper and pencil, then this is the way it is learned. Therefore, if this is how language is taught, then this is what "knowing" a language must be. There are a number of other viewpoints against discrete type testing that generally agree with the arguments already presented: Jony (1975), Bordie (1970), Cazden (1975b), and Spolsky, Murphy, Holm and Ferrell (1972).

Sociolinguists, however, have been reluctant to attempt to separate a subject's "knowledge" of a particular feature of his language (i.e., competence) from his use of that feature in normal communication (i.e., performance). In particular they have pointed out that an adequate measurement of a person's linguistic repertoire would include, among other things, considerations of what language is being used with whom, in what context, when discussing what particular content material. Communicative ability, according to Groot (1973), consists of "linguistic and non (para, extra) linguistic components." He points out that

    the descriptive linguistic components of tests are often considered to be a valid representation of a student's overall language proficiency. However, most of the evidence indicates that communicative ability is more than the sum-total of these linguistic components. Or, overall language proficiency is more than the knowledge of vocabulary, syntax and phonology. (p. 138)

This view of language argues for the other type of language test, called an integrative test.
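One concrete example of such an integrative task is the cloze procedure mentioned above, which exploits Carroll's redundancy factor: words are deleted at a fixed interval and the subject must restore them from context. A minimal sketch, with an arbitrary deletion interval and an invented sample passage:

```python
def make_cloze(text, n=5):
    """Delete every nth word, returning the gapped passage and the
    answer key. The subject's ability to restore the deletions from
    context is taken as a measure of integrative language skill."""
    words = text.split()
    gapped, answers = [], []
    for i, word in enumerate(words, start=1):
        if i % n == 0:
            answers.append(word)
            gapped.append("_____")
        else:
            gapped.append(word)
    return " ".join(gapped), answers

passage = ("The boy walked to the store because his mother "
           "needed milk and bread for breakfast the next morning")
gapped, key = make_cloze(passage)
print(gapped)
print(key)
```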
If a discrete form of test focuses on isolated linguistic units, then an integrative type has the subject perform tasks similar to those that occur in real life. In distinguishing between discrete point tests and integrative tests, Aitken (1975) stated,

    Discrete point tests are based on the assumption that there are a given number of specific structure points, the mastery of which constitutes 'knowing' a language ... an integrative test is one based on the premise that 'knowing' a language must be expressed in some functional statement ... that taps communicative competence factors. (p. 7)

Valette (1977) chose to call the integrative type of test a global language test because it measures the student's ability to understand and use language in context. Valette (1977) developed a matrix contrasting the two types of tests (Fig. 2):

Test type              Item type                     Competence                 Performance
DISCRETE point tests   multiple choice/short         linguistic competence      formal performance
                       answer items                                             objectives
GLOBAL tests           communication items           communicative competence   open-ended performance
                                                                                objectives

Figure 2. DISCRETE point and GLOBAL tests

Carroll (1961) and Spolsky (1968) propose that:

1. discrete point tests should be used for controlling instruction, deciding what is to be taught, and how well something has been learned;
2. integrative tests should be used for proficiency purposes.

If one accepts the sociolinguist's view of language, and there is good evidence that elicited speech differs radically from spontaneous speech in children learning English as a Second Language (Wong-Fillmore, 1976), then a child's total language proficiency would include his/her ability to understand and express himself/herself in every conceivable communicative situation. This would include using language for:

1) performing different functions (e.g., giving directions, asking questions, describing things, narrating experiences, telling stories, etc.);
2) communicating with different people;
3) communicating appropriately in different social contexts (e.g., formally, informally, intimately, etc.); and
4) communicating about different topics (e.g., home, school, neighborhood, animals, plants, etc.).

Obviously, to sample a child's performance in all of those areas would be totally impractical if not impossible.

How Is Proficiency To Be Measured

As can be seen from the above discussion, language proficiency is an extremely complex phenomenon whose measurement requires sampling of behaviours along a number of dimensions. The task of the test designer is to develop a procedure which will ensure that the language samples collected will be representative of the child's entire repertoire of important linguistic abilities. This does not necessarily mean that the instrument must sample each language behaviour, but in order to be a valid measure of language proficiency, the skills measured must bear an identifiable relationship to all of the abilities necessary to the subject's overall language proficiency. If proficiency is to be adequately measured, instruments must meet certain criteria of validity, reliability and practicality.

There are two kinds of approaches to validity: logical analysis, in which one tries to judge precisely what the test is measuring, as with content validity; and empirical analysis, in which one tries to relate the test to a criterion known or assumed to measure some characteristic, as with concurrent and predictive validity.
Construct validity uses both logical and empirical analysis. Oller (1977) warned that it is possible that a test labelled as a proficiency test might really be testing and classifying I.Q. or some other educational objective, such as reading. Hillman (1972) sees validity used as a predictor of certain vocal-verbal features conditioned by the circumstances surrounding those vocal-verbal features. There is research by McDavid (1966) which indicated that

    stress, intonation, pitch and associated paralanguage gestures are more indicative of language ability than any of the other usual language characteristics normally thought significant for measurement, such as syntax, vocabulary, grammar, and so on. (p. 53)

He indicated that although tests are reliable instruments which can measure accurately and consistently, they may not be measuring the right thing. In fact, many language tests really only measure written English with little oral production, as in the Language Assessment Battery.

Scharf (1972) was concerned about the variability in the rate of increase in language, which makes the use of age as a basis for comparing children a poor one, since children do not develop consistently. Children may make reversals as a part of their normal acquisition of language. This concern is relevant for this study because all the tests under consideration either had different forms given according to age, or used age as a factor in assigning a linguistic category according to a score, as occurred with the BINL.

Another factor which must be taken into account is the interaction between examiner and child. Various researchers have found that the type of stimulus materials, among other things, can have an effect upon the child: Phillips (1966), Cohen (1975), and Condon (1975). Similarly, the age, sex, and socio-economic status of the child can have an effect (Cowan, Weber, Hoddincott, and Klein, 1967). According to Swain (1976), certain aspects of language are easier to measure:

    What rarely are measured are those aspects of language which are difficult to measure ... because they are not well enough understood to develop a relevant test, or because the collection and analysis of the data are simply too time consuming. (pp. 13, 14)

In other words, language tests usually measure skill aspects of language rather than communicative, creative, or aesthetic aspects. Often, too, "tests tend to be the only accepted means of obtaining performance data."

Reliability is also important. To be dependable, one needs: multiple samples of the skill being tested; standard tasks, so that all subjects are required to perform the same task; standard conditions of administration; and standard scoring or interscorer reliability.

Practicality is the other factor that is germane to the selection of a test. Considerations of time, expense, skills needed for administration and scoring, and ease of administration are all things that cannot be ignored. A test may be valid and reliable, but if it costs too much, it will not be used. Similarly, if it is awkward to administer or difficult to score, it will not be used.

To summarize, an instrument needs to be valid in what it tests and how it tests for that objective; it needs to be reliable so that it can be used consistently; and it needs to be practical in terms of the cost, time, effort and skill needed to administer and score it.
The Four Tests

In view of the previous discussion in this chapter, it is appropriate to examine what has been written about the tests in the study. There were few published references that dealt critically with the four tests. Most often the tests were mentioned as existing, but with no critiques, usually only factual descriptions (see Appendix A). The test information that is available is primarily limited to the accompanying technical manuals written by the developers. A matrix was developed to compare the four tests used in this study (see Figure 3).

PHONOLOGY
  Listening:   LAS - 12.5% of score, 30 items, sound discrimination (minimal pairs)
  Production:  LAS - 12.5% of score, 35 items, phonemes
LEXICON
  Listening:   LAB I - 50% of 20 items, identifying parts of body by pointing to pictures
               LAS - 12.5% of score, 10 items, identify correct picture in response to taped statement
  Production:  LAB I - 35% of 20 items, 7 responses to questions
               LAS - 12.5% of score, 20 items, identify pictures
SYNTAX
  Listening:   LAS - (50%, see Production) taped story with 4 guide pictures
  Production:  BINL - 100% of score, 10 spontaneous samples
               BSM - 100% of score, 26 responses to questions
               LAB I - 15% of 20 items, 3 responses to questions
               LAB II - 100% of 14 items, elicited response
               LAS - 50% of score, retell story heard on tape

Figure 3. Skills measured by the four tests

BINL

The BINL (Herbert, 1977), as shown in Figure 3, measures oral production skills. The discrete point approach to language testing is rejected by the author, Herbert. His orientation is towards what he terms natural language. Therefore no particular structures or elements are sought, and questioning is not allowed during the language sampling except as it might come up in a natural conversation. The act of describing or telling a story about a picture is considered to be a natural situation. Ideally the test is meant to be tape recorded as a conversation between peers, and the role of the adult examiner is supposed to be minimized. This has not been true in actuality. Herbert stated that

    Evaluating language production involves two major aspects of language ability.
    1. We must consider the language competence of the child; the language the child has at his command to express his thought.
    2. We must measure the language production of the child; that which he says. (p. 6)

Validity

The manual indicates three validity studies that were done. The first study dealt with the scoring system. The contention was that there was a positive correlation between average sentence length and level of complexity. The sample size was not large (182 English Dominant Speakers and 160 Spanish Dominant Speakers) and no theoretical basis was given for making that type of correlation. The second study relied on a single correlation done by the Fresno Unified School District regarding the BINL's validity for determining Dominance and Proficiency. Its particular value for this study was not indicated. The third study compared the BINL Oral Language Complexity score with the Gilmore Oral Reading Test, which consists of graded paragraphs of increasing difficulty which are read aloud, and found a relationship between the two tests. No reason for the use of that particular test for correlational purposes was given. A Spearman-Brown split-half coefficient was also computed, which indicated that there were consistent levels of oral language complexity across the 10 sentences spoken for those students dominant in either language.
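A split-half coefficient of the kind reported in the BINL manual can be illustrated with a short sketch: scores on the odd-numbered sentences are correlated with scores on the even-numbered ones, and the Spearman-Brown formula, r_sb = 2r / (1 + r), corrects for the halved test length. The complexity scores below are made up; the BINL's actual scoring values are not reproduced here.

```python
def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores: one row per subject, one column per sentence scored
    for complexity. Correlate odd-item totals with even-item totals,
    then apply the Spearman-Brown correction r_sb = 2r / (1 + r)."""
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r = pearson_r(odd, even)
    return 2 * r / (1 + r)

# Hypothetical complexity scores for 5 subjects on 10 sentences each.
scores = [[3, 4, 3, 5, 4, 4, 3, 5, 4, 4],
          [1, 2, 1, 1, 2, 1, 2, 1, 1, 2],
          [5, 5, 4, 5, 5, 4, 5, 5, 4, 5],
          [2, 3, 2, 2, 3, 3, 2, 2, 3, 2],
          [4, 3, 4, 4, 3, 4, 4, 3, 4, 4]]
print(split_half_reliability(scores))
```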
Nothing was indicated in the manual, however, about those students who were LES. The claim is made that syntactic complexity parallels language development in children and that there is a direct correlation between the two. However, Labov (1973), Pope (1974) and Goldberg (1972) found that complexity of utterances can be influenced by the task, i.e., the particular topic being discussed.

The semantic content is not measured. It does not matter what a child says as long as it is grammatically correct, i.e., a child might say, "I see a dog" when it is an elephant in the picture. The only task is one of storytelling. There is no indication whether this task correlates with other ones such as giving directions or asking questions.

Reliability

The testing situation lends itself to many interpretations. Cazden (1975) and Phillips (1975) had doubts about the reliability of tests using this procedure because the testing situation, in and of itself, lends itself to distortion, due to anxiety and hypercorrection, since most testing situations are contrived ones. There are other sociolinguistic variables also, such as the race of the examiner, that have not been considered.

The manual mentions that a split-half correlation was done on an experimental group to determine the reliability of the instrument. While this is useful and provides some reliability information for the test, it does not deal with the weightier problem of variability in testing conditions, which might be revealed in a test-retest study. The fact that the instrument has been in use for several years and no such test has been conducted raises suspicions regarding the reliability of the test.

BSM

The BSM was developed by Burt, Dulay and Hernandez (1975). It has a very complete theoretical framework that is derived from the assumption that children acquire a second language by a process of "creative construction" (p. 11). They used many sources on child acquisition of languages, both L1 and L2, as a basis for the design of the BSM. There seems to be, according to certain studies, a common order of acquisition of certain English grammatical morphemes by children acquiring English as a second language in the United States. After considering and rejecting pronunciation, vocabulary and functional use of language as possible indicators of proficiency, the authors chose a discrete point approach using syntax as the sole measure. The test was constructed so as to elicit naturally a range of structures in varying phases of acquisition, using a structured conversation technique. Questions were asked referring to one or more of a set of pictures.

Validity

The authors of the BSM chose to use construct validity as the most appropriate way of approaching the question of validity. According to Burt, Dulay and Hernandez, the construct validity of the BSM is supported by:

    1. the body of psycholinguistic theory and research supporting the concept of natural sequences of acquisition;
    2. the statistical procedures and research which have resulted in the production of the acquisition hierarchy and of the scoring system which is based on this hierarchy; and
    3. evidence that the BSM classifications reflect the relationships expected to be found among bilingual children. (p. 32)

The only linguistic aspect tested is syntax. Several studies cited in a paper by Krashen (1978) showed that children acquiring English either as a first or second language showed a similar order of acquisition of grammatical morphemes.
This "natural order" of language acquisition theory is not without its critics (Rosansky, 1976), some of whom achieved different results using the BSM than those indicated by the authors. Reliability The two major reliability studies undertaken dealt with test-retest reliability and interscorer reliability. On the test-retest study, one out of three students scoring on levels 4 and 5, which are the most crucial levels for discriminating between subjects for program placement, was placed at least one level higher or lower on the post test than on the pre-test. , .. .• ): This means that 23 one out of three students could be misplaced by a single administration of the BSM. However, it was ascertained that the interscorer reliability was higher for the English test than for the Spanish test due to there being more disagreements over the proper form of the many dialects of Spanish spoken. LAB The LAB was developed by the Board of Education of New York City in response to a Consent Decree of 1974. It is a discrete point test with a great emphasis upon receptive skills, since in each of the three versions there are far more items in listening than in any other oral language skill, upon production. There is also less emphasis placed Since the LAB was intended to be given only as a battery, it was not possible to use only the oral production sections separately and we were unable to use the results from this test in the study. Therefore, no further discussion of its validity or reliability will be attempted. LAS The LAS was developed by DeAvila and Duncan (1975). It measures a student's performance across four linguistic subsystems: phonemic, referential, syntactical and pragmatic. There is a theoretical basis given for the inclusion of each subsystem. It provides a profile of the linguistic problems of the individual child, which is one of its strong points. In order to avoid problems of interscorer reliability, a tape cassette is used for almost all of the test so that the children tested hear the same thing in the same manner. It is somewhat complex to administer since the examiner must handle a tape recorder, a scoring sheet and a test booklet with faint pictures, and also at one point must write down a story told by the child. Validity The LAS is discrete point for the first four parts of the test and then changes in the last section with the re-told story. well. In th~ The two halves do not correlate very study for internal reliability, the syntax production is excluded. .89 for phonemes, The totals for subsystems were: .87 for minimal pairs, and .68 for comprehension. .72 for lexical, The comprehension coefficient appears to be lower than'the others. There does not seem to be very much connection between the two halves of the test. For instance, the pronunciation section is con- trasted with the storytelling where accent is irrelevant. Yet the storytelling section is worth half the total score. The same problems occurred in the production section as with the BINL, where content was not particularly important. The validity studies done by the authors dealt primarily with the syntactic section. An attempt is being made to show criterion validity, i.e., the classification received on the LAS correlates with academic achievement. Those studies are in progress. Reliability As mentioned above, the authors of the LAS are the only one of the four tests studied who controlled for interscorer reliability. This was done by using a cassette to administer the majority of the test items. 
The only problem was with the children in the study who were unused to a tape recorder. They seemed overcome by it at times.

Practicality

All oral language tests must be administered individually, which has a direct effect upon considerations of practicality. Practical experience in the administration and scoring of all four tests in the study has led to the following conclusions.

In terms of time and equipment necessary to administer the tests, the LAS required the most equipment and the most time, since there was no cut-off mechanism when a child was obviously NES. A tape recorder and pre-recorded tape, scoring sheet and pencil, and test booklet with pictures were required. The LAB required a scoring sheet, instruction/picture booklet, student answer sheet and pencils. Since the written part was not administered for this study, it is not as familiar to the researcher. However, it is the only test that could be administered to a small group. The BSM used a picture booklet, a scoring sheet with the questions on it, and pencils. There was a cut-off mechanism to use if the child could not answer certain questions. The BINL required pictures and a tape recorder to administer. The taped response was later transcribed and scored.

As far as the skill and training necessary to administer the four tests, the LAB was administered according to the manual without any problems. The BSM was also explicit in what the examiner was to do and say. The LAS, since it used a cassette, avoided problems in this area. For the re-told story part, questions were indicated on the scoring sheet. The BINL was the only one where the training and the manual appeared to be inadequate. Even though the Los Angeles school district provided the training at several inservices, there was a great deal of variety in the actual administration of the test.

Ease of scoring is very important and, since none of the tests except the LAB is totally discrete, there is an element of judgment involved. The BSM is easier to score because the desired structure is indicated. The first part of the LAS is very simple, since only incorrect responses are indicated on the score sheet. However, the oral production section involves a great deal of judgment to score.

To summarize, a certain amount of time is involved in the administration of the tests, more for the LAS than the others. A certain amount of skill and training is necessary to administer the four tests, with the BINL requiring more intensive training. The scoring is simpler for the LAB and the BSM and more difficult for the LAS and the BINL. The transcription of the BINL can be very tedious. It is also an inhibitor to the examiner, and possibly to the subject, when answers must be dictated and written down.

All of the four tests had weaknesses to one degree or another. In a paper presented at the National Association of Bilingual Educators conference in Seattle in 1979, Dieterich, Freeman and Crandall discussed some findings of a recent study of proficiency tests. Their conclusions were that the tests that exist today are appropriate for discriminating between the extremes of proficiency but are unable to distinguish more exactly the range of proficiency between these two extremes. Almost every test fails to measure what it says it is measuring. For instance, the LAS purports to measure the understanding of a passive sentence, but only one picture shows the subject of the sentence, and therefore the item is really a measure of vocabulary.
Similarly, the LAB presents a sentence intended to test grammar, but again only one picture of the subject is shown, and so it is vocabulary that is tested. The BINL type of test, where a child is asked to tell a story or describe a picture, tends to penalize the child who is conservative in language learning style. There is a problem, too, with those tests that require a complete sentence for an answer, since normal speech is somewhat elliptical. The assignment of complexity levels to utterances is debatable due to assumptions made about the complexity of certain structures. The result of this study, according to Dieterich, Freeman and Crandall, is that those tests which adhere to the evidence about English language acquisition are certainly more useful, but they are not enough.

    Until we have more evidence of what linguistic, social, and other skills children actually need to enable them to function in an English-speaking classroom (including cognitive strategies), the people in schools who are probably in the best position to know ... are the teachers who deal with the students on a daily basis. (p. 20)

They also feel that more than one measure is needed to supplement the teacher's judgment. The only problem with teacher judgment is that there are some teachers who are not truly perceptive and feel that if a child can carry on a simple conversation or obey classroom commands, then that child is fluent. In that situation the test is necessary to give a somewhat more accurate assessment of the child's proficiency. Thus it becomes evident that there is little agreement on the type of test, what is to be tested, and whether or not it should be standardized or criterion-referenced, which leads us back to the idea that it may not be possible to measure the language of children with anything more than superficiality. In Chapter III the design and procedure of the study will be discussed.

CHAPTER III

DESIGN AND PROCEDURE OF THE STUDY

The purpose of this study was to administer four oral English language proficiency tests to a target population to determine whether the tests were measuring the same factors, and whether they were assigning the children to the same linguistic categories.

Tests

The four tests and their ranges of administration are:

1. Basic Inventory of Natural Language (BINL): K - 12
2. Bilingual Syntax Measure (BSM): Level I, K - 2; Level II, 3 - 12
3. Language Assessment Battery (LAB): Level I, K - 2; Level II, 3 - 6; Level III, 7 - 12
4. Language Assessment Scales (LAS): Level I, K - 5; Level II, 6 - 12

Subjects

The target population for the study was identified as 150 first graders in 8 classes and 145 third graders in 8 classes, identified as speaking other than English at home by the Home Language Survey required by the Los Angeles Unified School District to satisfy LAU requirements. The survey form is completed by the child's parents, indicating what language(s) are used at home and what language the child hears or speaks. The majority of the children were from Mexico. The school is an inner-city school. Random samples of 61 first graders and 64 third graders were selected using class lists for names. The combined sample totalled 125 children.

Design

To minimize testing order and examiner effects, a counterbalanced design was used. The subjects for each grade level were randomly assigned to each of the four examiners, for whom the order of testing was varied.
The tests were administered in the following order:

Examiner I - BINL, LAB, BSM, LAS
Examiner II - LAS, BSM, LAB, BINL
Examiner III - BSM, LAS, BINL, LAB
Examiner IV - LAB, BINL, LAS, BSM

Procedure

The examiners were all native speakers of English who were bilingual in the children's first language. They were trained in the administration of the instruments through five hours of inservicing by the author of this study, except for the BINL, for which training was conducted by the Los Angeles Unified School District. All teachers received BINL inservice as part of a district-wide program. Test manuals were available during the training.

One week of testing was planned per grade, with one test administered to a child per day. All first graders were tested in the morning, since their school day ended shortly after lunch. The third graders were tested during the morning and afternoon. All testing was completed in four weeks, including testing those children absent on the regular testing day. There was strict adherence to the order of testing as detailed above. Each child was tested individually, with only one test given per day.

In order to gain the cooperation of the teachers whose classes were involved in the study, an agreement to do their group BINL testing was made. BINL testing was conducted in the classroom in accordance with the school district's policy. The other tests were given outside the classroom where possible, so as not to disturb the teachers and lose their cooperation.

Only one form of the test was used for the BINL and the LAS. Levels I and II were used for the BSM and the LAB. To include as many levels of the tests as possible, first and third grades were selected as the target population. The BINL transcriptions, made by the examiners, were sent to the Bilingual/ESL section of the Los Angeles Unified School District to be machine-scored with the rest of the school's tests, since the results were to be used by the district as well as by this study. The remainder of the tests were hand-scored by the researcher.

In order to have as parallel a testing situation as possible, only the Speaking and Listening, Level I, and the Speaking, Level II sections of the LAB were used. The written sections were not used. However, since the LAB was intended to be given as a battery of tests, the scoring manual did not give any separate test equivalents, and the results could not be subjected to the statistical analysis along with the other tests. For the other three examiners' reactions to the tests and their administration of them, see Appendix B.
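A minimal sketch of how the counterbalanced assignment described in the Design section might be implemented. The examiner orders follow the list given earlier in this chapter; the subject identifiers are placeholders, not data from the study.

```python
import random

# Test orders per examiner, as listed above (a counterbalanced design).
ORDERS = {
    "Examiner I":   ["BINL", "LAB", "BSM", "LAS"],
    "Examiner II":  ["LAS", "BSM", "LAB", "BINL"],
    "Examiner III": ["BSM", "LAS", "BINL", "LAB"],
    "Examiner IV":  ["LAB", "BINL", "LAS", "BSM"],
}

def assign(subjects):
    """Randomly assign the subjects within one grade level to the four
    examiners, each of whom administers the tests in a fixed order,
    one test per child per day."""
    pool = list(subjects)
    random.shuffle(pool)
    examiners = list(ORDERS)
    return {child: examiners[i % 4] for i, child in enumerate(pool)}

first_graders = [f"S{i:03d}" for i in range(1, 62)]  # 61 subjects
assignment = assign(first_graders)
for child, examiner in sorted(assignment.items())[:3]:
    print(child, examiner, ORDERS[examiner])
```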
Statistical Analysis

With the exception of the BINL, which was machine-scored by the school district, the tests were hand-scored by the researcher. Both the BSM and the LAS were complicated to score, since there were some judgments that had to be made about the complexity of responses. The LAS was more difficult to score because it required scoring of the dictated re-told story, using samples in the manual which were scored according to the age of the subject.

After administering the tests and scoring them, there were four scores for each child. These were first arranged by grade level and examiner (see Appendix C). After keypunching these scores, two data decks were obtained: one with the raw scores, and one with the converted scores, or z-scores. It was necessary to use z-scores for the comparison of different tests. The z-score is a standard score that is defined as the distance of a score from the mean as measured in standard deviation units (Ary, Jacobs, Razavieh, 1972). The formula for finding the z-score is:

    z = x / σ = (X - X̄) / σ

where
    X = raw score
    X̄ = the mean of the distribution
    σ = the standard deviation of the distribution
    x = deviation score (X - X̄)

The size of the sample, N = 125 (61 first graders and 64 third graders), was sufficient to use several correlational measures. Ary, Jacobs, and Razavieh (1972) define correlation as

    the statistical technique used for measuring the degree of relationship between two variables ... Correlations show us the extent to which values in one variable are linked or related to values in another variable ... The measure of correlation between two variables results in a value that ranges from -1 to +1 ... A coefficient of correlation near unity, -1 or +1, indicates a high degree of relationship ... Correlation coefficients in educational and psychological measures, because of the complexity of these phenomena, seldom reach the maximum points of +1 or -1. For these measures, any coefficient that is more than .90 is usually considered to be very high. (pp. 115, 116)

Three correlations were considered: the Pearson product moment coefficient (r), the Spearman rho, and Kendall's tau. The Spearman rho was not used, since there were many tied scores in the study, which can affect rank ordering, a part of the process used in obtaining the Spearman rho.

The Pearson product moment coefficient is obtained from the mean of the z-score products, i.e., each individual's z-score on one variable is multiplied by his/her z-score on the other variable. These paired z-score products are added, and the sum is divided by the number of pairs. The Pearson r assumes an interval scale and is related to the mean. The formula for the Pearson r is:

    r = Σ(z_x z_y) / N

Since there were four variables, more than one coefficient had to be figured. As mentioned earlier, the LAB scores could not be used in many of the calculations, because only part of the battery was used and there was no provision made in the scoring manual for this.

In addition, Kendall's tau was applied to the scores. This, like Spearman's rho, is non-parametric and does not depend upon a normal distribution. It is related to the phi coefficient, which assumes a nominal scale (a genuine dichotomy, i.e., male/female) characteristic of both variables.

In order to translate the linear relationship into more practical terms, the following question was asked: were the four tests assigning the same child to the same linguistic category (Non-English Speaking, Limited-English Speaking, Functional-English Speaking, or Proficient-English Speaking) each time? To answer the question, a crosstabulation of the individual converted scores, or z-scores, was run on the computer, as well as a scattergram, which gave a graphic representation of the distribution of the scores.
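The computations described in this section can be sketched directly from the formulas above. This is an illustration of the method rather than the original computer runs, and the eight paired scores are invented:

```python
def z_scores(raw):
    """Convert raw scores to z-scores: z = (X - mean) / sd."""
    n = len(raw)
    mean = sum(raw) / n
    sd = (sum((x - mean) ** 2 for x in raw) / n) ** 0.5
    return [(x - mean) / sd for x in raw]

def pearson_r(x, y):
    """Mean of the paired z-score products: r = sum(zx * zy) / N."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    Tied pairs count as neither; the thesis notes ties were common."""
    n, c, d = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)

binl = [12, 34, 5, 48, 27, 19, 41, 8]   # hypothetical raw scores
las = [10, 30, 9, 45, 22, 25, 38, 6]
print(pearson_r(binl, las), kendall_tau(binl, las))
```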
Chapter IV will discuss the findings of this study; the statistical correlations made for each of the tests are analyzed there.

CHAPTER IV

FINDINGS OF THE STUDY

One of the problems with this study was that, while dominance and proficiency measures were the goal of these tests, i.e., the same test or its equivalent was given in two languages, English and the child's native one, the term dominance itself is ambiguous. Silverman (1977) feels that language dominance as a term needs to be well defined before testing for it. Thus the use of tests to measure both proficiency and dominance is thrown into doubt.

As indicated in the preceding chapter, the statistical techniques applied to the results obtained in this study were correlation coefficients. They were used to: determine if each of the instruments rendered comparable proficiency scores for the subjects; and ascertain any difference between the tests that would indicate that one was more reliable or valid than another. The Pearson r correlation is shown in Table 1, using all four tests.

Table 1. The Pearson r correlation

         BINL      LAB       LAS       BSM
BINL     1.0000    .5206     .7874     .6706
LAB      .5206     1.0000    .6754     .5908
LAS      .7874     .6754     1.0000    .7556
BSM      .6706     .5908     .7556     1.0000

N = 125

No correlation was over .7874. According to Ary, Jacobs, and Razavieh (1972), this is not a particularly high correlation. Squared, it gives a shared variance of .62, which shows that less than two-thirds of the variance is common to the tests.

The other correlation used was Kendall's tau (Table 2), which makes no assumption about the distribution of the cases. It is used to determine if two rankings of the same cases are in the same order.

Table 2. Kendall's tau correlation

         BINL      LAB       LAS       BSM
BINL     1.0000    .3956     .6316     .6177
LAB      .3956     1.0000    .4893     .5364
LAS      .6316     .4893     1.0000    .6915
BSM      .6177     .5364     .6915     1.0000

N = 125

Here the highest correlation is .6915, between the LAS and the BSM. When the coefficient is squared, however, the result, .4781, shows that less than one half of the variance is common to the two tests.

The statistical procedure that showed the relationships most clearly between the three tests for which converted scores could be obtained (excluding the LAB) was the crosstabulation of the individual tests' converted scores (see Tables 3, 4, and 5). The crosstabulations compared pairs of tests on their placement of children in the four linguistic categories defined earlier: NES, LES, FES, and PES (see Appendix C for the cut-off scores for assigning subjects to these categories). Since pairs of tests were compared, it was necessary to run three different crosstabulations. In each table, the categorizations that agreed, i.e., where both tests placed the same child in the same category, are shown in brackets.

Table 3. Crosstabulation of LAS and BINL scores

                       LAS
BINL       NES    LES    FES    PES    TOTAL
NES       [56]      1      2      1       60
LES         26    [6]      6      1       39
FES          8      3    [6]      1       18
PES          0      2      2    [4]        8
TOTAL       90     12     16      7      125

Table 4. Crosstabulation of LAS and BSM scores

                       LAS
BSM        NES    LES    FES    PES    TOTAL
NES       [86]      7      3      2       98
LES          3    [0]      1      0        4
FES          0      2    [7]      1       10
PES          1      3      5    [4]       13
TOTAL       90     12     16      7      125

Table 5. Crosstabulation of BINL and BSM scores

                       BSM
BINL       NES    LES    FES    PES    TOTAL
NES       [58]      0      1      1       60
LES         29    [3]      2      5       39
FES          9      1    [5]      3       18
PES          2      0      2    [4]        8
TOTAL       98      4     10     13      125

If the crosstabulation of placement for the BINL and the LAS is examined (Table 3), it is seen that, of the 90 NES children identified by the LAS and the 60 identified by the BINL, the two tests agreed on 56 out of 125 as being NES. There were 26 children that the LAS categorized as NES that were categorized as LES by the BINL, and 8 children that the BINL considered FES as opposed to NES for the LAS. For the LES category, there were 39 identified by the BINL and 12 by the LAS, with agreement on only 6 of the children. The total for the FES category was closer, 16 for the LAS and 18 for the BINL, but with agreement on only 6 scores.
In the PES category there were 7 scores for the LAS and 8 for the BINL, with agreement on 4 scores. There was an overall agreement for this pair of tests of 57.6% (72 out of 125 scores).

Looking at the LAS and BSM crosstabulation (Table 4), it is seen that there is more agreement between these two tests in the NES category, with the LAS assigning 90 children to NES and the BSM assigning 98, with agreement in 86 cases. However, in the LES column it is evident that there was little agreement between the 12 LES scores of the LAS and the 4 LES of the BSM, since the BSM assigned 7 of the 12 LES scores of the LAS to NES. In the FES column, the LAS assigned 16 scores to FES and the BSM only 10, with agreement on 7 of the scores. In the PES column, the LAS placed 7 of the children, while the BSM placed 13, with agreement on 4 scores, the same as with the BINL and the LAS. The overall agreement between the LAS and the BSM was 77.6% (97 out of 125 scores).

The crosstabulation for the BSM and the BINL (Table 5) was different in that the BSM assigned more children to the NES category, 98, than the BINL, 60, with agreement in 58 of the cases. There was also a discrepancy in scores in the LES category, with the BSM assigning only 4 scores while the BINL assigned 39 scores to LES, with agreement on 3 scores. The BSM placed 10 scores in the FES category and the BINL put 18 children in that category, with agreement on 5 scores. In the PES category, the BSM assigned 13 children and the BINL assigned 8, with agreement on 4 children. The BSM placed more children in the PES category than did the other two tests. There was overall agreement of 56% (70 out of 125 scores).

Looking at the scores another way, the summary table (Table 6) allows a comparison of the three tests in their placement of children in linguistic categories.

Table 6. Summary of LAS, BINL, and BSM category placements

        NES    LES    FES    PES    TOTAL
BINL     60     39     18      8      125
BSM      98      4     10     13      125
LAS      90     12     16      7      125

It appears, with the exception of the NES category, that there is not very much agreement between the tests in any category as far as placing children at linguistic levels. Another aspect of the crosstabulation is that there were children placed in opposite categories by the tests. For instance, one child who was placed at the PES level by the BINL was put at the NES level by the LAS. Similarly, two children classified as NES by the BSM were placed at the PES level by the LAS, and vice-versa for another child. A like situation occurred with the BINL and the BSM (Table 5). It appears that a child may receive a different linguistic classification depending upon what test is administered.

Reviewing the tables, it would appear as though the BINL produced more LES scores and fewer NES scores than the other two tests. It is difficult to determine exactly why this is so. Does this mean that the BSM and the LAS tests are "harder"? Were the results skewed by the fact that the first grade produced far more NES children? One explanation might be different language acquisition patterns of children at different ages. Another explanation might be differences in the length of exposure to the English language, with the first graders not having had as much exposure as the third graders. Another reason for the discrepancies might be that the instruments are measuring different attributes: what one test measures in the name of language proficiency is not what another measures.
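The overall agreement figures quoted above are simply the diagonal of each crosstabulation divided by the total number of children. A minimal check against Table 3 (rows BINL, columns LAS):

```python
# Table 3 counts, rows = BINL category, columns = LAS category,
# both in the order NES, LES, FES, PES.
table3 = [[56,  1,  2,  1],
          [26,  6,  6,  1],
          [ 8,  3,  6,  1],
          [ 0,  2,  2,  4]]

total = sum(sum(row) for row in table3)                # 125
agree = sum(table3[i][i] for i in range(len(table3)))  # 56+6+6+4 = 72
print(f"{agree}/{total} = {100 * agree / total:.1f}% agreement")  # 57.6%
```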
As mentioned above, there are serious reliability problems with the instruments. Therefore it is not surprising to get different results from the tests even if they were measuring the same attributes.

It was evident that pupil error existed. The population being tested consisted, among others, of many children directly from Mexico. They were unaccustomed to tape recorders, which figured in the BINL and the LAS tests. They were unfamiliar with the testing situation itself, especially the first graders, and were culturally reluctant to speak at all, especially in a language which they did not control well or at all. Even though the LAS used a pre-recorded tape, the very first part of the test bothered a number of the first graders because it dealt with a concept that many had not mastered in any language - same or different. Several children broke down and cried when faced with the task, and one child could not continue.

In Chapter V a summary will be presented, conclusions will be given, and some recommendations will be made.

CHAPTER V

SUMMARY, CONCLUSIONS AND RECOMMENDATIONS

It is evident from this correlational study that the four tests, BINL, BSM, LAB, and LAS, do not test the same linguistic areas. Three of them, the BSM, LAB, and LAS, tested syntax, while the BINL did not do this directly. If a correlation coefficient of .90 is considered to be high (Ary, Jacobs, and Razavieh, 1972), and in this study the coefficients ranged from .521 to .787 for the Pearson r and from .396 to .692 for Kendall's tau, then it would follow that not only is there a low correlation, but also there is a lack of agreement in placement. As discussed in Chapter IV, among the three tests that were crosstabulated (Tables 3, 4, 5) there were instances where the LAS and the BSM placed children in opposite categories, with 77.6% overall agreement for all the scores; and also where the BINL and the BSM did the same, with an overall agreement of only 56% for all the scores. The greatest shared variance obtained using the Pearson r (Table 1) was .62, and for Kendall's tau (Table 2) it was .48. Therefore the tests do not seem to measure the same thing.

As discussed in Chapter II, there are different theoretical bases for the tests, and quite possibly other variables, such as inter-examiner reliability, may have entered into the results. A summary of the negative and positive aspects of the tests, based on the literature review and personal experience, follows:

1. BINL
   a. Lacks inter-examiner reliability - the administration can vary widely.
   b. There is not enough validity shown - construct validity is lacking.
      i. The scoring values are arbitrarily assigned.
      ii. The technical manual is too ambiguous regarding transcription.
   c. It doesn't show the full range of a child's control of language structures.

2. LAB
   a. It is a discrete point test.
      i. Younger children often don't read or write.
      ii. Not enough justification is given for the items on the test.
      iii. It was only normed in New York City with Puerto Rican children.

3. BSM
   a. Having to take down dictated responses can be difficult.
      i. It leads to inter-examiner reliability problems.
      ii. It can inhibit a child's production, i.e., if the examiner doesn't hear the response the first time.
   b. The theory, although sound, has been challenged by Scharf (1972).
   c. Since it is an interval item test, it may not have enough items between categories.
   d. There is concern about the content of the story in English, i.e., drinking pink ink (children can be suggestible).

4. LAS
   a. It tests most linguistic aspects.
      i. It would be better used as a diagnostic test.
The concept of "same or different" on the minimal pairs part of the test may not have been acquired by younger children. c. It is difficult to administer. i. ii. Dictation is hard to record. The juggling of materials is hard on the examiner. iii. The pictures in the test booklet are too faint to be seen easily. d. The tape is intimidating to some children but it does deal with the problem of interexaminer reliability. e. It appears to have been developed with more research than the BSM, LAB, or BINL tests. Another problem with the LAS test was the amount of time necessary for administration. The BINL and the BSM tests had cut-off mechanisms for the NES child where the testing would stop more quickly. whole tape had to be played. With the LAS, the This alone would make it very difficult and time-consuming to use in situations where large numbers of children are to be tested and where there are time constraints. It appears that none of these four tests is fully adequate to fulfill the testing and classification requirements of LAU versus NICHOLS. The better tests of the four seem to this researcher to be the LAS or the BSM based on the correlations, the percentage of agreement on placement and for practical considerations. At present, however, it seems as if it may actually be impossible to ascertain exactly how much language a child controls in any but a superficial way . . ;. ,~ . ·.• I ~ 51 Recommendations With the use of tests having questionable validity, there is a real chance that a child might be placed in the wrong program. The cost in human potential wasted is alarming to consider. However, with the continued influx of children speaking languages other than English into the school systems of California, and the mandates of LAU versus NICHOLS, there must be linguistic classification of these children for program purposes. As a stopgap measure, since they must be placed, a combination of tests plus teacher observation can be used. This combination, in turn, has the limitation of subjectivity of observation and must be applied with extreme care. It is imperative that the whole process be improved since it is children who are being classified and whose educational success is at stake. A test is needed that: 1. will show proficiency in English; 2. is rel'a t i vely easy to administer, especially to large groups of children; 3. is relatively easy to score; 4. is reliable and valid. Any replication of this study should, among other things, increase control for inter-examiner reliability. 52 The LAB cannot be used successfully except in its entirely and therefore should not be included in the study. BIBLIOGRAPHY Aitken, Kenneth G., "Discrete Structure-Point Testing: Problems and Alternatives", TESL Reporter, Vol.9, No. 4, 1976, pp. 7-9, 18-20. Anderson, Scarvia B., "Verbal Development in Young Children: Strategies for Research and Measurement", Paper at International Congress of Psychology, August, 1972. Assessment Instruments in Bilingual Education, Center for Bilingual Education, Northwest Regional Educational Laboratory, California State University, Los Angeles, Ca., 1978, pp. 10-ll, 26-27. Blatchford, Charles H., "A Theoretical Contribution to ESL Diagnostic Test Construction", Paper at Fifth Annual TESOL Conference, March, 1971. Bordie, John G., "Language Tests and Linguistically Different Learners: The Sad State of the Art", Elementary English, 47.5, October, 1970, pp. 814-828. Briere, Eugene, "Current Trends in Second Language Testing", Papers on Language Testing 1967-1974, ed. 
Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 220-228.

Britton, Augusto, "A Brief Review of Assessment Instruments for the Bilingual Student", CABE Evaluation Task Force, Office of Los Angeles County Superintendent of Schools, Division of Curriculum and Instruction, Los Angeles, Ca., May, 1975.

Carroll, John B., "Fundamental Considerations in Testing English Proficiency of Foreign Students", Testing, Center for Applied Linguistics, Wash., D.C., 1961, pp. 31-40.

________, "Foreign Language Testing: Will the Persistent Problems Persist", Paper at ATESOL Conference, June, 1973.

Cartier, Francis A., "Criterion-Referenced Testing of Language Skills", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 19-24.

Cazden, Courtney B., "Concentrated vs Contrived Encounters: Suggestions for Language Assessment", Urban Review, 8.1, Spring, 1975 (a), pp. 28-34.

________, "Hypercorrection in Test Responses", Theory Into Practice, 14.5, December, 1975 (b), pp. 343-346.

________, and Others, "Language Assessment: Where, What and How", Anthropology and Education Quarterly, 8.2, May, 1977, pp. 83-91.

Cohen, Andrew, "The Sociolinguistic Assessment of Speaking Skills in a Bilingual Education Program", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 172-183.

Condon, Eliane, "The Cultural Content of Language Testing", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 204-217.

Cowan, Weber, Hoddincott, and Klein, "Mean length of spoken response as a function of stimulus, experimenter, and subject", Child Development, 38, 1967, pp. 191-203.

DeAvila, Edward A. and Duncan, Sharon E., LAS Language Arts Supplement, Spanish, Revised Edition, Linguametrics, Corte Madera, Ca., 1977.

________, Cervantes, and Duncan, "Bilingual Programs Exit Criteria", CABE Research Journal, Vol. 1, No. 2, September, 1978, pp. 21-39.

De George, George P., "Guidelines for Selecting Tests for Use in Bilingual/Bicultural Education Programs", Paper at MATSOL Spring Conference, 1975.

Doyle, Vincent, "A Critique of the Northwest Regional Laboratory's Review of the MAT-SEA-CAL Oral Proficiency Tests", Paper, October, 1976.

Ehrlich, Alan, Tests in Spanish and Other Languages and Non-Verbal Tests for Children in Bilingual Programs: An Annotated BEARD Bibliography, RIE, August, 1973.

Fahey, Virginia K. and Others, "Heritability in Syntactic Development: A Critique of Munsinger and Douglass", Child Development, 49.1, March, 1978, pp. 253-257.

Fitch, Michael J., "Verbal and Performance Test Scores in Bilingual Children", Thesis, Ed.D., University of Northern Colorado, 1966.

Fletcher, B., Locks, N., Reynolds, D., and Sisson, B., A Guide to Assessment Instruments For Limited English Students, Santillana Publishing Company, New York, N.Y., 1978.

Foster, R., Giddon, J., and Stark, J., Manual for the Assessment of Children's Language Comprehension, Consulting Psychologists Press, Palo Alto, Ca., 1972.

Garcia-Zamor, M., and Birdsong, D., Testing in ESL: An Annotated Bibliography, Cal-Eric/CLL Series on Languages and Linguistics, #40, January, 1977.

Gil, Sylvia, "BSM Assesses Linguistic Proficiency in English and Spanish", Paper, 1976.

Gonzalez, Josue and Fernandez, Ricardo, "Toward the Development of Minimal Specifications for LAD-Related Language Assessments", Bilingual Resources, National Dissemination and Assessment Center, Los Angeles, Ca., Vol. 2, No. 1, Fall, 1978, pp. 2-7.
Groot, Peter, "Validation of Language Tests", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 137-143.

Helmer, S., "Demonstration of Assessment of Language Dominance of Spanish Speaking Bilingual Children", Occasional Papers on Linguistics, Paper at International Conference on Frontiers in Language Proficiency and Dominance Testing, April, 1977.

Hillman, R. E., A Correlational Study of Selected Vocal-Verbal Behaviors and the Test of ESL (TOEFL), Thesis, Ph.D., Pennsylvania State University, 1972.

Hinofotis, F.A.B., An Investigation of the Concurrent Validity of Cloze Testing as a Measure of Overall Proficiency in ESL, Thesis, Ph.D., Southern Illinois University, 1977.

Jonz, Jon G., "Can't Language Testing Interface with Language Acquisition?", Paper at TESOL Conference, 1975.

Lado, R., Language Testing: The Construction and Use of Foreign Language Tests. A Teacher's Book, RIE, 1970.

Language Assessment Battery, Test Review in Bilingual Resources, National Dissemination Center, Los Angeles, Ca., Vol. 2, No. 2, Winter, 1979, pp. 40-41.

Language Assessment Scales, Publisher's Test Service, CTB/McGraw-Hill Publishers, Monterey, Ca., 1978, p. 2.

Levine, J., "An Outline Proposal for Testing Communicative Competence", English Language Teaching Journal, 30.2, January, 1976, pp. 128-134.

Luft, Max and Others, "Development of a Test Instrument to Determine Language Dominance of Primary Students: Test of Language Dominance (TOLD)", Paper at Annual Meeting of American Educational Research Association, April, 1977.

Matluck, J., and Mace-Matluck, B., "The Multilingual Test Development Project: Oral Language Assessment in a Multicultural Community", Paper at National TESOL Conference, March, 1975.

McDavid, R. I., and William, A., "Communicative Barriers to the Culturally Deprived", USOE Report, University of Chicago, Chicago, Ill., 1966.

Oller, John E., "Discrete-point tests vs tests of integrative skills", Focus on the Learner, ed. Oller and Richards, Newbury House, Rowley, Mass., 1973.

________, "How Important is Language Proficiency to I.Q. and Other Educational Tests?", Occasional Papers on Linguistics #1, Paper at International Conference on Frontiers in Language Proficiency and Dominance Testing, April, 1977.

Phillips, Judith, "The Effects of the Examiner and the Testing Situation Upon the Performance of Culturally Deprived Children. Phase I - Intelligence and language ability test scores as a function of the race of the examiner. Final Report", October, 1966.

Politzer, R. and McKay, M., "A Pilot Study Concerning the Development of a Spanish/English Oral Proficiency Test", Research Development Memorandum #120, 1974.

Puthoff, F. T., The Development of Norms for Bilingual First Grade and Third Grade Children's Responses to the Hand Test and Peabody Picture Vocabulary Test, Thesis, Ed.D., University of Oklahoma, 1972.

Randle, Janice A. W., A Bilingual Oral Language Test for Mexican-American Children, Thesis, Ph.D., University of Texas at Austin, 1975.

Robinson, Gail, "Linguistic Ability: Some Myths and Some Evidence", Paper, Australia, April, 1975.

Robinson, Pete, "Basic Factors in the Choice, Composition and Adaptation of Second Language Tests", Paper at TESOL Conference, March, 1969.

________, "The Composition, Adaptation, and Choice of Second Language Tests", English Language Teaching, 25.1, October, 1970, pp. 60-68.

Rosansky, Ellen J., "Methods and Morphemes in Second Language Acquisition Research", Language Learning, 26.2, December, 1976, pp.
409-425.

Rose, S. and Others, "The Development of a Measure to Evaluate Language Communication Skills of Young Children", Paper at Annual Meeting of American Educational Research Association, February, 1973.

Scharf, Donald, "Some Relationships between Measures of Early Language Development", Journal of Speech and Hearing Disorders, 37.1, February, 1972, pp. 64-74.

Silverman, H. and Russell, R., "The Relationships Among Three Measures of Bilingualism and Their Relationship to Achievement Test Scores", Paper at Annual Meeting of American Educational Research Association, April, 1977.

________, Noa, J., and Russell, R., Oral Language Tests for Bilingual Students: An Evaluation of Language Dominance and Proficiency Instruments, Northwest Regional Educational Laboratory, Portland, Ore., July, 1976.

Spolsky, Bernard, "Language Testing - The Problems of Validation", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 146-153.

________, Murphy, P., Holm, W., Ferrel, A., "Three Functional Tests of Oral Proficiency", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 75-87.

Sponseller, D. B., "Measuring Language Comprehension in Young Children: Does the Structure of the Testing Condition Affect Results", Paper at Annual Meeting of American Educational Research Association, April, 1977.

Swain, M., "Evaluation of Bilingual Education Programs: Some Problems and Some Solutions", Paper at Conference of Comparative International Education Society, February, 1976.

Toronto Board of Education, "Testing Some English Language Skills: Rationale, Development, and Description", Paper, March, 1969.

Upshur, J., "Objective Evaluation of Oral Proficiency in the ESOL Classroom", Papers on Language Testing 1967-1974, ed. Palmer and Spolsky, TESOL, Wash., D.C., 1975, pp. 52-65.

Valette, Rebecca, Modern Language Testing, Second Edition, Harcourt, Brace, Jovanovich Inc., New York, N.Y., 1977.

Wong-Fillmore, Lily, The Second Time Around: Cognitive and Social Strategies in Second Language Acquisition, Thesis, Ph.D., Stanford University, 1976, Chapters 5, 6.

Wright, S. M., The Effect of Speaker Visibility on the Listening Comprehension Test Scores of Intermediate Level Students of ESL, Thesis, Ph.D., Georgetown University, 1971.

APPENDICES

APPENDIX A

DESCRIPTION OF TESTS FROM A GUIDE TO ASSESSMENT INSTRUMENTS BY FLETCHER, LOCKS, REYNOLDS, AND SISSON (1978)

1977 - BASIC INVENTORY OF NATURAL LANGUAGE (BINL)

Descriptive Information:

Purpose: To assess a student's language dominance and proficiency in Spanish and English.

Score Interpretation: This instrument yields raw scores in English or Spanish which represent a student's fluency and level of language complexity. Models are provided for the development of local norms.

Grade Range: K-12 (reviewed for grades K-6)

Target Ethnic Group: General (reviewed for Cuban, Mexican-American, and Puerto Rican; see also Comprehensive Index for other language versions under development).

Administration Time: From 10-15 minutes; not timed.

Administrator Requirements: The administrator should be proficient in the languages in which the instrument is administered and should have a knowledge of simple grammar. CHECpoint Systems also sponsors a one-day training workshop for test administrators.

Author: Charles H. Herbert

Source: CHECpoint Systems, 1558 N.
Waterman Avenue, Suite C, San Bernardino, California 92404

Cost: A kit with 80 talk tiles, 40 story starter pictures, 1 spirit duplicating masters book, 1 instruction manual, 400 oral score sheets, 100 profile sheets, and 2 class profile cards costs $85.00. A tape recorder is also required.

This criterion-referenced instrument makes use of story sequence pictures and talk tiles to elicit natural speech samples. Instructions are given orally in English or Spanish. Students respond by telling a story. Taped answers are hand or machine scored. Individual administration is required.

Technical Information:

Although sample normative results have been reported for 300-400 students in the lower elementary grades in California, reviewers felt that this instrument is basically a criterion-referenced test. Reliability measures are not yet available, and the only validity measure reported indicated that the language complexity subscores tended to rise from grades K-2 in both first- and second-language growth. Reviewers suggested that this instrument be used as a diagnostic rather than an achievement test until more meaningful norms are established. The administrator's manual contains much information, although it provides no specific oral cues for the administrator in English or in Spanish. The hand scoring procedures are fully described, but reviewers found them to be quite complex.

Cultural and Linguistic Information:

Reviewers found the item content and vocabulary to be appropriate, with minor revisions, for use with Hispanic students in grades K-6. Reviewers commented that the talk tiles contained no stimulus objects which reflected Hispanic culture, and that the sequence stories had no stimulus words in Spanish. The illustrations, although quite interesting, seemed to depict minority cultural groups as country folk while showing "Anglos" in urban settings. Nevertheless, the format and procedures were found to be highly acceptable for use with Hispanic children.

1975 - BILINGUAL SYNTAX MEASURE (BSM) MEDIDA DE SINTAXIS BILINGUE

Descriptive Information:

Purpose: To measure syntactic proficiency by eliciting natural speech samples in Spanish and English.

Score Interpretation: This instrument yields hierarchical scores which place a child in 1 of 5 proficiency levels in English and Spanish. Instructional suggestions are provided for each proficiency level.

Grade Range: K-2

Target Ethnic Group: General (reviewed for Cuban, Mexican-American, and Puerto Rican; see also Italian and Tagalog entries)

Administration Time: Approximately 15 minutes; not timed.

Administrator Requirements: The administrator should be proficient in Spanish if the Hierarchical Scoring is used, and proficient in Spanish and English if the Syntax Acquisition Method is used.

Authors: Marina K. Burt, Heidi C. Dulay, and Eduardo Hernandez Chavez

Source: The Psychological Corporation, 757 Third Avenue, New York, New York 10017.

Cost: A Spanish-English kit containing a Picture Booklet, 2 manuals, 70 Child Response Booklets, 2 Class Record Sheets, and a Technical Handbook costs $50.00.

Similar, but not parallel, Spanish and English forms of this instrument measure a student's command of basic English grammatical structures regardless of his pronunciation or general knowledge. If both English and Spanish versions are administered, the test may serve as a language dominance measure. The administrator asks the student 25 questions relating to 7 pictorial stimuli.
Students respond orally; responses are recorded by the examiner and are hand scored. Individual administration is required.

Technical Information:

The authors have collected data on the use of the English version of this instrument with 1371 students, and on the use of the Spanish version with 1146 students, all of whom were in grades K-2 in 4 geographic regions of the United States. These data are illustrative and are not norms. The authors provided reliability data based on 150 Spanish-speaking students. Reviewers felt that the Kappa coefficients of .40 were low and questioned the use of factor analysis to demonstrate the validity of the BSM. The manual states that English proficiency scores on this instrument improve as a function of the students' time in the United States. However, the tabular data presented show some inversions as a function of time in the United States. There does seem to be a clear difference in performance between those students here 3 years or more and those who have been here a shorter time, and reviewers felt that the BSM could be used to measure gross differences in language proficiency. They also noted that the administrator's manual was somewhat repetitive and commented that administrators should be required to attend training sessions.

Cultural and Linguistic Information:

Hispanic: Reviewers found the directions, item content, vocabulary, format, and procedures to be culturally and linguistically appropriate for Cuban, Mexican-American, and Puerto Rican students in grades K-2. They felt that the illustrations were excellent.

1975 - LANGUAGE ASSESSMENT BATTERY (LAB) Levels I-III

Descriptive Information:

Purpose: To assess a student's reading, listening comprehension, and speaking skills in English and Spanish in order to determine language dominance.

Score Interpretation: This instrument yields stanine scores and percentile ranks by grade.

Grade Range: K-12. Level I, grades K-2; Level II, grades 3-6; Level III, grades 7-12 (reviewed for grades K-6).

Target Ethnic Group: Hispanic (reviewed for Puerto Rican)

Administration Time: Level I, from 5-10 minutes; Level II, approximately 41 minutes; timed.

Administrator Requirements: The administrator should be proficient in the language in which the test is administered and should be thoroughly familiar with the examiner's manual.

Author: Office of Educational Evaluation of the Board of Education of the City of New York.

Source: Houghton Mifflin Company, Test Department, P.O. Box 1970, Iowa City, Iowa 52240.

Cost: Each test booklet and examiner's manual costs $.34 per copy for Level I and $.49 for Level II. The technical report costs $3.25.

This norm-referenced instrument is composed of parallel English and Spanish versions. Level I contains 40 items; Level II contains 92 items. The instrument is first administered in English. The administrator then uses the Spanish version to test students who scored below a designated cutoff point. Students respond orally, by pointing, by writing in the test booklet, and by marking answer sheets (on Level II only). Individual administration is required for Level I and for part of Level II.

Technical Information:

The developers have established separate norms for the English and Spanish versions of the LAB. The English norming sample consisted of 12,532 monolingual students and the Spanish sample of 6,271 Spanish-speaking students, all enrolled in grades K-12 in New York City. Reviewers felt the norms were well developed and well reported.
The developers report reliability coefficients and standard error of measurement for all levels and subtests of both versions of this instrument. After studying the learning objectives provided for each subtest, Guide reviewers stated that the instrument had good face validity.

Cultural and Linguistic Information:

Reviewers found the vocabulary, format, and procedures for Level I to be culturally and linguistically appropriate for Puerto Rican students in grades K-2. However, they felt that the item content in the reading section was inappropriate for students in grades K-2 because many items required abstract reasoning as well as reading ability. Reviewers commented that the speaking section did not adequately test speaking since it only required the child to produce one-word responses. Reviewers felt that because of the computerized answer sheets, the timed nature of the reading tests, and the fine auditory discrimination required for the listening tests, Level II might not be appropriate for Puerto Rican students recently arrived in the United States.

1975 - LANGUAGE ASSESSMENT SCALES (LAS) English Version - Level I

Descriptive Information:

Purpose: To assess a student's listening and speaking skills in English.

Score Interpretation: This instrument yields a total converted score which is tied to 1 of 5 proficiency levels.

Grade Range: K-5

Target Ethnic Group: General (reviewed for Cuban, Mexican-American, and Puerto Rican)

Administration Time: Approximately 20 minutes; the prerecorded tape is timed.

Administrator Requirements: The administrator should have native proficiency in English.

Authors: Edward A. De Avila and Sharon E. Duncan

Source: Linguametrics Group, P.O. Box 454, Corte Madera, California 94925

Cost: This instrument is sold with a Spanish-language version, also reviewed in this Guide. A LAS examiner's kit which includes administration and scoring instructions, pictorial stimuli, a Spanish-English audio cassette, and 100 English and 100 Spanish score sheets costs $48.00. Additional score sheets are available at $5.00 per 100.

These diagnostic instruments contain 100 items designed to assess phonemic production and discrimination, lexical production, sentence comprehension, oral production skills, and a student's ability to use language to attain specific goals. Instructions are given orally, and item stimuli are either taped or pictured in the test booklet. Students respond orally or by pointing. Answers are hand scored. Individual administration is required. A Language Arts Supplement containing follow-up learning activities and language games related to each test item is available from the publisher.

Technical Information:

The authors standardized this instrument with 308 5- to 12-year-old English-speaking students and report moderately high inter-rater reliability coefficients ranging from .84 to .94. Since this instrument measures language proficiency, scores on this instrument show a low correlation with age. The reviewers gave excellent ratings to the administrator's manual because of the clarity of instructions and the ease with which directions could be delivered during a testing session.

Cultural and Linguistic Information:

Reviewers felt that the directions, item content, format, and procedures were highly appropriate for use with English-speaking Cuban, Mexican-American, and Puerto Rican students in grades K-5. They felt that some illustrations were too crowded and that others were too simplistic for fifth-grade students.
APPENDIX B

EXAMINERS' PERSONAL REACTIONS TO AND COMMENTS UPON THE TESTS AND TESTING PROCEDURE:

Examiner I: (She is a bilingual language arts specialist.) Her comments:

"In an overall evaluation all four inclusive instruments were moderately valuable in structure and content. The BINL was easily administered and afforded each individual pupil an opportunity to respond according to their particular capabilities and experiences. The pictures were motivating and useful. The LAB was the least complicated and required little motivation or stimulus. It was explicit and concise. On the contrary, the LAS, I felt, was too general, covered too many areas, and was exhaustive to both examiner and student. It created a high frustration level for most of the participants. The students became very hesitant and inhibited. Also the test was too long for one sitting. The BSM was easy to administer and the content was more motivating than any of the other instruments in comparison. The material was relevant, inspiring and well presented. Of the four tests I felt this to be the most valuable in all aspects." - Elizabeth Najarian

Examiner II: (She is a bilingual language arts specialist with the California Bilingual/Crosscultural Specialist Credential.) Her comments:

"BINL: This test was easy to give to NES pupils - but if they couldn't generate any English responses, they might have felt incompetent and/or mystified as to why they had to take this kind of test in the first place. Those pupils who could speak English to varying degrees were able to generate English responses, but frequently all that they said were repetitions of '(a/the) boy, girl' or, at a higher level, 'I see a ___,' over and over again, using different nouns to fill in the blank. Fluent English speakers, of course, could usually find things to describe with ease. (In other testing, the BINL, given in both English and Spanish, was quite helpful in determining the language dominance of several puzzling cases of seemingly 'equally bilingual' pupils.)

As for the testing conditions and scoring, the BINL test was difficult to give in regular classrooms as well as in a small room where a few other pupils were present (and served as distractors). The best and much less time-consuming testing was done in a small room with only the tester and the pupil present. Recording the test was time-consuming in the cases of poor and good English speakers alike; selected responses often had to be replayed again and again to hear the specific words which the pupils had utilized in their responses. Scoring procedures were technical, but gradually I felt more at ease in determining which words to count, etc. Most of the pupils enjoyed hearing themselves on a tape recording.

BSM: This test had colorful pictures which attracted the attention of many pupils. The test design showed that someone tried to be more humane than usual towards the test-taker; since all responses were not recorded, just selected ones, the pupils got the feeling that the tester was interested in their responses, not just in 'recording the answers' to their test. Unfortunately, if a pupil was truly NES, he/she couldn't benefit much from the test since he/she had to generate answers related to specific 'given' vocabulary. At least NES pupils could stop taking the test after a very few questions had been posed.
More fluent speakers of ESL had more success with this test at times - and most pupils liked this test the best - or second best (vs the excitement of being tape-recorded for the BINL). This test does measure a student's ability to give specific responses and sentence patterns - but limited ESL speakers might not be able to demonstrate their true ability to speak (understand) English based solely on this test. Recording students' responses was not much of a problem with this test.

LAS: This test has its good points which will help to diagnose and prescribe remediation for pupils' needs (i.e., phoneme recognition and pronunciation; vocabulary recognition; generating vocabulary - naming items; and comprehension of detailed phrases). It would be an excellent test to give to a FES or PES level language speaker. But for NES and LES pupils, this is a difficult, time-consuming test which, to get any reasonably 'true' assessment, must be done in an isolated area, i.e., the tester and the pupil need to be alone together in a quiet room where there is no distraction from people or from noise - hardly what one might achieve in normal classrooms or even small group testing environments. The test is thorough, ending with the pupils having to listen to a rather strange story and, while looking at somewhat related pictures, retell the story . . . an impossibility for all the NES and many LES pupils. The testing procedure is awkward, switching from a tape to a booklet to a tape to the booklet with spiral binding, with pages that have to be quickly flipped, then turned over, during a taped portion of the test . . . at the same time during which the tester is to be marking down the pupil's responses . . . it's a mess! And the recording procedure is also misleading should a pupil see his/her score sheet: only incorrect items are marked, thus a pupil may get positive feedback from the incorrect answers . . . ! Ugh! Don't choose this test for any large scale testing to determine pupil dominance.

LAB: This test, especially LAB I, was easy to give and seemed to have some items which many pupils could answer, be they NES, LES, or FES. The vocabulary and test content was much more 'school-related', and closer to the vocabulary utilized with beginning ESL pupils, than any of the other tests which were used in this study. This possibly gave the pupils more confidence - they certainly seemed to do better on this test than on the others. The testing procedure and the order of the test items was a bit awkward at times (the procedures change after 3-5 questions per page), but I liked the fact that all responses were marked either 'yes' or 'no', so that any response was counted (as was no response). The computer scoring sheet had the test questions numbered in a way which was confusing to me at times - but this could easily be remedied. I'd choose the LAB for determining language dominance if it were paired with a BINL-like test of a pupil's ability to freely generate phrases in English and in the other language (Spanish)." - B. Rheingold Gerlicke

Examiner III: (She is a student in bilingual education at California State University, Northridge, recommended by a member of my committee.) She comments:

"When I first began with each child I first spoke to them about some activity or other unrelated to the test. This helped the child warm up to me. I also would explain that it wasn't really a test in the sense that he would pass or fail; that it would just show where we, as teachers, would
That it would just show where we, as teachers, would 77 find out where we could help them a little more. This helped ease their curiosity or any fear they might have in failing. Because of this, I felt that each child did his best. In administering the four tests all at once I was able to see which I felt was more effective in determining the fluency of a child in the English language. I as also able to see the advantages and disadvantages of each. The BSM test was the one I felt less comfortable with because I felt the children were uncomfortable with it. They enjoy the colorful pictures but it is very limited and does not measure the child's fluency very accurately. A child may be very fluent but make grammatical errors. Also each question asked for a specific answer. The children were more interested in what I was writing. The LAB was a short quick test but also very limited. questions. Many children could not answer the If they would have been rephrased, I'm sure there would have been more of a response from the child. the BINL. I really felt the most comfortable with I could feel the children more at ease. I think they felt this way because I wasn't writing away. They soon forgot that we were using a tape 78 ~ecorder. The tape recorder made a big difference to me as the administrator of the test. I felt like I was having more of a conversation than a test with each child. I think the children felt this also. One big advantage to this test is that there are no limitations: the child uses the words he knows, and can also use his imagination if he wants to tell a story. For fluency I feel this is the best test. But having worked in a classroom, I found I liked the LAS very much. I feel this isn't an appropriate test for determining fluency, but more of a test on phonics, discrimination, auditory and memory skills or abilities. I would use this as a teaching aid. Each item of each section is numbered so the teacher can give supplemental work where ever the child may need it. The only disadvantage I found in using this test was that frequently the child could not understand the voice on the tape. In giving this test, I would prefer the BINL test and the LAS test to be the most effective. I would use both together; one to measure fluency and the other to find out in exactly what areas the child needs help, and use the worksheets as follow-up or teaching aids. I also feel I should add that I strongly feel that the person administering the test makes a big difference 79 in the outcome of each test." - Elena Romero 80 APPENDIX C Inasmuch as each of these tests was designed independently and the scoring systems were not necessarily intended to assign students to the four proficiency categories established by the Los Angeles Unified School District under the Lau Plan, i.e., Non-English Speaking (NES), Limited-English Speaking (LES), Functional-English Speaking (FES), and Proficient-English Speaking (PES). For the purposes of this study~ it was necessary to interpret the raw scores according to these categories. Even though there were cut-off points established by the test designers which logically divided the proficiency scores into 4 or 5 points along a continuum, the test designers did not, in all instances, interpret these categories as NES, LES, FES, or PES. We have taken the liberty of doing so for purposes of comparison. 
BINL SCORE RANGES

Grades    NES       LES          FES           PES
K-2       0-24      24.1- 52     52.1- 78      78.1-200
3-6       0-24      24.1- 78     78.1-101     101.1-200
7-8       0-24      24.1- 78     78.1-101     101.1-200
9-12      0-24      24.1-101    101.1-130     130.1-200

BSM SCORING RANGE
Syntax Acquisition Index (SAI)

Level 1   Monolingual Spanish       No response
Level 2   Spanish Dominant          Responds in Spanish
Level 3   Survival Level            SAI: 46-84
Level 4   Intermediate              SAI: 85-94
Level 5   Proficient                SAI: 95-100

LAS SCORING RANGE

Level 1   Minimal Production        54 and below
Level 2   Fragmented Production     55-64
Level 3   Labored Production        65-74
Level 4   Near Perfect Production   75-84
Level 5   Perfect Production        85-100
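As an illustration only, the BINL cut-off points tabulated above can be written as a small classification routine. The grade bands and score ranges come directly from the table (grades 3-6 and 7-8 are combined because their ranges are identical); the function itself is hypothetical and was not part of the original study, which assigned categories by hand.

    # A minimal sketch, assuming the BINL cut-off points tabulated above.
    # The function itself is hypothetical, not part of the original study.
    def binl_category(score, grade):
        """Map a BINL raw score (0-200) and grade (K = 0) to a category."""
        if grade <= 2:
            cuts = (24, 52, 78)     # grades K-2
        elif grade <= 8:
            cuts = (24, 78, 101)    # grades 3-8 (identical ranges)
        else:
            cuts = (24, 101, 130)   # grades 9-12
        nes_top, les_top, fes_top = cuts
        if score <= nes_top:
            return "NES"
        if score <= les_top:
            return "LES"
        if score <= fes_top:
            return "FES"
        return "PES"

    print(binl_category(60, 1))   # FES under the K-2 ranges
    print(binl_category(60, 4))   # LES under the 3-6 ranges

The same pattern extends to the LAS table (a converted score of 65-74, for example, falls in Level 3, Labored Production), while the BSM levels depend on the Syntax Acquisition Index and on whether the child responded in English at all.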