Running head: Behavioral and Emotional Rating Scale 1

Description, Evaluation, and Critique of the Behavioral and Emotional Rating Scale: A Strength-Based Approach to Assessment—Second Edition (BERS-2)
NAME
Chapman University
Description, Evaluation, and Critique of the Behavioral and Emotional Rating Scale: A Strength-Based Approach to Assessment—Second Edition (BERS-2)
Description of the Test
The Behavioral and Emotional Rating Scale: A Strength-Based Approach to
Assessment—Second Edition (BERS-2) is authored by Michael H. Epstein, Ed.D., and it was
published by PRO-ED, Inc. in 2004 (the first edition was published in 1998). The BERS-2 is a
strength-based assessment that provides professionals with information about the emotional and
behavioral strengths of children. Buckley and Epstein
(2004) state that standardized resources to formally explore students’ strengths are limited, and
because assessments often focus on the deficits of a child, the BERS-2 is a useful approach to
assessment that aims to explore what many other assessments do not. With the information
collected, the BERS-2 is meant to create a positive context in which professionals can design
individualized mental health and/or educational services and monitor an individual’s progress
with, and the outcomes of, such services.
Content and Use
Practical Information and Features
The BERS-2 consists of three main components: the Examiner’s Manual, the
three rating scales—Teacher Rating Scale (TRS), Parent Rating Scale (PRS), and Youth Rating
Scale (YRS)—and a Summary Form. Although all three rating scales are to be used with most
children, they may not all be applicable for certain individuals such as a student who is not
attending school and would not have a TRS completed. Within the three rating scales are five
core subscales: interpersonal strength, family involvement, intrapersonal strength, school
functioning, and affective strength. In addition, the PRS and YRS include a supplemental career
strength subscale.
After individuals complete their respective scales, they are then asked to complete
numerous “key questions” which can be found in each of the three scales. These key questions
are open-ended in nature, and they are meant to gather information regarding the child’s
strengths, preferences, and personal and family resources. An example of a key question is “this
student’s favorite hobbies or activities are…” Aside from the three rating scales and the key
questions, the BERS-2 also includes a summary form to be used when more than one rating scale
is completed, which often is the case. The summary form allows the examiner to aggregate the
data, plot scores from the rating scales, and compare these scores.
Characteristics of the Test
The TRS consists of 52 items while the PRS and YRS consist of 57 items due to the
inclusion of the supplemental career strength subscale in these two rating scales. All items are
answered on a Likert scale ranging from zero to three (0 = not at all like the
student, 1 = not much like the student, 2 = like the student, 3 = very much like the student). The
anchors are clear and the response format is consistent across the rating scales. The items on
each scale are generally similar with differences made depending on who is answering the items.
For example, on the TRS, item three states “accepts a hug,” while on the YRS, item three states
“it’s okay when people hug me.” The individual scales can be completed in a single session with
each generally completed within 10 minutes. The TRS can be completed by a teacher (or
teachers), or by a nonteaching individual who has frequent contact with the child such as a
counselor or school psychologist. The PRS should be completed by the child’s primary
caregivers, and the YRS should be completed by the child if he or she is between the ages of 11
and 18 years, reads at or above the fifth-grade level, and has no significant developmental delays.
Although the three scales individually contain between 52 and 57 items, each of the five
subscales comprises considerably fewer items (for example, 15 items for the interpersonal
strength subscale and seven items for the affective strength subscale). In addition, numerous
statements on the rating scales can be described as vague and ambiguous. For example, different
individuals may interpret statements such as “interacts positively with parents” dissimilarly.
Having so few items to measure each of the five subscales also may skew the results. Rater
errors such as “halo effects” and “pitchfork effects” also may occur in such a rating scale format.
These features are important considerations as they can affect the results significantly. Because
the BERS-2 relies heavily on these rating scales, examiners may want to follow up with raters
regarding their answers before making any interpretations and/or conclusions.
Aside from these potential issues, the use of information from multiple informants and
the use of a Likert-type scale with numerous options to choose from provide two significant
strengths to the BERS-2. Moreover, the test is user-friendly and easy-to-read. Each scale is
presented well with explicit directions, easy-to-understand statements with corresponding rating
options, and clear key questions to answer. It should be noted, however, that more than two
omitted or multiply-marked items per subscale render that subscale unscorable. This may pose a
problem since the scales have either 52 or 57 items and individuals may accidentally miss some
items. Since only two mistakes per subscale are permitted, the examiner should carefully explain
such an issue to the raters and/or carefully watch the raters as they complete the scales.
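The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure printed in the examiner’s manual: the function name and the use of None to mark an omitted or multiply-marked item are assumptions made for the example, and how the manual treats a subscale with one or two invalid items (for instance, whether scores are prorated) is not addressed in this paper, so the sketch simply sums the valid ratings.

```python
from typing import Optional, Sequence

# Rule stated in the text: more than two omitted or multiply-marked
# items on a subscale make that subscale unscorable.
MAX_INVALID_ITEMS = 2


def subscale_raw_score(ratings: Sequence[Optional[int]]) -> Optional[int]:
    """Sum the 0-3 ratings of one subscale, or return None if unscorable.

    An entry of None stands for an omitted or multiply-marked item
    (a convention assumed for this sketch).
    """
    invalid = sum(1 for r in ratings if r is None)
    if invalid > MAX_INVALID_ITEMS:
        return None  # subscale cannot be scored

    valid = [r for r in ratings if r is not None]
    if any(r not in (0, 1, 2, 3) for r in valid):
        raise ValueError("each rating must be 0, 1, 2, or 3")

    # Assumption: remaining valid ratings are summed without prorating.
    return sum(valid)


# A hypothetical seven-item subscale with one omitted item.
print(subscale_raw_score([3, 2, None, 1, 0, 3, 2]))
# The same subscale with three omitted items is unscorable.
print(subscale_raw_score([3, None, None, None, 1, 2, 0]))
```

A check of this kind could be run as each form is returned, so that a rater can be asked to complete missed items before leaving.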
The BERS-2 is to be administered, scored, and interpreted by an individual who
understands the theoretical foundations (e.g., strength-based assessment) and the construction
and statistical characteristics of the BERS-2. Furthermore, the examiner should be experienced in
administering the BERS-2 and be familiar with how to interpret the results of the normative scale
data. The examiner’s manual recommends that the examiner conduct at least three practice
administration and scoring sessions. Despite these somewhat rigorous standards for
administering the BERS-2, many different professionals (e.g., classroom teachers, psychologists)
with sufficient knowledge and training may give the assessment. Those completing the rating
scales must be familiar with the child’s behavior in the respective environment, and
they should have had regular, daily contact with the child for at least a few months.
Characteristics of the Test Manual
While the BERS-2 may have some potential problems, it is a user-friendly and well-organized measure. A description of and uses for the BERS-2 are presented and clearly laid out
in chapter one of the examiner’s manual. Reasons for and advantages of strength-based
assessment also are discussed in chapter one, allowing the reader to obtain valuable knowledge
regarding the BERS-2 before giving the assessment. Following this helpful chapter, the
subsequent chapters are nicely presented. The test manual flows from administration and scoring
procedures to interpreting results, normative information, test reliability, and test validity with
subsections within the chapters. Each chapter builds upon the previous one and contains
pertinent information for the examiner. Moreover, the tables and figures within the test manual
are useful and clearly organized. Lastly, the test manual is an appropriate length at 87 pages
with four additional short appendices containing normative scores information. Based on these
aspects, the test manual can be described as well-designed and easily accessible for professionals
who are administering the BERS-2.
Standardization Sample and Norms
The BERS-2 developers provide a total of four sets of normative data including two sets
for the TRS, one set for the PRS, and one set for the YRS. For the two TRS normative sets, the
first set comprises 2,176 students not identified with emotional or behavioral disorders (the
NEBD sample) while the second set comprises 861 children who were identified with
emotional or behavioral disorders (the EBD sample). Both data sets were gathered for use with
the original BERS from February to November 1996 in the same 31 states, and the children in
these samples ranged in age from 5-18 years. Although the original BERS TRS data may have
been acceptable, developers should have considered obtaining new TRS data for the BERS-2.
The PRS and YRS data were collected from the fall of 2001 through the spring of 2002
with students between the ages of 5 and 18 for the PRS and between 11 and 18 for the YRS since
students below the age of 11 are not to complete the YRS. In addition, the PRS sample consisted
of 927 students in 34 states while the YRS sample was comprised of 1,301 students in 31 states.
Despite having participants from a large number of states, participants from each of the 50 states
would have been ideal. Further, it is unclear whether some of the same students or entirely
different students were included in the three rating scales. Lastly, it should be noted that many of
the normative samples had a sample size that was lower than the acceptable minimum number
per age group. In the TRS EBD sample, for example, only ages 14-16 had over 100 students.
Data were coordinated and collected for these norms by trained educators throughout the
United States. BERS-2 developers selected these trained educators through three methods: (1) by
contacting professionals in the field, (2) by searching customer files to locate professionals who
had purchased the original BERS (or similar products) within the past two years, and (3) by
asking professionals who had collected data in previous norming projects. During this process,
developers also considered the prospective data coordinators’ access to the children based on
their geographic region, Hispanic ethnicity, and race. The selected coordinators were trained in
administration of the BERS-2 and collected data from both parents and students. Additional
information is not given about these trained educators such as their sex, ethnicity, education
level, or occupation, which makes it difficult to consider any possible biases or issues with data
collection. Furthermore, because a convenience sample was used, random sampling did not
occur which also may introduce bias in the sample.
A breakdown of characteristics and the percentage of the sample with each characteristic
for the four normative data sets are included as helpful tables in this chapter. The tables include
the following characteristics—geographic area, gender, race, Hispanic ethnicity, family income,
educational attainment of parents, disability status, and age—with a comparison between the
percentage of the sample with each characteristic versus the percentage of individuals with that
characteristic in the U.S. population. The samples’ characteristics were compared with those for
the school-age population from the U.S. Bureau of the Census 2001. Based on these
comparisons, BERS-2 developers state that the samples were representative of the general
school-age population.
Demographic information was later stratified by age for the TRS NEBD sample to further
show representativeness. Interestingly, however, only four of the original eight demographic
characteristics were stratified during this process: geographic region, gender, race,
and ethnicity. In addition, the TRS NEBD sample was the only data set stratified.
Although this stratification did “conform to national expectations at each age included in the
test’s norms,” it would have been interesting and helpful to see data stratified for all four data
sets (Epstein, 2004, p. 40).
Following stratification, developers weighted the samples for the PRS and the YRS on
three sampling variables (race, Hispanic ethnicity, and geographic region) during later normative
development. Weighting data for the TRS (collected in 1995 and 1996) was deemed unnecessary
as these data were already representative of the U.S. school-age population. It is interesting that
developers chose to weight the data as the unweighted sample data already closely matched the
U.S. population; however, developers state that weighted samples were used to achieve a more
“proportional” distribution to the U.S. school-age population.
The PRS and YRS sample data were weighted in a two-step process. An overall weight
was first determined for each race, Hispanic ethnicity, and geographic region by dividing the
variable’s percentage in the U.S. population by the percentage in the sample. In the second step,
“a weight was obtained for each individual in the sample by multiplying the weight for his or her
race by the weight for his or her geographic region by the weight for his or her ethnicity”
(Epstein, 2004, p. 42). Based on these weighted normative samples, BERS-2 norms were
derived.
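The two-step weighting procedure can be illustrated with a short Python sketch. The percentages below are invented for the example and are not the census or sample values reported in the manual; the function and variable names are likewise hypothetical.

```python
def category_weights(population_pct, sample_pct):
    """Step 1: weight for each category = population % / sample %."""
    return {name: population_pct[name] / sample_pct[name]
            for name in population_pct}


# Hypothetical percentages (NOT the values in the BERS-2 manual).
race_w = category_weights(
    {"White": 80.0, "Black": 15.0, "Other": 5.0},   # U.S. population
    {"White": 64.0, "Black": 30.0, "Other": 6.0},   # normative sample
)
region_w = category_weights(
    {"Northeast": 20.0, "Midwest": 25.0, "South": 35.0, "West": 20.0},
    {"Northeast": 25.0, "Midwest": 25.0, "South": 28.0, "West": 22.0},
)
ethnicity_w = category_weights(
    {"Hispanic": 15.0, "Non-Hispanic": 85.0},
    {"Hispanic": 10.0, "Non-Hispanic": 90.0},
)


def individual_weight(race, region, ethnicity):
    """Step 2: multiply the individual's race, region, and ethnicity weights."""
    return race_w[race] * region_w[region] * ethnicity_w[ethnicity]


# A White, Hispanic student from the Northeast in this made-up sample:
# underrepresented on race and ethnicity, overrepresented on region.
print(individual_weight("White", "Northeast", "Hispanic"))
```

In this scheme, a weight above 1 means the individual’s combination of characteristics is underrepresented in the sample relative to the population, and a weight below 1 means it is overrepresented.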
Scores and Interpretation
Chapter three of the test manual provides clear, detailed, and helpful information
regarding the interpretation of BERS-2 results. The chapter is divided into five main sections
with information including instructions for completing the BERS-2 summary form and the three
rating scales, a discussion of the various kinds of scores and what they mean, useful information
to understand before making a diagnostic decision, information for interpreting deviant scores
and finally, suggestions for professionals when sharing a child’s results on the BERS-2.
Although all the information in this chapter is beneficial, this section will largely focus on a
discussion of the scores used for the purpose of this paper.
The summary form of the BERS-2 is to be used when more than one rating scale is
completed, which likely will be the case with most students. Developers include a helpful
example of what a completed summary form may look like using a hypothetical student. This
example is used throughout the chapter as a consistent written and visual example for readers
when discussing information. On this summary form, examiners record a student’s raw scores
from the TRS, PRS, and YRS and three kinds of normative scores: his or her percentile ranks,
scaled scores (also known as standard scores), and the BERS-2 Strength Index (the composite
score). In the subsection titled “Types of Test Scores,” developers provide clear and concise
information regarding these three types of scores.
First, a student’s raw scores are described as the sum of the child’s ratings for the items
of each subscale. Developers state that it is pertinent to double-check raw score results as they
are the basis for calculating standard scores. Second, a short yet critical discussion of percentile
ranks is presented. The normative tables for converting the subscale raw scores into percentile
ranks can be found in Appendix A of the test manual. Percentile ranks are described as
simple and easily understood as they provide a ranking from one to 100, but developers are
quick to acknowledge the weaknesses of percentile ranks. For example, it is
stated that percentile ranks are not interval data and “equal differences in percentile ranks do not
represent equal differences in the attributes or behaviors being measured” (Epstein, 2004, p. 26).
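The point that percentile ranks are not interval data can be made concrete with a small Python sketch. It assumes, purely for illustration, that the underlying attribute is normally distributed; the manual’s own norms are table-based and are not reproduced here.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def z_for_percentile(percentile_rank):
    """Return the z-score at a given percentile rank (0 < rank < 100)."""
    return standard_normal.inv_cdf(percentile_rank / 100.0)


# Two gaps of ten percentile points each.
gap_middle = z_for_percentile(60) - z_for_percentile(50)
gap_tail = z_for_percentile(99) - z_for_percentile(89)

print(f"50th to 60th percentile: {gap_middle:.2f} SD units")
print(f"89th to 99th percentile: {gap_tail:.2f} SD units")
```

Under this assumption, the ten-point gap near the tail of the distribution spans several times more standard deviation units than the ten-point gap near the middle, which is why equal differences in percentile rank cannot be read as equal differences in the behavior being measured.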
Third, subscale scaled scores are discussed. Like percentile ranks, information for
converting raw scores into scaled scores can be found in Appendix A. Scaled scores are
described as more useful and more valuable than percentile ranks since they are standard
deviation units, can be compared to the average rating of the normative sample, and can be
compared to other standard scores. Thorough and useful information regarding scaled scores is
discussed and it is evident that the developers advocate for the use of standard scores over
percentile ranks. Despite their usefulness, readers and practitioners should be aware during
interpretation that the BERS-2 gathers ordinal data (i.e., ratings on a Likert-type scale),
which are then treated as interval data (the scaled scores). Furthermore, developers did not
include the standard error of measurement (SEM) or the confidence intervals (CIs) for the scaled
scores, which are needed for an accurate interpretation.
Last in this subsection is an explanation of the BERS-2 Strength Index. It is a type of
standard score derived by summing the scaled scores of the subscales and converting this sum to
a standard score; the table needed for this conversion is found in Appendix C. This score is
described as the most reliable and the most useful of all the scores as it provides an overall rating
of the child’s strengths. Additionally, this score also can be compared with other tests’ standard
scores, and developers provide an easy-to-understand explanation for these types of
comparisons. Following a discussion of these scores, developers move to additional valuable
discussions. For example, it is mentioned that tests and/or standard scores alone do not diagnose
anything, disagreements among the raters should be identified and discussed, and additional
pertinent evidence from other tests, interviews, and observations should be considered before
coming to any conclusions. Ultimately, this chapter was thorough and well presented. It provides
a strong foundation for interpreting test results, and it does not appear to overstate the importance
or use of certain scores.
Psychometric Properties
Reliability
The BERS-2 test reliability was assessed in three ways: content sampling, time sampling,
and interrater reliability. Developers state that “error associated with content sampling reflects
the degree of homogeneity among items within a test or subscale,” and items that are more
related to one another contain smaller error (Epstein, 2004, p. 49). Content sampling is a type of
internal consistency estimate, and it was investigated using Cronbach’s coefficient alpha method.
This method is appropriately used as Cronbach’s alpha is meant for a multiple response format
like the BERS-2 rating scales. Coefficient alphas were calculated for the five subscales and the
composite (Strength Index) using the four normative sample data sets and for all age groups
within the samples. In addition, Guilford’s formula was used to derive coefficient alphas for the
composite as this is the formula’s purpose. Lastly, the z-transformation technique was used to
average the coefficients.
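For readers less familiar with these statistics, the following Python sketch shows how coefficient alpha is computed and how coefficients can be averaged through a z-transformation. It is an illustration only: the ratings are invented, the averaging is assumed to be an unweighted mean using Fisher’s r-to-z transformation (the manual’s exact procedure is not detailed in this paper), and Guilford’s formula for the composite is not shown.

```python
import math
from statistics import variance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of respondents' item ratings.

    item_scores: list of rows, one per rated child, each holding k ratings.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(item_scores[0])
    columns = list(zip(*item_scores))            # one tuple per item
    item_variances = [variance(col) for col in columns]
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def average_coefficients(coefficients):
    """Average coefficients via Fisher's z (assumed unweighted)."""
    z_values = [math.atanh(r) for r in coefficients]
    return math.tanh(sum(z_values) / len(z_values))


# Invented 0-3 ratings for five children on a three-item subscale.
ratings = [
    [3, 3, 2],
    [2, 2, 2],
    [1, 1, 0],
    [0, 1, 0],
    [3, 2, 3],
]
print(round(cronbach_alpha(ratings), 3))

# Averaging alphas across hypothetical age groups.
print(round(average_coefficients([0.79, 0.84, 0.88]), 3))
```

Because the z-transformation stretches values near 1, an average computed this way sits slightly above the simple arithmetic mean of the same coefficients.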
Coefficient alphas for the five subscales and composite scores within the four normative
groups all fall within the .70 to high .90 range, and the average coefficients equal or exceed .80
with the exception of the Career Strength subscale on the YRS (.79). Further, coefficients for
each of the composite scores and the average composite scores equal or exceed .95. Although the
alphas for the averages are acceptable in general, developers do not address the individual
subscale alphas, which show some alphas within the .70 range. Because values in that range are
considered low for clusters, practitioners should be aware of them. Despite this, alphas for the
composites at each age and on each of the rating scales are highly acceptable, which suggests that
the overall score a student receives is reliable enough for educational and clinical decisions.
In addition to these coefficient alphas, developers are forthright in explaining that
although an assessment may be reliable for a general population, this cannot be assumed for each
subgroup within the population. Because of this, developers provide coefficient alphas for
subgroups that represent a “broad spectrum of ‘mainstream’ and ‘minority’ populations”
(Epstein, 2004, p. 53). Alphas are reported for six subgroups within the normative sample (male,
female, White, Black, Hispanic, and students with emotional disturbance) for the five subscales
and the Strength Index of each rating scale. The number and ages of students in each subgroup
are provided and shown to be acceptable. In general, coefficient alphas equal or exceed .80 for
the subscales, although some alphas fall within the .70 range. All alphas for the Strength Index,
however, equal or exceed .95, and one can conclude that the BERS-2 is equally reliable for these
subgroups.
Following content sampling, developers discuss the standard error of measurement
(SEM), which is used to estimate the confidence interval for a test score and the average amount
of error around a measure. BERS-2 developers provide SEMs for all age groups in the four
normative samples in a series of helpful tables, and each table reflects SEMs below three. Due to
these small SEMs, it is assumed that one can be more confident in a child’s results. Despite the
valuable discussion and inclusion of SEMs, developers also should have provided explicit
information regarding confidence intervals (CIs), a clearer formula for CIs, and possibly the CIs
themselves for an easier and more precise interpretation.
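In the absence of printed CIs, an examiner could compute them from the classical formula, as in the Python sketch below. The values are illustrative assumptions rather than figures from the manual: a standard deviation of 15 is assumed for the Strength Index scale, the reliability of .97 is hypothetical, and the interval is placed around the observed score rather than an estimated true score.

```python
import math
from statistics import NormalDist


def standard_error_of_measurement(sd, reliability):
    """Classical formula: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(score, sem, level=0.95):
    """Symmetric interval around an observed score: score +/- z * SEM."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return score - z * sem, score + z * sem


# Assumed values for illustration only.
sem = standard_error_of_measurement(sd=15.0, reliability=0.97)
low, high = confidence_interval(score=100.0, sem=sem, level=0.95)

print(f"SEM = {sem:.2f}")
print(f"95% CI = {low:.1f} to {high:.1f}")
```

Reporting a score as a band of this kind, rather than as a single number, makes the measurement error visible to parents and teachers when results are shared.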
Time sampling error was the next source of error discussed, and developers used the test-retest method to investigate temporal stability. Time sampling error is accurately described and
developers provide rationale for the importance of testing for such error. Six studies were
conducted to check for stability over time, and detailed demographic characteristics for the six
studies are provided. Although it is commendable that six studies were conducted, there is a
limited age range for each study, a limited geographic area, and a small number of participants
with the exception of study two. In addition, only one study was conducted using EBD students,
and students aged 10 and 13 were not included in any of the studies.
Despite these limitations, coefficients for each test-retest group showed strong results
overall. Coefficients for study one and studies four, five, and six all exceed .80. Because
guidelines for the test-retest method with a one- to two-week interval state that coefficients should
be .80 or above, these results are considered acceptable. For studies two and three with a six-month interval between ratings, coefficients generally exceed .60. Although practitioners should
be aware of the two lower coefficients at .53, these results also can be described as acceptable.
Ultimately, practitioners can be confident that the BERS-2 is temporally stable.
Lastly, interrater reliability was investigated, and developers provide a brief discussion of
interrater reliability and its potential problems. Three separate studies were used to investigate
interrater reliability, and demographic characteristics for each study are presented in a table. In
study one, nine pairs of special education teachers rated 96 students ages 14 to 18 with serious
emotional disturbance (SED). In study two, teachers and parents rated 20 students ages 7 to 16
with SED who were receiving special education services. Lastly, in study three, parent and
student ratings were assessed for 296 students from the normative sample ages 10 to 18.
Numerous limitations can be noted for each of these studies. Aside from the limited age ranges
and limited geographic area for each study, it is unclear why teacher-teacher ratings and teacher-parent ratings in the general population are not analyzed. Moreover, parent-parent ratings are
excluded entirely. In addition, the teacher-parent study with SED students (study two) consists
only of 20 students. Moreover, race and ethnic group data were not collected for study one and
study two. Lastly, helpful demographic characteristics for the teachers and parents are not
provided. One should consider such limitations when interpreting the results.
For study one (teacher-teacher ratings), coefficients are large with each exceeding .80.
For study two (parent-teacher ratings), most coefficients fall between .54 and .67, although the
intrapersonal strength subscale had a coefficient of .20. Developers state that this small
coefficient is “not unexpected because internalizing types of behavior problems typically
demonstrate smaller relationships” (Epstein, 2004, pp. 59-60). Aside from this, the coefficients in
study two are acceptable. For study three (parent-youth), all coefficients equal or exceed .50.
These can be described as generally strong results. Ultimately, practitioners should keep in mind
the numerous limitations of the studies; however, a summary of reliability results is discussed
and valuable tables are provided, portraying the BERS-2 as an overall reliable measure.
Validity
BERS-2 developers provide an extensive discussion of test validity in chapter six of the
test manual. Three types of validity are examined: content validity, criterion-related validity, and
construct validity. Content validity is stated as an analysis of the test content to ensure the
material is representative of the behavior(s) to be measured. Developers employed three methods
to evaluate the BERS-2 content validity. First, a detailed and statistically supported rationale is
provided for the content and format of the original BERS subscales plus the BERS-2 Career
Strength subscale. Item discrimination and factor analyses were two additional methods used
during this process.
This discussion describes the thorough process developers used to create the 68 items and
the five subscales of the BERS-2. Despite this process, developers fail to provide any important
demographic characteristics for the professionals they contacted at multiple time points to review
the BERS-2 items aside from the sample size of these professionals (sample sizes ranged from
250 to 400). Furthermore, for the item discrimination method, the sample sizes for the two
groups of students (with and without SED) were acceptable, but only 26 teachers were asked
to rate these students. In addition, developers do not provide the actual data/numbers when
claiming that the ANOVAs were significant for each item and could accurately discriminate
between students with and without SED. Aside from this lack of information, the rationale given
provides strong evidence for content validity.
Following this, developers use classical item analysis procedures—a method for choosing
test content—to explore content validity. Classical item analysis is used to assess features such
as item discrimination. In the step mentioned above, item discrimination was used more as a tool
to select and omit items that were either useful or not useful in discrimination. In this section,
item discrimination was assessed to identify whether the BERS-2 items could actually
discriminate between students. For this method, item analysis using the normative samples as
subjects was conducted for all age levels within the three rating scales. Median item coefficients
for the three rating scales indicate strong content validity as all coefficients greatly exceed .35—
the value identified by Epstein as acceptable for item discrimination indexes (as cited in Ebel,
1972; as cited in Pyrczak, 1973). Uhing, Mooney, and Ryser (2005) conducted a separate study
to assess differences in scores for students with and without EBD across the YRS and PRS. In
their two studies to analyze BERS-2 discriminative ability, t-tests and effect sizes similarly show
the BERS-2 to be valid for differentiating between populations.
Lastly, differential functioning analyses to check for content validity are used to support a
lack of bias in BERS-2 test items. The logistic regression procedure was used to detect
differential item functioning (DIF). A significance level of .001 was selected and three of the
four normative samples were used in the analyses (the NEBD sample was used for the TRS and
the EBD sample was not used). Comparisons were made between two dichotomous groups—
Black versus other and Hispanic versus other—for the five subscales in each of the rating scales
and in total, 332 comparisons were made. Sixteen of these comparisons were found to be
statistically significant (i.e., containing bias) with moderate to large effect sizes, but ultimately
developers claim that fewer than five percent of the comparisons were significant, which shows
little bias in the BERS-2. Although this may be true, it would have been helpful for developers to
provide more rationale for the conclusion of little bias. Moreover, comparisons between more
than these two dichotomous groups such as between SES levels also would have been a helpful
addition. Between these three in-depth methods, however, one can say that the BERS-2 contains
appropriate content validity.
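The arithmetic behind the developers’ claim can be checked with a brief Python sketch. The counts and significance level come from the description above; the chance-level figure rests on the simplifying assumption, made here for illustration, that the 332 comparisons are independent.

```python
# Figures reported in the text above.
total_comparisons = 332
significant_comparisons = 16
significance_level = 0.001

# Share of comparisons flagged for differential item functioning.
proportion_significant = significant_comparisons / total_comparisons

# Number of comparisons expected to reach significance by chance alone,
# assuming (as a simplification) that the comparisons are independent.
expected_by_chance = total_comparisons * significance_level

print(f"Proportion significant: {proportion_significant:.1%}")
print(f"Expected by chance at p < .001: {expected_by_chance:.2f}")
```

The first figure confirms that fewer than five percent of comparisons were significant. The second offers one possible benchmark for judging whether that share is small, which is the kind of additional rationale the manual could have supplied.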
Criterion-related validity is the next type of validity examined in this chapter. Criterion-related validity can either be concurrent or predictive. For this measure, developers used
concurrent validation in their analyses as they stated that this was the more appropriate type for
behavioral assessments such as the BERS-2. Concurrent validity was analyzed in the three rating
scales separately. For the TRS, six studies were conducted to evaluate this validity. A detailed
table with each study’s demographic characteristics is provided, and sample sizes range from 91
to 382 participants. Despite the acceptable sample sizes and relatively good age ranges, the
geographic area in each study is limited once again. Further, only one study was used with
students with SED and one study with students with EBD. The BERS-2 was correlated with six
measures including the Systematic Screening for Behavior Disorders and the Social Skills Rating
System. The correlations are displayed in another detailed table showing moderate correlations
ranging from .44 to .62. These correlations indicate adequate validity for the TRS.
For the PRS, two studies were conducted to evaluate concurrent validity. Both of these
studies had smaller sample sizes (n=55; n=85), virtually no students identifying as Black,
Hispanic, or “other,” no students between the ages of 14 and 18, and no students in special
education or with an EBD or SED diagnosis. The BERS-2 was correlated with the Child
Behavior Checklist and the Social Skills Rating System. Results show correlations ranging
between .53 and .67. Although these are large correlations, there are many limitations for these
two studies and results should be interpreted cautiously. More studies with more representative
samples should have been used, as they would have made the concurrent validity for the PRS
stronger.
The final rating scale analyzed for concurrent validity was the YRS. As with the PRS,
concurrent validity was evaluated with two studies. Age ranges are especially restricted in these
two studies (13-14 and 12-13) since the YRS is to be completed with students ages 11-18, and
the sample sizes are small (n=42; n=49). Again, students identifying as White are predominant
in these studies. The YRS was correlated with the Youth Self-Report and the Social Skills Rating
System, and correlations range widely, from .33 to .64; nevertheless, they indicate
adequate validity for the YRS.
Construct validity, used to assess the overall validity of a measure, was the third and final
type of validity examined for the BERS-2. Developers describe construct validity as “the degree
to which underlying traits of a test can be identified and the extent to which these traits reflect
the theoretical model on which the test is based” (Epstein, 2004, p. 78). A three-step procedure
was used in this process: (1) numerous constructs thought to account for test performance were
identified, (2) hypotheses were generated based on these constructs, and (3) the hypotheses were
verified. Additionally, group differentiation, subscale interrelationships, and factor analysis are
explored as part of this process.
Group differentiation, or analyzing the performance of different groups of people on an
instrument, is one way to test for construct validity. Two validity studies were conducted to
assess group differentiation. In the first, the same professionals who helped to norm the BERS-2
rated participants. Participants included the total sample used to norm the BERS-2 rating scales
and six subgroups (male, female, White, Black, Hispanic, and students with EBD). It is unclear
how the participants in the six subgroups were obtained. Ultimately, the mean standard scores of
the total sample and of the six subgroups support the construct validity of the BERS-2: mean
scores were in the average range for the demographic subgroups and below average for the EBD
subgroup. It should be noted, however, that although the EBD group did score below average, its
lowest subscale score was not far below an “average” score, and some of its scores fell within
the average range; the scores within the average range were all on the YRS. The EBD group’s
composite scores also fell below what is considered average, ranging from 74 to 88 (90 to 110 is
average), although the highest composite score is again not far below the average range.
For the second of these two studies, developers were interested in determining if the
BERS-2 could discriminate between students without disabilities and those with learning
disabilities or behavioral disorders. Participants were a total of 418 students who were matched
on age and gender, and over 100 students were in each of the three groups. General education
teachers rated each of the participants, and significant differences between the groups were found
with students with behavioral disorders obtaining the lowest ratings on the five subscales and the
Strength Index. Despite these findings, it is notable that general education teachers rated the
students in all three groups. It may have been more appropriate to have special education
teachers rate those students who were primarily in special education.
Subscale interrelationships were the next method used to examine construct validity.
Developers claim that the subscales should be significantly correlated since they all
measure some aspect of behavioral/emotional strength, but they should not be correlated too
highly as that would indicate the scales do not contribute anything unique. Correlations between
the subscales were obtained and all were shown to be significant beyond the .01 level.
Coefficients range from .37 to .87 with a mean of .65. The mean coefficient is considered to be in
the acceptable range, which suggests that the BERS-2 subscales are related but measure different
aspects of behavioral and emotional strengths. A longer discussion of why and how this
conclusion was reached would be helpful; without one, readers without a statistical background
must take the developers’ word at face value. Still, these results provide evidence for construct
validity.
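For readers who want to see what a subscale interrelationship analysis involves, the following minimal Python sketch computes the correlation for every pair of subscales and then reports the lowest, highest, and mean coefficient. It is a sketch only and not the developers' procedure. The scores are invented for illustration, only three of the five subscales are shown, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def subscale_intercorrelations(scores):
    """Correlate every pair of subscales; return the pairs and a summary."""
    pairs = {(a, b): pearson(scores[a], scores[b])
             for a, b in combinations(scores, 2)}
    values = list(pairs.values())
    return pairs, min(values), max(values), sum(values) / len(values)


# Invented scores for eight students on three subscales (illustration only).
scores = {
    "interpersonal": [12, 9, 14, 7, 10, 11, 13, 8],
    "family": [11, 10, 13, 8, 9, 12, 12, 7],
    "intrapersonal": [10, 8, 12, 9, 11, 10, 14, 6],
}
pairs, lowest, highest, mean_r = subscale_intercorrelations(scores)
for (a, b), r in pairs.items():
    print(f"{a} x {b}: r = {r:.2f}")
print(f"range {lowest:.2f} to {highest:.2f}, mean {mean_r:.2f}")
```

The reasoning the developers rely on is visible in the summary line: coefficients well above zero suggest the subscales share a common construct, while coefficients well below 1.0 suggest each subscale still contributes unique information.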
The last method used for construct validity was factor analysis, a statistical technique that
examines the correlations among many variables to identify a smaller number of underlying
factors. Exploratory factor analysis with a Promax rotation
was used to analyze the TRS and confirmatory factor analyses were used for both the PRS and
YRS. According to the exploratory factor analysis, five factors were identified (the five
subscales). Further, results from the confirmatory factor analyses reveal that the BERS-2
Strength Index was identified as a valid underlying influence on the five subscales for the PRS
and the YRS. Within this discussion of factor analysis, developers mention “factor loadings,”
“eigenvalues,” and “goodness of fit” multiple times, although they never provide a clear
explanation of what these terms mean. For individuals without a strong background in
statistics, this discussion is likely to be confusing and difficult to follow. Although the
aforementioned tests of BERS-2 validity have some limitations, multiple studies,
including outside research (e.g., Benner, Beaudoin, Mooney, Uhing, & Pierce, 2008),
support the overall validity of the BERS-2 in identifying the behavioral and emotional strengths
of children.
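Because the manual leaves “eigenvalues” and “factor loadings” unexplained, a small worked example may help. The Python sketch below extracts the single largest eigenvalue of a correlation matrix by power iteration and converts the matching eigenvector into loadings. This is a simplified principal-component-style extraction, not the Promax-rotated exploratory analysis or the confirmatory analyses reported in the manual. The matrix is hypothetical: it assumes all five subscales correlate at exactly .65 (the mean coefficient reported above), and the function name is invented for illustration.

```python
from math import sqrt


def first_factor(corr, iterations=100):
    """Largest eigenvalue of a correlation matrix and the loadings on it."""
    n = len(corr)
    v = [1.0 / sqrt(n)] * n  # starting guess: equal weight on every subscale
    for _ in range(iterations):
        # Multiply the matrix by the current vector, then rescale to length 1.
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * cv[i] for i in range(n))  # Rayleigh quotient
    loadings = [sqrt(eigenvalue) * x for x in v]
    return eigenvalue, loadings


# Hypothetical matrix: five subscales, every pair correlated at .65.
n_subscales = 5
corr = [[1.0 if i == j else 0.65 for j in range(n_subscales)]
        for i in range(n_subscales)]

eigenvalue, loadings = first_factor(corr)
share = eigenvalue / n_subscales  # proportion of total variance explained
print(f"eigenvalue = {eigenvalue:.2f}, share of variance = {share:.0%}")
print("loadings:", [round(x, 2) for x in loadings])
```

Under this assumed matrix the eigenvalue is 1 + 4(.65) = 3.6, meaning a single factor accounts for 72% of the variance in the five subscales, and each subscale loads on that factor at about .85. An eigenvalue therefore expresses how much variance a factor explains, and a loading expresses how strongly one subscale is tied to that factor. Goodness of fit is a separate idea that belongs to confirmatory factor analysis and is not computed here.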
Conclusion
The BERS-2 is a well-formatted and user-friendly assessment. It is one of the few
assessments used to identify the strengths and abilities of students, and it is especially helpful for
professionals in creating a positive context for individualized services. The BERS-2 gathers
information from multiple informants and allows informants to make additional helpful
comments about a student through the “key questions.” Norming procedures were thorough and led
to representative samples, and scoring guidelines are clearly laid out and easily interpreted.
Despite the limitations in some of the BERS-2 reliability and validity tests, the BERS-2 has
proved to be a reliable and valid measure overall. Based on this, the BERS-2 is an appropriate
measure to use when identifying students with emotional or behavioral problems; however,
BERS-2 developers were open in stating that it should be used alongside additional important
data, and follow-ups with informants regarding their ratings should occur. When used in this way, the
BERS-2 can be a helpful first step in an assessment process.
References
Benner, G.J., Beaudoin, K., Mooney, P., Uhing, B.M., & Pierce, C.D. (2008). Convergent
validity with the BERS-2 Teacher Rating Scale and the Achenbach Teacher’s Report Form: A
replication and extension. Journal of Child and Family Studies, 17, 427-436.
doi:10.1007/s10826-007-9156-z
Buckley, J.A., & Epstein, M.H. (2004). The Behavioral and Emotional Rating Scale-2 (BERS-2):
Providing a comprehensive approach to strength-based assessment. The California
School Psychologist, 9, 21-27. Retrieved from http://education.ucsb.edu/schoolpsychology/CSP-Journal/index.html
Epstein, M.H. (2004). Behavioral and Emotional Rating Scale: A Strength-Based Approach to
Assessment—Second Edition (BERS-2). Austin, TX: PRO-ED, Inc.
Uhing, B.M., Mooney, P., & Ryser, G.R. (2005). Differences in strength assessment scores for
youth with and without ED across the youth and parent rating scales of the BERS-2.
Journal of Emotional and Behavioral Disorders, 13, 181-187. Retrieved from
http://www.ingentaconnect.com.libproxy.chapman.edu/content/proedcw/jebd