
Exploring race and gender differentials in student ratings of instructors:
Lessons from a diverse liberal arts college
Robert L. Moore, Hanna Song Spinosa, James D. Whitney
Occidental College
April 2014
Abstract: This paper explores differences in student ratings of instructors by the race and gender
of the instructor and also by the race-gender composition of students in each class. Our dataset is
the largest and most recent in the literature to date, consisting of 74,072 student course
evaluations submitted for 4,297 undergraduate classes taught by 443 instructors over Academic
Years 2006–2012 at Occidental College, as well as detailed information on the instructor and the
students enrolled in each class. Our paper differs further from previous research spanning a
variety of disciplines in several important ways: 1) we explicitly focus on the race as well as the
gender of instructors; 2) we examine a college with relatively high levels of race and gender
diversity of both students and faculty; 3) we analyze our data using the econometric techniques
that distinguish the empirical approach of the economics discipline, controlling for many
non-demographic factors that can affect student ratings; and 4) we supplement our core
econometric methodology with an Oaxaca decomposition and with a subsample analysis of 440 multi-section
courses taught contemporaneously by a single instructor. Our main findings include the
following: 1) overall class-average ratings differentials by instructor race and gender do not
appear large enough in general to play a material role in Occidental College's tenure and
promotion decisions; 2) for thorough analysis of potential race and gender ratings differentials, it
is important to take into account not only the race and gender of class instructors but the
demographic composition of the classes they teach as well; 3) several estimates of disaggregated
student-instructor pairings (for example, white male student ratings of white female instructors,
and so on) are sizeable but only a few are statistically significant; and 4) credible and robust
empirical results rest on a foundation of careful controls that include non-demographic factors
that can affect student ratings of instructors, potential heterogeneity of respondents, and
clustering of the data by class and instructor.
I. Introduction
Economists have explored a variety of cases in which race and gender differentials raise
concerns about discriminatory outcomes, including wages and employment in labor markets,
redlining in insurance markets, and fair housing in real estate markets. Similar concerns may
arise in academic markets as well. Student ratings of instructors (SRIs) constitute potentially
useful data to explore this issue, and abundant research in other disciplines utilizes SRIs to
examine ratings differentials by gender but almost none addresses ratings differentials by race.
Only a small amount of similar research appears in the economics literature to date, and what
exists is almost exclusively focused on ratings differentials by gender, not race. The perspectives
of varying disciplines enrich our collective understanding of important socioeconomic issues,
and the distinctive contribution that economists can make derives from the econometric tools that
the discipline applies to empirical research. That is the approach we take in this paper.
At a practical level, tenure and promotion decisions depend in part on teaching
effectiveness, and student ratings of instructors typically play an important role in the evaluation
of teaching effectiveness. So, prospectively, ratings differentials by race and gender can matter
for the career prospects of faculty. More broadly, student ratings of instructors may be able to
help inform the ongoing social concern with the persistence of race and gender discrimination.
Since SRIs are quantitative measures, they are amenable to statistical analysis. And since they
are typically anonymous, they constitute a data source of responses that are likely to be relatively
free of self-censorship. Drawn from a relatively young population, they might also help provide a
leading indicator about the future direction of social attitudes.
The relatively high level of race and gender diversity of both students and instructors at
Occidental College allows for a correspondingly wide range of demographic configurations of
instructors and class composition, which facilitates statistical estimation of ratings differentials
by the race and gender of instructors in combination with the demographic make-up of the
classes they teach. The extent to which race and gender differentials, even after controlling for
non-demographic factors that can also affect outcomes, correspond to discrimination remains a
challenging empirical issue. The differentials may reflect actual learning differences (Hativa,
2013), such as role-model effects, rather than discrimination per se. Or they may reflect student
reaction to differential treatment by their instructors rather than bias on their own part.
Nonetheless, the first step to teasing out possible discrimination entails the estimation of the
differentials themselves, and that is our aim here.
The key conclusions that emerge from analysis of our particular dataset are mixed. In only a
few cases are the empirical estimates of race and gender ratings differentials statistically
significant, although in several cases of disaggregated student-instructor pairings (for example,
white male student ratings of white female instructors, and so on) these estimates are sizeable.
Moreover, demographic enrollment patterns tend to dilute the impact of these disaggregated
estimates on the overall class-average ratings of instructors. The end result is that class-average
student ratings in our dataset do not differ enough by instructor race and gender to warrant
systematic ratings adjustments for tenure and promotion decisions, but do warrant a general
attentiveness to particular teaching situations in which instructor and student demographics
might matter. Our clearest findings are cautionary observations regarding the challenges that
research into the issues addressed in this paper must surmount in order to generate empirical
results that are credible, robust and statistically significant.
We briefly review the literature most directly related to our own research in Section II. We
describe our data and methodology in Section III, and our key empirical results in Section IV.
We conclude in Section V by highlighting our most important findings and offering some
suggestions for future research.
II. Literature review
There is a vast literature on student ratings of instructors and how such ratings might, or
might not, relate to teaching effectiveness. Hativa (2013), in her recent book, Student Ratings of
Instruction: Recognizing Effective Teaching, lists no fewer than 139 references related to this
topic. Included in this list are well over 40 citations concerning whether or not gender biases
student ratings of instructors (SRIs). On the other hand, we found very few published studies on
differences in student ratings of instructors by instructor race/ethnicity, an issue that our data and
research methods allow us to explore in significant detail. Indeed, this issue isn’t even mentioned
in the chapter of Hativa’s book that explores how various factors beyond the instructor’s control
can affect student ratings of instructors, such as class size, academic discipline, gender of the
instructor, etc. Our examination of the 139 references cited in Hativa revealed that only one
(Hamermesh and Parker, 2005) was even tangentially related to the issue of student ratings of
white versus minority instructors.
Our own literature search uncovered only five published studies related to the issue of
differentials in student ratings of instructors by race of the instructor: Hamermesh and Parker (2005),
the most closely related prior study to our own work on this topic, Smith (2007), Smith and
Hawkins (2011), Anderson and Smith (2005), and Smith and Anderson (2005), with only the
first three using empirical evidence from actual courses.1 Two of these three studies, Smith
(2007) and Smith and Hawkins (2011), were quite similar to each other in that they appear to use
a very similar data set of student evaluations from the College of Education at a research
institution in the Southern U.S. Both compared the average student ratings on two “global” items
(overall value of course and overall teaching ability) for White, Black, and “Other” instructors,
as well as the average ratings on 26 multidimensional items which address specific topics or a
single aspect of instruction. One of the studies was based on 13,702 undergraduate student
evaluation forms over a three-year period for 190 tenure-track faculty, 83% of whom were
classified as White, 12% as Black, and 5% as “Other” (Asian, Latino, or Native American). The
other study included a student sample about double that size and included graduate courses as
well over a three year period, 2001-2004, and apparently included the same group of 190 faculty.
For both studies, the authors concluded that “Black faculty received lower average ratings than
White faculty and faculty identified as “Other” for both the multidimensional and the global
items.” The average ratings for the multidimensional items were closer for Black and White
faculty than were those of the global items. More specifically, in the study of undergraduates
alone (Smith and Hawkins, 2011) the average rating for overall teaching ability on a five-point
descending scale was 4.08 for White faculty, 3.44 for Black faculty, and 4.22 for Other racial
1 The other two studies examined the influence of professor and student characteristics on students’ perceptions
of college professors based on a hypothetical syllabus for a social science course on “Race, Gender, and
Inequality.” The syllabus was constructed to vary by teaching style, professor ethnicity and professor gender and
used fictional names to represent Latino versus White professors, e.g., Lopez vs. Saunders. Students of different
ethnicities and gender were then asked to rate the “instructors” of these hypothetical courses based on the
syllabus.
groups. In the study that included both undergraduates and graduate classes (Smith, 2007), the
means were 4.25 for White faculty, 3.65 for Black faculty and 4.16 for faculty from Other racial
groups. The authors indicated that these differences in mean ratings were statistically significant
at the 5% level for Black vs. White faculty. The authors argued that “the lower student ratings on
the global items…were especially troublesome because these ratings have the power to affect
faculty merit increases and careers” (Smith and Hawkins, 2011: 159). However, neither of these
studies included (1) control variables for factors other than race of instructor that might influence
student ratings or (2) statistical adjustments that apply to observations drawn from recurring
survey data.
Hamermesh and Parker (2005) addressed both of these issues in their study, incorporating
instructor race into a multivariate regression analysis framework. The dataset covered 463
courses taught by 94 instructors at the University of Texas at Austin during the 2000-2002
academic years, with class-average ratings of instructors by over 16,000 total student
evaluations. Only about 10% (numbering 9 or 10) of these instructors were classified as
“minority.” Other than minority status, control variables for teacher characteristics included the
instructor’s gender, whether they were on tenure track, and whether they were educated in a
non-English-speaking country, in addition to the key instructor variable of interest in their particular
study, namely the instructor’s composite beauty rating by students. Control variables for course
characteristics included class size and whether the course was upper or lower division. The
regression results indicated that, holding everything else the same, minority faculty were rated
lower than white faculty at the University of Texas at Austin over this period of time, on the
order of 0.25 points on a five-point scale (amounting to about 0.5 standard deviations).
As for the previous research on gender and student ratings of instructors, Hativa first
summarizes a meta-analysis of approximately 36 studies conducted by Feldman (1993),
indicating that the majority of studies reported no significant differences in student ratings of
instructors by gender of instructor. She also notes that “most other reviews of studies of
gender-SRI relationships have also concluded that these ratings have no strong or regular pattern of
gender-based bias (Algozzine et al., 2004; Arreola, 2000; Cashin, 1995; Feldman, 1992;
Gravestock & Gregor-Greenleaf, 2008; Theall & Franklin, 2001)” (Hativa, 2013: 81).
Many of these studies use small samples and have few controls for other factors that have been
shown to affect student ratings. An exception is Centra and Gaubatz (2000), which utilizes
data from the Student Instructional Report II developed by the Educational Testing Service
(ETS) covering 741 classes in eight major discipline groups across about 20 different institutions
that use this Student Instructional Report. The dataset is one of the few that includes the gender
of the student who completes each evaluation form. The authors apply multivariate analysis of
variance and conclude,
Is there Gender Bias in Student Evaluations of Teaching? The results reflect some same gender preferences,
particularly in female students rating female teachers. But the differences in ratings, though statistically
significant, are not large and should not make much difference in personnel decisions. Moreover the higher
evaluations received by female teachers from females, and in some instances from males as well (Natural
Sciences in particular), could well be due to differences in teaching styles. Women in this study were more
likely than men to use discussion rather than a lecture method, and as a group they appear to be a little more
nurturing to students, as also reflected in certain scales in this study (p.32).
Hamermesh and Parker (2005) included control variables for instructor gender as well as race
in their multivariate regression analysis. They found that, holding the other variables constant,
female faculty were rated lower than males by approximately 0.24 ratings points (about 0.5
standard deviations), a statistically significant gap at the 5% level. The authors noted that this
result departed from the consensus in the literature on this question, i.e., that there is no
statistically significant relationship between instructor gender and student ratings of instructors.
Our own search of the literature uncovered two other articles by economists, Anderson and
Siegfried (1997) and Saunders and Saunders (1999), that examined samples of Principles of
Economics courses and focused on the interaction between student ratings of instructors and the
gender of instructors and students. Both studies applied econometric analysis and utilized the
Test of Understanding of College Economics (TUCE III) data set. Saunders and Saunders (1999)
also examined data from Indiana University Principles of Economics sections taught by associate
instructors. Both data sets include some instructor characteristics in addition to gender,
demographic and other information for the individual students completing the ratings, and
exam-based measures of student learning. The authors find statistically significant evidence of
same-gender preference only for the Indiana University data set for Principles of Microeconomics
courses. The finding is not consistent over time and does not emerge from their analysis of
Indiana University Macroeconomics courses or for the TUCE III samples of 20 micro and 19
macro classes. Anderson and Siegfried (1997), using 1990 TUCE III data for 87 Principles of
Macroeconomics and 80 Principles of Microeconomics classes at 53 institutions, conclude that
when compared to student learning, the evidence we summarize from student ratings…reveals no evidence of
student bias against female instructors. If anything, there is some evidence that in micro students rate female
instructors higher than male instructors while learning similar amounts from each and, in macro, students rate
male and female instructors similarly in spite of learning less in the classes [taught by] women (pp. 355-6).
III. Data and Methodology
Our dataset consists of all student evaluations that include an overall student rating of
instructor submitted for Occidental College full-credit classes (counting for 4 or more units) with
enrollments above 5 students during the seven academic years from 2006 to 2012. The dataset
totals 74,072 evaluations submitted for 4,297 classes taught by 443 instructors. Students fill out
the individual course evaluations anonymously, so the form lacks information regarding the race
and gender of individual respondents. However, information from the College’s Registrar’s
Office enabled us to calculate the overall race and gender composition of the students enrolled in
the class. For each instructor, we added information provided by Occidental’s Office of Human
Resources regarding their race and gender, whether they were on regular (tenured/tenure track)
appointment or were part-time or full-time adjuncts, and their years of experience at Occidental.
We feel Occidental College is particularly well suited for a case study of race and gender
differentials in student ratings of instructors because of its relatively high level of diversity for
both faculty and students. Table III.1 compares Occidental to other national liberal arts colleges in terms
of the race and gender composition of its full-time faculty and students. Based on Herfindahl
indexes constructed from each college’s instructor race-gender employment shares and student
race-gender enrollment shares, Occidental ranks 14th for faculty diversity and 8th for student
diversity among US News national liberal arts colleges.2 Occidental’s comparatively high
diversity in turn generates a comparatively wide range of demographic variation across the
classes in our dataset.
Table III.1: Race and gender composition of full-time faculty and students: Occidental College and average
values for US News national liberal arts colleges, Academic Year 2009-10

                                         Occidental College        US News average           Occidental rank
                                         Male  Female   Total     Male  Female   Total      Male  Female  Total
Percentage composition of full-time faculty:
American Indian / Alaska native          0.0%    0.0%    0.0%     0.1%    0.1%    0.2%        42      42     70
Asian / Hawaiian / Pacific Islander      4.1%    7.8%   11.9%     2.4%    2.4%    4.8%        37       9     13
Black / African American                 4.6%    2.0%    6.6%     3.0%    3.4%    6.4%        26      83     35
Hispanic / Latino                        4.3%    7.4%   11.7%     1.3%    1.5%    2.8%        12       4      4
White non-Hispanic                      39.7%   30.1%   69.8%    48.3%   37.6%   85.9%       212     215    239
Total                                   52.7%   47.3%  100.0%    55.1%   45.0%  100.1%       169      91

Percentage composition of full-time students:
American Indian / Alaska native          0.6%    0.6%    1.2%     0.3%    0.4%    0.7%        27      37     32
Asian / Hawaiian / Pacific Islander      6.9%    9.8%   16.7%     1.7%    2.8%    4.5%         7      12     10
Black / African American                 2.7%    3.2%    5.9%     4.8%    7.5%   12.3%       114     108    124
Hispanic / Latino                        5.8%    8.1%   13.9%     2.1%    3.1%    5.3%        19      22     15
White non-Hispanic                      27.9%   34.3%   62.3%    33.2%   44.0%   77.1%       205     217    228
Total                                   43.9%   56.1%  100.0%    42.1%   57.9%  100.0%       133     131

US News sample: 259 national liberal arts colleges with instructor data and 263 with student data
2 The Herfindahl indexes are calculated as the sum of the squares of the individual race-gender shares of
employment for full-time faculty and of enrollment for full-time students. The data come from the IPEDS Data
Center of the National Center for Education Statistics <http://nces.ed.gov/ipeds/datacenter/>.
In terms of overall methodology, like Smith (2007) and Smith and Hawkins (2011), we focus
explicitly on how student ratings of instructors vary by race of instructor, and we extend the
focus to gender as well. Like Centra and Gaubatz (2000), we take into account the demographics
of students as well as instructors. And, like Hamermesh and Parker (2005), whose methodology
most closely matches the approach we take, we control for factors other than demographics that
might account for differences in student ratings of instructors. Later in our paper, we borrow
from the labor economics literature on the sources of earnings differentials by race and gender
(Oaxaca, 1973) by undertaking Oaxaca decompositions of student ratings differentials to help
further explore our main results.
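The Oaxaca decomposition borrowed here splits a mean outcome gap between two groups into a part explained by differences in average characteristics and an unexplained part attributed to differences in coefficients. A minimal sketch on synthetic data follows; the twofold split valued at group A's coefficients is one common convention, not necessarily the paper's exact specification.

```python
# Hedged sketch of a twofold Oaxaca (1973) decomposition. All data below are
# synthetic; the paper's actual ratings data are not reproduced here.
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """OLS coefficients via least squares (X includes a constant column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Synthetic class characteristics (constant + one covariate) for two
# hypothetical instructor groups A and B with different means and coefficients.
nA, nB = 500, 500
xA = rng.normal(1.0, 1.0, nA)
xB = rng.normal(0.5, 1.0, nB)
yA = 6.0 + 0.30 * xA + rng.normal(0, 0.2, nA)
yB = 5.8 + 0.10 * xB + rng.normal(0, 0.2, nB)

XA = np.column_stack([np.ones(nA), xA])
XB = np.column_stack([np.ones(nB), xB])
bA, bB = fit_ols(XA, yA), fit_ols(XB, yB)

gap = yA.mean() - yB.mean()
# "Explained": differences in average characteristics, valued at A's coefficients;
# "unexplained": differences in coefficients, evaluated at B's means.
explained = (XA.mean(axis=0) - XB.mean(axis=0)) @ bA
unexplained = XB.mean(axis=0) @ (bA - bB)
assert np.isclose(gap, explained + unexplained)  # the decomposition is exact
```

Because OLS residuals with a constant have mean zero, the two pieces sum to the raw gap exactly, which makes the decomposition a useful accounting identity rather than a separate estimation step.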
The basic structure of the regression equations we estimate has the form
(1) Qn = α + β′Xn + γ′Zn + εn.
The subscript n denotes a sample observation, which by default is a class as in Hamermesh and
Parker (2005) but in some specified cases is an individual student evaluation. The dependent
variable Q is the student rating of instructor (SRI), either the class average or an individual
student rating as appropriate. More specifically, the SRI corresponds to the student rating on a
seven-point descending scale in response to a course evaluation statement that reads, “Overall,
the instruction for this course was excellent.” The right-hand-side expression β′Xn denotes a
vector summation of demographic variables for each observation (Xn) multiplied by their
corresponding estimated coefficients (β). An analogous interpretation applies to the term γ′Zn
when non-demographic control variables (Zn) are included in the equation. α denotes a constant
and εn a random error term for observation n.
In all of the equations we estimate, our key focus is on the variables X that directly relate to
race-gender ratings differentials. We aggregate race into three categories: White (W) for white
Caucasians, Underrepresented minority (U) for African-Americans, Latinos and Native
Americans, and Other (O) for those classified as Asian-Americans, Other or (race) Not Reported
(Asian-Americans constitute 78% of this category). We denote
Σ_RG = Σ_{R=W,U,O} Σ_{G=M,F}
to represent the summation over the alternative demographic combinations of Race (R) and
Gender (G). For equations that incorporate demographic information for instructors but not
students, we estimate five coefficients for instructor (i) race and gender dummy variables DRGi
measured against the excluded benchmark case of White Male instructors, so β′Xn takes the form
Σ_RGi β_RGi·(DRGi)n.
When we take into account student demographics, we do so for each class by interacting the
instructor’s demographic dummy variable with the percentage enrollment in the instructor’s class
of each student demographic group (PctRGsRGi)n, so β′Xn then becomes a thirty-five-term
expression,
Σ_RGi Σ_RGs β_RGsRGi·(DRGi)n·(PctRGsRGi)n,
excluding for estimation purposes a thirty-sixth possible pairing of White Male students and
White Male instructors.
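The regressor construction just described — five instructor dummies measured against the White Male benchmark, plus thirty-five instructor-by-class-composition interactions — can be sketched as follows. The group labels match the text's categories, but the two example classes and their composition percentages are purely illustrative.

```python
# Hedged sketch of the design-matrix construction described in Section III.
import numpy as np

GROUPS = ["WM", "WF", "UM", "UF", "OM", "OF"]  # race-gender categories in the text
BENCHMARK = "WM"

def instructor_dummies(instr_groups):
    """Five 0/1 columns D_RGi, one per non-benchmark instructor group."""
    cols = [g for g in GROUPS if g != BENCHMARK]
    return np.array([[1.0 if g == c else 0.0 for c in cols] for g in instr_groups]), cols

def interaction_terms(instr_groups, pct_by_student_group):
    """35 columns D_RGi * Pct_RGs, dropping the WM-student/WM-instructor pairing."""
    X, names = [], []
    for ig in GROUPS:
        for sg in GROUPS:
            if ig == BENCHMARK and sg == BENCHMARK:
                continue  # excluded benchmark pairing
            names.append(f"{sg}s_x_{ig}i")
            X.append([(1.0 if g == ig else 0.0) * p[sg]
                      for g, p in zip(instr_groups, pct_by_student_group)])
    return np.array(X).T, names

# Two illustrative classes: one WF instructor, one WM instructor
instr = ["WF", "WM"]
pcts = [{"WM": 0.24, "WF": 0.35, "UM": 0.08, "UF": 0.12, "OM": 0.08, "OF": 0.13},
        {"WM": 0.30, "WF": 0.30, "UM": 0.08, "UF": 0.11, "OM": 0.09, "OF": 0.12}]

D, dcols = instructor_dummies(instr)
Z, zcols = interaction_terms(instr, pcts)
print(D.shape, len(dcols), Z.shape, len(zcols))  # (2, 5) 5 (2, 35) 35
```

Stacking these columns (with a constant and any non-demographic controls) gives the X and Z blocks of equation (1).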
IV. Results
IV.A. Student Rating of Instructor (SRI) differentials by instructor race and gender
As reported in Table IV.1, the pattern of average student ratings in the Occidental College
dataset dovetails with previous findings: relative to White Male instructors, other race-gender
instructor groups receive lower student ratings.
Table IV.1: Dependent variable: Class-average Student Rating of Instructor (SRI)

Unit of observation:                                 Indiv. evaluation     Class average
Estimation method:                                   OLS                   WLS (by evaluation count)
Standard error adjustment:                           None                  443 clusters by instructor
Number of observations:                              74,072                4,297

Independent variables                                 Coef.   Pr(B=0)       Coef.   Pr(B=0)
Constant (White Male instructor (WMi))                5.998    0.000        5.998    0.000
Differential versus WMi for:
  White Female instructor (WFi)                      -0.100    0.000       -0.100    0.294
  Underrepresented minority Male instructor (UMi)    -0.044    0.019       -0.044    0.747
  Underrepresented minority Female instructor (UFi)  -0.085    0.000       -0.085    0.560
  Other Male instructor (OMi)                        -0.330    0.000       -0.330    0.009
  Other Female instructor (OFi)                      -0.186    0.000       -0.186    0.268

Note: Total sample class-average SRI, weighted by number of respondents: mean = 5.924, standard deviation =
0.837
However, most of the ratings differentials are comparatively small in our sample. The
differential for White Female instructors is -0.10 ratings points, amounting to -0.12 standard
deviations of the respondent-weighted class-average SRI. The differentials are even smaller for
Underrepresented minority faculty. Only for Other Male (primarily Asian) instructors does the
differential (-0.33 ratings points, -0.39 standard deviations) approach the size of the
effects reported by Smith (2007), Smith and Hawkins (2011), Centra and Gaubatz (2000)
and Hamermesh and Parker (2005).
Moreover, nearly all of the estimated differentials become statistically insignificant after
adjusting the error terms for clustering of the data. Our sample consists of over 74,000 individual
student course evaluations, but they are not independent observations. Because the evaluations
constitute survey data drawn from 4,297 classes taught by 443 instructors, it is crucial to adjust
the standard errors for the resulting multilevel clustering of the data. The two right-hand columns
of Table IV.1 report the adjusted results, achieved by first collapsing the data to 4,297
observations of respondent-weighted class average ratings and then clustering the class averages
by instructor. Only the
average differential for Other Male instructors remains statistically significant. So overall, the
starting point for our empirical analysis is a dataset with smaller and predominantly statistically
insignificant race-gender ratings differentials compared to the findings reported in past studies.
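The clustering adjustment described above can be illustrated with a hand-rolled cluster-robust ("sandwich") variance estimator on synthetic data containing a common within-cluster shock. This is a sketch of the general technique, not the paper's actual estimation code; cluster sizes and variances are invented for the example.

```python
# Hedged sketch: OLS point estimates with cluster-robust standard errors,
# computed by hand on synthetic data where observations share a cluster shock.
import numpy as np

rng = np.random.default_rng(1)
n_clusters, per_cluster = 40, 25
cluster = np.repeat(np.arange(n_clusters), per_cluster)
x = rng.normal(size=n_clusters * per_cluster)
u = np.repeat(rng.normal(0, 1.0, n_clusters), per_cluster)  # common within-cluster shock
y = 5.9 + 0.1 * x + u + rng.normal(0, 0.5, x.size)

X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Cluster-robust "sandwich": meat = sum over clusters g of (X_g' u_g)(X_g' u_g)'
meat = np.zeros((2, 2))
for g in range(n_clusters):
    m = cluster == g
    s = X[m].T @ resid[m]
    meat += np.outer(s, s)
V_cluster = XtX_inv @ meat @ XtX_inv
V_naive = (resid @ resid / (x.size - 2)) * XtX_inv

se_cluster = np.sqrt(np.diag(V_cluster))
se_naive = np.sqrt(np.diag(V_naive))
# With a shared within-cluster shock, the naive intercept SE is far too small:
assert se_cluster[0] > se_naive[0]
```

The same mechanism explains Table IV.1: the point estimates are unchanged, but p-values that looked tiny under independence become insignificant once the standard errors recognize that 74,072 evaluations carry only 443 instructors' worth of independent variation.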
IV.B. Estimating SRI differentials when student demographics vary across classes
One reason for the small ratings differentials in our data could be that course enrollments by
a diverse student body might adhere to a pattern that offsets and thereby masks larger underlying
differentials, making it important to consider instructor and student demographics jointly.
Saunders and Saunders (1999: 467), for example, found in their TUCE III dataset that students
were overrepresented by gender in Principles of Economics classes taught by same-gender
instructors. The students in our data are consistently overrepresented in courses taught by
instructors of their own race and gender.
Panel A of Table IV.2 reports the sample frequency of average course enrollments
disaggregated by the race and gender of both instructors and students. The overall frequencies
are recalibrated in Panel B to show the average race-gender composition of classes taught by
each faculty demographic group and facilitate comparisons to the overall average class
composition, reported in the bottom row of the panel. Own-group pairings appear along the
diagonal, so, for example, the second row-second column diagonal entry reports that White
Female students (WFs) on average account for 34.7% of the enrollments in classes taught by
White Female instructors (WFi), compared to 30.8% of the enrollments in all classes. In every
case, students are overrepresented in classes taught by instructors that match their own race and
gender. Among the cross-group (off-diagonal) cases of overrepresented enrollments,
Underrepresented minority students of both genders (UMs and UFs) are overrepresented in
classes taught by Underrepresented minority instructors (UMi and UFi), and the same is true for
Other (primarily Asian) students as well. Five of the six remaining cases of overrepresentation
are gender matched: both White Male (WMs) and Underrepresented minority Male students
enrolled in classes taught by Other Male instructors (OMi), Other Male students (OMs) in
classes taught by White Male instructors (WMi) and nonwhite female students (UFs and OFs) in
classes taught by White Female instructors. Only one of the thirty-six group pairings constitutes
overrepresentation that is unmatched for both race and gender: Underrepresented minority Male
students enrolled in classes taught by Other Female instructors (OFi).
Table IV.2: Average class composition of students by gender and ethnicity, overall and
by instructor gender-ethnicity subgroup

Panel A: Percentage distribution of course enrollments, total sample

Instructor                   Student subgroup                       Instructor share
subgroup      WMs     WFs     UMs     UFs     OMs     OFs          of enrollments
WMi         11.7%   12.0%    3.3%    4.2%    3.8%    4.8%              39.7%
WFi          7.0%    9.9%    2.4%    3.6%    2.1%    3.6%              28.5%
UMi          2.0%    2.4%    0.9%    1.4%    0.7%    1.0%               8.5%
UFi          2.2%    2.9%    1.0%    1.8%    0.6%    1.0%               9.5%
OMi          1.5%    1.4%    0.5%    0.6%    0.7%    0.9%               5.6%
OFi          2.1%    2.2%    0.7%    0.9%    1.1%    1.3%               8.3%
Total % of
enrollments 26.5%   30.8%    8.8%   12.4%    8.9%   12.5%

Panel B: Average class race-gender composition by instructor subgroup

Instructor                   Student subgroup
subgroup      WMs     WFs     UMs     UFs     OMs     OFs     Total
WMi         29.5%   30.3%    8.3%   10.5%    9.5%   12.0%    100.0%
WFi         24.4%   34.7%    8.3%   12.5%    7.5%   12.7%    100.0%
UMi         23.8%   28.5%   10.6%   17.0%    8.3%   11.8%    100.0%
UFi         23.4%   30.5%   10.6%   19.0%    5.8%   10.7%    100.0%
OMi         27.0%   25.1%    9.2%   10.3%   13.0%   15.4%    100.0%
OFi         25.2%   26.8%    8.9%   10.7%   13.0%   15.5%    100.0%
Average % of
enrollments 26.5%   30.8%    8.8%   12.4%    8.9%   12.5%
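The recalibration from Panel A to Panel B amounts to dividing each instructor row by its row total. A small sketch using the rounded Panel A percentages (so the results match the published Panel B only up to rounding):

```python
# Sketch of the Panel A -> Panel B recalibration in Table IV.2: each instructor
# row of enrollment frequencies is divided by its row total, giving the average
# class composition for that instructor group. Inputs are the rounded Panel A
# entries, so output differs slightly from the published Panel B.

panel_a = {  # % of total enrollments, instructor row x student column (WMs..OFs)
    "WMi": [11.7, 12.0, 3.3, 4.2, 3.8, 4.8],
    "WFi": [7.0, 9.9, 2.4, 3.6, 2.1, 3.6],
    "UMi": [2.0, 2.4, 0.9, 1.4, 0.7, 1.0],
    "UFi": [2.2, 2.9, 1.0, 1.8, 0.6, 1.0],
    "OMi": [1.5, 1.4, 0.5, 0.6, 0.7, 0.9],
    "OFi": [2.1, 2.2, 0.7, 0.9, 1.1, 1.3],
}

panel_b = {row: [100.0 * v / sum(vals) for v in vals]
           for row, vals in panel_a.items()}

# e.g. White Female students in White Female instructors' classes:
print(round(panel_b["WFi"][1], 1))  # 34.6 (the published Panel B reports 34.7)
```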
To explore the relationship between SRI differentials and class composition by race and
gender, we incorporated student race-gender class enrollment data into our regression equation.
We then used the regression results to calculate estimated class-average SRIs for each of the
thirty-six possible race-gender pairings of instructors and students in order to make comparisons
of differentials, as reported in Table IV.3.
Table IV.3: Estimated average student rating of instructors (SRI) by race-gender subgroups of instructors and students

Panel A: Each cell reports each subgroup's estimated class-average SRI differential versus the
overall sample average

Instructor  Enrollment                   Student subgroup                          Enrollment
subgroup    weight**       WMs       WFs       UMs       UFs       OMs       OFs   wtd. avg.*
WMi          39.7%        0.418    -0.035     0.313    -0.053    -0.615    -0.007     0.074
                         (0.019)   (0.851)   (0.337)   (0.891)   (0.122)   (0.982)   (0.120)
WFi          28.5%       -0.561     0.282     0.380    -0.184    -0.219     0.168    -0.026
                         (0.017)   (0.033)   (0.391)   (0.634)   (0.620)   (0.482)   (0.667)
UMi           8.5%       -0.469    -0.375     0.923     0.745     0.100     0.131     0.030
                         (0.265)   (0.319)   (0.008)   (0.003)   (0.934)   (0.710)   (0.791)
UFi           9.5%       -0.337     0.021     0.184     0.606     0.430    -0.927    -0.011
                         (0.365)   (0.938)   (0.764)   (0.142)   (0.521)   (0.150)   (0.926)
OMi           5.6%        0.138    -0.299    -1.014    -1.182     0.305    -0.284    -0.256
                         (0.749)   (0.541)   (0.110)   (0.003)   (0.446)   (0.608)   (0.012)
OFi           8.3%       -0.166     0.253    -0.003    -0.355     0.200    -0.813    -0.112
                         (0.749)   (0.409)   (0.997)   (0.469)   (0.624)   (0.113)   (0.446)
Instructor subgroup weighted** average differential of each
student subgroup from the overall sample average:
                         -0.072     0.041     0.272    -0.048    -0.224    -0.114

Panel B: The diagonal entries serve as column benchmarks, and the other cells in each column
report differentials versus the column benchmark

Instructor                               Student subgroup
subgroup        WMs       WFs       UMs       UFs       OMs       OFs
WMi            0.418    -0.316    -0.610    -0.659    -0.920     0.806
              (0.019)   (0.170)   (0.195)   (0.247)   (0.104)   (0.181)
WFi           -0.979     0.282    -0.543    -0.790    -0.523     0.981
              (0.001)   (0.033)   (0.338)   (0.166)   (0.376)   (0.082)
UMi           -0.888    -0.656     0.923     0.139    -0.204     0.944
              (0.053)   (0.099)   (0.008)   (0.772)   (0.874)   (0.128)
UFi           -0.755    -0.260    -0.739     0.606     0.126    -0.114
              (0.067)   (0.387)   (0.292)   (0.142)   (0.872)   (0.890)
OMi           -0.280    -0.581    -1.937    -1.788     0.305     0.529
              (0.548)   (0.251)   (0.007)   (0.002)   (0.446)   (0.482)
OFi           -0.585    -0.028    -0.926    -0.961    -0.104    -0.813
              (0.292)   (0.932)   (0.352)   (0.132)   (0.855)   (0.113)
Instructor subgroup weighted** average differential of each
student subgroup from its respective benchmark:
              -0.813    -0.337    -0.712    -0.722    -0.560     0.761

*Derived from data in Table IV.2 Panel A.
**Share of total enrolled students in dataset.
Least squares estimation of 4,297 observations of class-average ratings weighted by number of respondents
and clustered by 443 instructors.
Values in parentheses correspond to Pr(Ho). Ho for Panel A and diagonals of Panel B: Deviation from average = 0.
Ho for Panel B off-diagonal estimates: Deviation from column benchmark = 0.
Note: Overall sample class-average SRI, weighted by number of respondents: mean = 5.924, standard deviation = 0.837
Panel A of Table IV.3 reports estimated class-average differentials for each student-instructor pairing relative to the overall sample average SRI. The highlighted values along the diagonal report own-group race-gender pairings. For example, the first diagonal entry indicates that White Male students rate White Male instructors 0.418 ratings points higher than the full-sample average rating of 5.924. Except for Other Females, all of the estimated own-group differentials are positive and sizeable, averaging 0.51 ratings points (0.61 standard deviations).
The estimates are statistically significant at probabilities under five percent for Whites (WM and
WF) and for Underrepresented minority Males (UM).
Ratings differentials relative to the overall sample mean are driven by potential ratings
differentials both between student groups and within them. Some race-gender student groups
may “grade harder” than others across the board. That can certainly affect average instructor
SRIs when enrollment patterns vary by instructor groups, but it can also coexist with an absence
of race-gender ratings differentials when measured within the context of the “grading standards”
that different student groups might apply. The bottom row of data in Panel A of Table IV.3
reports the fixed-weight average SRI differentials for each student group, with the dataset share
of total course enrollments for each instructor group serving as weights (reported in the right-hand column of the table). The average differentials across the student groups span 0.50 ratings
points (0.59 standard deviations), large enough to play an important role in the estimated
disaggregated ratings.
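The fixed-weight averaging described above can be reproduced directly from the table entries. The sketch below is illustrative only: it hard-codes the Panel A differentials and the instructor-group enrollment shares reported in Table IV.3 (rows are instructor subgroups WMi..OFi, columns are student subgroups WMs..OFs) and recovers the reported bottom-row averages to rounding error.

```python
import numpy as np

# Class-average SRI differentials from Table IV.3, Panel A:
# rows = instructor subgroups (WMi..OFi), columns = student subgroups (WMs..OFs).
panel_a = np.array([
    [ 0.418, -0.035,  0.313, -0.053, -0.615, -0.007],   # WMi
    [-0.561,  0.282,  0.380, -0.184, -0.219,  0.168],   # WFi
    [-0.469, -0.375,  0.923,  0.745,  0.100,  0.131],   # UMi
    [-0.337,  0.021,  0.184,  0.606,  0.430, -0.927],   # UFi
    [ 0.138, -0.299, -1.014, -1.182,  0.305, -0.284],   # OMi
    [-0.166,  0.253, -0.003, -0.355,  0.200, -0.813],   # OFi
])
# Instructor-group shares of total course enrollment (right-hand column of the table).
weights = np.array([0.397, 0.285, 0.085, 0.095, 0.056, 0.083])
weights = weights / weights.sum()   # normalize (published shares round to 100.1%)

# Fixed-weight average differential for each student subgroup (bottom row of Panel A).
fixed_weight_avg = weights @ panel_a
```

Evaluating this reproduces the bottom row of Panel A (-0.072, 0.041, 0.272, -0.048, -0.224, -0.114) to within rounding.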
Panel B of Table IV.3 carries over the diagonal entries from Panel A but treats each one as a
benchmark differential for its respective student group—specifically, a benchmark corresponding
to how highly the students in that group rate instructors of their own race and gender. The off-diagonal entries in each column are re-calibrated as SRI differentials relative to the column
benchmark, so they provide estimated averages of how the student group rates instructors of
differing race and gender (cross-group benchmarked differentials) compared to how they rate
instructors of their own race and gender. For example, the -0.316 entry at the top of the second
column constitutes an estimate of how much lower on average White Female students rate White
Male instructors compared to White Female instructors. In short, the off-diagonal entries
measure estimated ratings differentials within each student demographic group.
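Arithmetically, the re-benchmarking is a column-wise subtraction: each off-diagonal Panel B entry equals the corresponding Panel A entry minus that column's own-group diagonal (the reported p-values, by contrast, come from re-estimating the model under the new parameterization, not from this subtraction). A minimal sketch using the Panel A values from Table IV.3:

```python
import numpy as np

# Table IV.3, Panel A: rows = instructor subgroups (WMi..OFi),
# columns = student subgroups (WMs..OFs).
panel_a = np.array([
    [ 0.418, -0.035,  0.313, -0.053, -0.615, -0.007],
    [-0.561,  0.282,  0.380, -0.184, -0.219,  0.168],
    [-0.469, -0.375,  0.923,  0.745,  0.100,  0.131],
    [-0.337,  0.021,  0.184,  0.606,  0.430, -0.927],
    [ 0.138, -0.299, -1.014, -1.182,  0.305, -0.284],
    [-0.166,  0.253, -0.003, -0.355,  0.200, -0.813],
])
# Each student group's own-group (diagonal) differential is its column benchmark;
# subtracting it re-expresses every entry relative to that benchmark.
benchmarks = np.diag(panel_a)
panel_b = panel_a - benchmarks   # broadcasts each column's benchmark down the column
```

For example, White Male students' differential for White Female instructors becomes -0.561 - 0.418 = -0.979, the value reported in Panel B. (In the published table the diagonal cells of Panel B display the benchmarks themselves rather than zeros.)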
Other Female students remain an outlier in Panel B, but a consistent pattern spans the other
student groups. Estimated cross-group ratings are lower than own-group ratings in 23 of 25
cases; the two exceptions are positive differentials for Underrepresented minority Male
instructors by Underrepresented minority Female students and for Underrepresented minority
Female instructors by Other Male students. The sizes of the estimated differentials are typically
large compared to previously reported findings, averaging 0.63 ratings points (0.75 standard
deviations), although only a few of the estimates are statistically significant, namely, lower
ratings of White Female instructors by White Male students and lower ratings of Other Male
instructors by Underrepresented minority Male and Female students. “Large insignificant”
effects may be the oxymoron that best summarizes the estimates reported in Table IV.3, but the
results nonetheless highlight the importance of accounting for both instructor and student
demographics when estimating race-gender differentials in student ratings of instructors for
samples in which race and gender diversity characterizes both students and instructors.
For expository purposes, it is convenient to refer to group-specific differentials as if they
correspond to the student ratings per se, but such a connection is unavoidably speculative here
since our dataset does not include the actual race-gender identity of individual respondents. The
empirical outlier group of Other Female students with its large negative differential (-0.81 ratings
point) for ratings of Other Female instructors and higher ratings for all other instructor
demographic groups provides an illustrative case in point. Based on our data and findings, we
cannot rule out the possibility that Other Female students rate instructors in this fashion, but
neither can we confirm it, and it appears inconsistent with the observed pattern of
overrepresentation of Other Female students in classes taught by Other Female instructors. Our
results in fact indicate only that the classes taught by Other Female instructors in our dataset are
associated with lower ratings as the proportion of Other Female students in those classes rises.
IV.C. Controlling for non-demographic factors that are likely to be associated with SRIs
Although our dataset fails to include the race and gender of individual student respondents, it
does include respondent-specific information about many other non-demographic factors that are
likely to be associated with student ratings of instructors, which gives us the opportunity to
reduce the risks of omitted variable bias. Our preferred specification incorporates these other
factors, similar to the approach taken by Hamermesh and Parker (2005).
In their study of the relationship between student ratings and the perceived beauty of
instructors, Hamermesh and Parker (2005) included two additional dummy control variables
related to classes (Lower division, One-credit course) and four related to instructors (Female,
Minority, Non-native English speaker, Tenure track). We incorporated a similar but more
extensive set of class and instructor variables, listed in Table IV.4.
Table IV.4: Descriptive data for non-demographic control variables

Variable | Sample frequency
Course level:
  100-level (dummy) | 41%
  200-300 level (excluded) | 56%
  400-level (dummy) | 3%
Course classification:
  Cultural Studies Program (dummy) | 8%
  Arts and Humanities (dummy) | 35%
  Science (dummy) | 26%
  Social Sciences (excluded) | 30%
Seminar course (dummy) | 14%
Classification of instructor:
  Part-time adjunct (dummy) | 17%
  Full-time adjunct (dummy) | 21%
  Tenure/tenure track (excluded) | 62%
New offering by instructor (dummy) | 18%

Variable | Sample average
Instructor years of experience at Occidental | 11.3
Years of experience if greater than six | 18.8
Total class enrollment | 23.0
Average student seniority (1=Frosh, 4=Senior) | 2.4
Percent graduate student enrollment | 0.4%
Percent enrolled for Core requirement | 30%
Percent enrolled for Major requirement | 43%
Evaluation response rate | 90%
Average grade awarded (A=4.00) | 3.29
Average expected grade (A=4.00) | 3.41
Following Hamermesh and Parker (2005), we first estimated the ratings equation with the
instructor race-gender dummy variables but not the variables for student demographics. Table
IV.5 reports the results. Among the non-demographic variables, the most sizeable and
statistically significant include positive estimated coefficients for the response rate and expected
grade, and negative estimated coefficients for science courses, courses taught by adjunct
instructors, new course offerings, and the percentage of students enrolled to fulfill core or major
field requirements. The estimated race-gender differentials versus White Males average -0.26
standard deviations, 45% larger than the estimates without controls, as reported in Table IV.1,
and the revised estimates are universally larger and more statistically significant. However, the
average effect size, measured in standard deviations, is still only about half as large as
Hamermesh and Parker (2005) found for their sample. And, when clustering specifically by
instructor, as reported in the middle two columns of Table IV.5, the estimated differentials for
our sample are statistically significant only for Other instructors.
Table IV.5: Dependent variable: class-average overall student rating of instruction (4,297 observations)

Demographic variables: | Instructor only | Instructor only | Instructor plus student demographics
Standard error adjustment: | Robust | 443 clusters by instructor | 443 clusters by instructor
Independent variables | Coef. (Pr(Ho)) | Coef. (Pr(Ho)) | Coef. (Pr(Ho))
Constant | 3.579 (0.000) | 3.579 (0.000) | 3.738 (0.000)
100-level course | -0.082 (0.035) | -0.082 (0.235) | -0.094 (0.163)
400-level course | -0.103 (0.160) | -0.103 (0.272) | -0.098 (0.280)
Cultural Studies Program course | -0.116 (0.193) | -0.116 (0.491) | -0.107 (0.509)
Arts and Humanities course | 0.060 (0.094) | 0.060 (0.537) | 0.077 (0.421)
Science course | -0.223 (0.000) | -0.223 (0.040) | -0.219 (0.046)
Seminar course | 0.122 (0.009) | 0.122 (0.084) | 0.125 (0.066)
Part-time adjunct instructor | -0.357 (0.000) | -0.357 (0.000) | -0.355 (0.000)
Full-time adjunct instructor | -0.201 (0.000) | -0.201 (0.041) | -0.216 (0.028)
Years of experience at Occidental | 0.018 (0.043) | 0.018 (0.310) | 0.016 (0.371)
Years of experience if greater than six | -0.022 (0.006) | -0.022 (0.148) | -0.020 (0.184)
New offering by instructor | -0.221 (0.000) | -0.221 (0.004) | -0.218 (0.003)
Total enrollment | 0.004 (0.041) | 0.004 (0.285) | 0.004 (0.219)
Average student seniority | 0.040 (0.144) | 0.040 (0.350) | 0.035 (0.396)
Percent graduate student enrollment | -0.210 (0.440) | -0.210 (0.601) | -0.218 (0.604)
Percent of students enrolled for Core requirement | -0.430 (0.000) | -0.430 (0.005) | -0.425 (0.004)
Percent of students enrolled for Major requirement | -0.279 (0.000) | -0.279 (0.005) | -0.276 (0.005)
Evaluation response rate | 0.787 (0.000) | 0.787 (0.000) | 0.770 (0.000)
Average grade awarded | 0.012 (0.852) | 0.012 (0.899) | 0.022 (0.817)
Average expected grade | 0.588 (0.000) | 0.588 (0.000) | 0.590 (0.000)
Differential versus WMi for:
White Female instructor (WFi) | -0.107 (0.000) | -0.107 (0.192) | (race-gender differentials reported in Table IV.6)
Underrepresented minority Male instructor (UMi) | -0.179 (0.000) | -0.179 (0.122) | (reported in Table IV.6)
Underrepresented minority Female instructor (UFi) | -0.167 (0.000) | -0.167 (0.186) | (reported in Table IV.6)
Other Male instructor (OMi) | -0.349 (0.000) | -0.349 (0.005) | (reported in Table IV.6)
Other Female instructor (OFi) | -0.284 (0.000) | -0.284 (0.048) | (reported in Table IV.6)
Note: Overall class-average rating, weighted by number of respondents: mean = 5.924, standard deviation = 0.837.
Replacing the instructor race-gender dummy variables with the full set of student
demographic variables has virtually no effect on the estimated sizes and significance levels of the
non-demographic control variables, reported in the last two columns of Table IV.5. The
estimated race-gender differentials with the non-demographic control variables included are
reported in Table IV.6. Again excluding the outlier group of Other Female students and focusing
first on Panel A, the most striking effect is a sharp reduction in the sizes and statistical
significance of the diagonal entries that correspond to the own-group differentials versus the
overall sample average SRI. These estimated differentials now average only 0.24 ratings points
(0.28 standard deviations), compared to 0.51 ratings points when non-demographic variables
were excluded, as reported in Table IV.3. None of the revised own-group estimates is
statistically significant at probabilities under five percent, compared to three such instances
before. For the off-diagonal cross-group estimates, the average of the predominantly negative
estimated differentials likewise shows a smaller deviation from the overall mean SRI, rising 0.04
ratings points from -0.12 to -0.08 with the addition of the non-demographic control variables.
Table IV.6: Estimated average student rating of instructors (SRI) by race-gender subgroups of instructors and students, controlling for other factors that affect SRIs

Panel A: Each cell reports each subgroup's estimated class-average SRI differential
versus the overall sample average (Pr(Ho) in parentheses)

Instructor subgroup | WMs | WFs | UMs | UFs | OMs | OFs
WMi | 0.285 (0.089) | -0.212 (0.181) | 0.585 (0.061) | 0.137 (0.598) | 0.022 (0.949) | 0.165 (0.559)
WFi | -0.371 (0.102) | -0.036 (0.771) | 0.564 (0.160) | -0.083 (0.803) | 0.201 (0.589) | 0.392 (0.086)
UMi | -0.427 (0.302) | -0.457 (0.118) | 0.458 (0.152) | -0.066 (0.796) | 0.573 (0.608) | 0.594 (0.121)
UFi | -0.057 (0.870) | -0.075 (0.783) | -0.008 (0.989) | 0.113 (0.766) | 0.426 (0.441) | -0.683 (0.203)
OMi | 0.139 (0.660) | -0.175 (0.693) | -1.273 (0.084) | -1.261 (0.025) | 0.367 (0.349) | -0.248 (0.596)
OFi | -0.193 (0.637) | -0.032 (0.904) | -0.115 (0.883) | -0.136 (0.746) | 0.227 (0.524) | -0.831 (0.101)
Instructor subgroup weighted** average differential of each student subgroup from the overall sample average:
 | -0.042 | -0.153 | 0.351 | -0.046 | 0.194 | 0.081

Panel B: The diagonal entries serve as column benchmarks, and the other cells in each
column report differentials versus the column benchmark (Pr(Ho) in parentheses)

Instructor subgroup (weight**) | Enrollment wtd. avg.* | WMs | WFs | UMs | UFs | OMs | OFs
WMi (39.7%) | 0.105 (0.010) | 0.285 (0.089) | -0.176 (0.386) | 0.127 (0.780) | 0.024 (0.959) | -0.346 (0.499) | 0.997 (0.084)
WFi (28.5%) | -0.002 (0.972) | -0.656 (0.029) | -0.036 (0.771) | 0.105 (0.836) | -0.196 (0.694) | -0.167 (0.751) | 1.223 (0.028)
UMi (8.5%) | -0.077 (0.423) | -0.712 (0.113) | -0.421 (0.167) | 0.458 (0.152) | -0.180 (0.698) | 0.205 (0.863) | 1.426 (0.025)
UFi (9.5%) | -0.063 (0.560) | -0.342 (0.375) | -0.039 (0.895) | -0.466 (0.451) | 0.113 (0.766) | 0.059 (0.930) | 0.148 (0.843)
OMi (5.6%) | -0.243 (0.013) | -0.146 (0.681) | -0.140 (0.766) | -1.731 (0.032) | -1.374 (0.045) | 0.367 (0.349) | 0.584 (0.395)
OFi (8.3%) | -0.181 (0.152) | -0.478 (0.287) | 0.003 (0.991) | -0.573 (0.498) | -0.250 (0.665) | -0.140 (0.790) | -0.831 (0.101)
Instructor subgroup weighted** average differential of each student subgroup from its respective benchmark:
 | | -0.543 | -0.164 | -0.117 | -0.176 | -0.184 | 0.994

*derived from data in Table IV.2 Panel A. **share of total enrolled students in dataset.
Least squares estimation of 4,297 observations of class-average ratings weighted by number of respondents and clustered by 443 instructors.
Values in parentheses correspond to Pr(Ho). Ho for Panel A and diagonals of Panel B: deviation from average = 0. Ho for Panel B off-diagonal estimates: deviation from column benchmark = 0.
Note: Overall sample class-average SRI, weighted by number of respondents: mean = 5.924, standard deviation = 0.837.
The fall in the estimated differentials for own-groups and rise in the estimated differentials
for cross-groups together serve to substantially reduce the size and statistical significance of the
estimated cross-group benchmarked differentials reported in Panel B of Table IV.6. The average
size of the estimated ratings differentials within student groups relative to their respective own-group benchmark falls by nearly half, from 0.63 ratings points in Table IV.3 to 0.32 ratings
points (0.38 standard deviations) once non-demographic explanatory variables are included. The
pattern of results is less consistent as well, with six instances of positive differentials for cross-group ratings in Table IV.6 compared to only two in Table IV.3. Both with and without non-demographic control variables, the negative estimated differentials for White Male students
rating White Female instructors and Underrepresented minority students rating Other Male
instructors are statistically significant at probabilities under five percent. For the outlier group of
Other Female students, adding non-demographic control variables results in positive estimated
differentials that are statistically significant at slightly less than a three percent probability for
ratings of White Female and Underrepresented minority Male instructors.
Chart IV.1 illustrates the impact of the addition of the non-demographic control variables on
the estimated race-gender differentials. The bold dashed horizontal line indicates the overall
sample average rating (5.92). The thicker solid arrows show how the estimated own-group
ratings change. The thinner solid arrows show how the estimated cross-group ratings change, and
they constitute the average of the dotted-line arrows that correspond to the disaggregated cross-group ratings. Except for the outlier case of Other Female students, the arrows indicate
substantial convergence of the own-group and sub-group ratings when non-demographic factors
are included as control variables. For the first four groups of White and Underrepresented
minority students, the driving force is the large decline of the own-group ratings as the control
variables move them closer to the overall sample average. For Other Male and Other Female
students, the own-group ratings barely change, but the cross-group ratings rise substantially. This
reversed pattern nonetheless promotes convergence for Other Male students, in line with the
other student demographic groups. Only for Other Female students do the own-group and cross-group ratings exhibit divergence, as that is the only student group for which the initial own-group
rating is below its cross-group average. To paraphrase from previous research, the most
pronounced impact of the addition of non-demographic control variables is to reduce estimates
of student “same race-gender preferences” (Centra and Gaubatz, 2000: 32). It is in this respect
that omitted variable bias matters most here, and it matters enough to expose the fragility of
results for specifications that fail to control for non-demographic factors.
Chart IV.1: Estimated race-gender SRI differentials before and after
the addition of non-demographic control variables
[Figure: line chart plotting average estimated SRI (vertical axis, 4.500 to 7.500, with the overall sample average of 5.92 shown as a bold dashed line) for each instructor subgroup (WMi, WFi, UMi, UFi, OMi, OFi) and the cross-group average. Horizontal axis: student demographic group, where 1 = ratings before and 2 = ratings after including non-demographic control variables (WMs1, WMs2, ..., OFs1, OFs2).]
The student evaluation forms provide additional self-reported information, such as the
number of classes missed, self-assessment of the extent of knowledge and skills learned, and
ratings of specific teaching strengths. However, considerable uncertainty applies to the questions
of whether to include this self-reported information and, if so, how to interpret the resulting
estimates. For example, if in fact gender per se plays no role in student ratings of instructors, but
course organization differs by gender, then failure to include ratings of course organization
results in a false impression that gender matters. If, instead, gender does matter and to the same
extent in student ratings of both overall instruction and organization, then including student
ratings of organization masks the role of gender in student ratings. In an effort to utilize the self-reported information while being mindful of these uncertainties, we ran estimates that
progressively introduced self-reported information in three steps:
1. Self-reported behavior: (1) Hours worked outside class, (2) number of classes missed,
and (3) extent of conversations about course material outside of class.
2. Self-reported learning outcomes: (1) contribution of course to knowledge, (2)
contribution of course to skills, and (3) expected course grade.
3. Ratings of instructor effectiveness in specific areas: (1) communicating course goals, (2)
fulfilling course goals, (3) being organized, (4) giving clear assignments, (5) giving clear
grading criteria, (6) giving helpful feedback, (7) being clear, (8) stimulating intellectual
enthusiasm, (9) inviting students to confer outside class, (10) inviting questions, (11)
asking questions, and (12) responding well to questions.
Since these data items vary across individual student evaluations, we used each evaluation rather
than class averages as our unit of observation and estimated the equations as a multilevel model
grouped by class and clustered by instructor.
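The progressive introduction of the self-reported regressors can be sketched as a sequence of nested multilevel (random-intercept) models grouped by class. The sketch below uses synthetic data and hypothetical variable names; statsmodels' MixedLM handles the grouping by class, though the additional clustering by instructor used in the paper is not shown here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n, n_classes = 600, 40
class_id = rng.integers(0, n_classes, n)
class_effect = rng.normal(0.0, 0.3, n_classes)   # random intercept per class
df = pd.DataFrame({
    "class_id": class_id,
    "hours": rng.normal(6.0, 2.0, n),       # stand-in: self-reported behavior
    "knowledge": rng.normal(5.5, 1.0, n),   # stand-in: self-reported learning outcome
    "organized": rng.normal(6.0, 1.0, n),   # stand-in: rating of a teaching strength
})
df["sri"] = (5.9 + class_effect[class_id]
             + 0.1 * df["organized"] + rng.normal(0, 0.8, n))

formula = "sri ~ 1"
fits = []
# Add the three blocks of self-reported information step by step, refitting
# a random-intercept model grouped by class at each step.
for block in [" + hours", " + knowledge", " + organized"]:
    formula += block
    fits.append(smf.mixedlm(formula, df, groups=df["class_id"]).fit())
```

Comparing the estimated coefficients across the successive fits is the mechanism behind the "median decline" summaries reported below.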
Table IV.7a reports the results for the ratings differentials. Adding the self-reported student
behavior variables has only a modest effect on the estimated differentials: a less than 7 percent
median decline of the estimated size of the differentials. Self-reported learning outcomes have a
much larger effect, prompting a median decline of another 46 percent, and the ratings of specific
teaching strengths have an even larger effect, resulting in a further median decline of 56 percent.
Taken together, the addition of all of the self-reported variables is associated with an 82 percent
median decline in the size of the estimated differentials. A small number of estimated
differentials remain statistically significant, but the sizes are uniformly small, averaging less than
0.05 ratings points for both the own-group differentials versus the sample mean and the cross-group differentials versus their respective benchmarks.
Table IV.7a: Ratings differentials versus (1) the sample mean rating for own-groups and (2) the own-group rating for cross-groups: the impact of including self-reported information

Specifications: (1) includes student demographics and the previous list of non-demographic control variables; (2) plus self-reported student behavior (hours, absences, conversations); (3) plus self-reported learning outcomes (knowledge, skills); (4) plus student ratings of teaching strengths.

 | (1) | (2) | (3) | (4)
Observations | 71,467 | 70,263 | 68,108 | 66,240
Sample mean Student Rating of Instructor | 5.922 | 5.922 | 5.929 | 5.938

Each cell reports Diff. (Pr(Ho)):

White Male students evaluating…
White Male instructors | 0.274 (0.077) | 0.274 (0.089) | 0.137 (0.124) | 0.094 (0.002)
White Female instructors | -0.572 (0.028) | -0.547 (0.032) | -0.280 (0.062) | -0.109 (0.040)
Underrepresented minority Male instructors | -0.759 (0.079) | -0.803 (0.058) | -0.557 (0.039) | -0.193 (0.044)
Underrepresented minority Female instructors | -0.441 (0.250) | -0.344 (0.356) | -0.099 (0.656) | -0.041 (0.614)
Other Male instructors | -0.031 (0.931) | -0.020 (0.956) | 0.011 (0.963) | 0.107 (0.220)
Other Female instructors | -0.494 (0.245) | -0.490 (0.240) | -0.298 (0.218) | -0.041 (0.573)

White Female students evaluating…
White Male instructors | 0.008 (0.967) | 0.010 (0.960) | 0.001 (0.994) | 0.007 (0.878)
White Female instructors | -0.048 (0.694) | -0.044 (0.715) | -0.092 (0.194) | -0.073 (0.013)
Underrepresented minority Male instructors | -0.377 (0.225) | -0.389 (0.187) | -0.108 (0.578) | -0.017 (0.820)
Underrepresented minority Female instructors | 0.109 (0.723) | 0.120 (0.684) | 0.079 (0.651) | -0.018 (0.785)
Other Male instructors | -0.328 (0.449) | -0.296 (0.504) | -0.269 (0.198) | -0.050 (0.438)
Other Female instructors | 0.043 (0.889) | 0.024 (0.936) | -0.019 (0.906) | 0.032 (0.658)

Underrepresented minority Male students evaluating…
White Male instructors | -0.163 (0.688) | -0.169 (0.683) | 0.134 (0.643) | 0.063 (0.591)
White Female instructors | -0.286 (0.596) | -0.131 (0.803) | 0.031 (0.918) | 0.028 (0.812)
Underrepresented minority Male instructors | 0.668 (0.022) | 0.581 (0.048) | 0.247 (0.261) | 0.076 (0.408)
Underrepresented minority Female instructors | -0.756 (0.212) | -0.679 (0.255) | -0.235 (0.490) | -0.084 (0.601)
Other Male instructors | -1.539 (0.043) | -1.261 (0.088) | -0.220 (0.628) | 0.036 (0.840)
Other Female instructors | -0.624 (0.451) | -0.635 (0.432) | -0.036 (0.917) | -0.075 (0.614)

Underrepresented minority Female students evaluating…
White Male instructors | -0.093 (0.836) | -0.023 (0.958) | -0.018 (0.949) | -0.009 (0.918)
White Female instructors | -0.265 (0.613) | -0.301 (0.556) | -0.113 (0.704) | 0.000 (0.997)
Underrepresented minority Male instructors | 0.093 (0.835) | 0.120 (0.784) | 0.095 (0.757) | 0.112 (0.235)
Underrepresented minority Female instructors | 0.081 (0.827) | 0.039 (0.911) | 0.072 (0.753) | -0.029 (0.680)
Other Male instructors | -1.467 (0.012) | -1.369 (0.018) | -0.872 (0.018) | -0.241 (0.103)
Other Female instructors | -0.096 (0.856) | -0.013 (0.979) | 0.078 (0.810) | 0.010 (0.955)

Other Male students evaluating…
White Male instructors | -0.193 (0.665) | -0.240 (0.575) | -0.218 (0.445) | -0.114 (0.363)
White Female instructors | -0.053 (0.910) | -0.045 (0.919) | -0.026 (0.926) | -0.158 (0.199)
Underrepresented minority Male instructors | 0.014 (0.990) | 0.086 (0.930) | 0.028 (0.966) | -0.056 (0.804)
Underrepresented minority Female instructors | 0.095 (0.872) | -0.002 (0.997) | -0.123 (0.750) | -0.235 (0.123)
Other Male instructors | 0.244 (0.469) | 0.248 (0.432) | 0.166 (0.440) | 0.218 (0.033)
Other Female instructors | 0.030 (0.947) | 0.056 (0.894) | -0.048 (0.853) | -0.135 (0.279)

Other Female students evaluating…
White Male instructors | 0.936 (0.098) | 0.860 (0.141) | 0.359 (0.210) | 0.001 (0.992)
White Female instructors | 1.494 (0.006) | 1.415 (0.013) | 0.572 (0.036) | 0.004 (0.974)
Underrepresented minority Male instructors | 1.329 (0.051) | 1.237 (0.074) | 0.345 (0.379) | -0.101 (0.530)
Underrepresented minority Female instructors | 0.281 (0.684) | 0.281 (0.688) | 0.152 (0.667) | 0.071 (0.662)
Other Male instructors | 0.724 (0.265) | 0.613 (0.369) | 0.155 (0.655) | -0.137 (0.315)
Other Female instructors | -0.989 (0.054) | -0.921 (0.086) | -0.305 (0.217) | -0.023 (0.817)
However, it is evident from Table IV.7b that the same impact pattern pertains to the
estimated coefficients for our initial set of non-demographic control variables as well. A median
decline of 8 percent results from the addition of the behavior variables, another 41 percent from
the learning outcomes variables, and another 65 percent from the ratings of teaching strengths.
The overall median decline totals 85 percent.
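The median-decline summaries reported here and for Table IV.7a can be computed mechanically from the coefficient columns. A small sketch, using the first six White-Male-student differentials from Table IV.7a (specifications 1 and 2) purely for illustration:

```python
import numpy as np

# Estimates under the narrower specification (1) and after adding the
# self-reported behavior variables (2); values from the top of Table IV.7a.
before = np.array([0.274, -0.572, -0.759, -0.441, -0.031, -0.494])
after  = np.array([0.274, -0.547, -0.803, -0.344, -0.020, -0.490])

# Proportional decline in magnitude for each estimate, then the median.
decline = 1.0 - np.abs(after) / np.abs(before)
median_decline = np.median(decline)
```

For this small subset the median decline works out to roughly 3 percent, consistent with the "modest effect" of the behavior variables described for the full table.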
Table IV.7b: The impact of including self-reported information on estimated coefficients for non-demographic variables

Each cell reports Coef. (Pr(Ho)). Columns: (1) Demographics; (2) + behavior; (3) + outcomes; (4) + ratings.

Included variables | (1) | (2) | (3) | (4)
100-level course | -0.087 (0.150) | -0.084 (0.150) | -0.052 (0.135) | 0.006 (0.609)
400-level course | -0.162 (0.051) | -0.230 (0.007) | -0.059 (0.232) | -0.021 (0.233)
Cultural Studies Program course | -0.232 (0.047) | -0.263 (0.024) | -0.180 (0.016) | -0.027 (0.297)
Arts and Humanities course | 0.051 (0.587) | 0.066 (0.468) | 0.024 (0.620) | 0.033 (0.029)
Science course | -0.285 (0.008) | -0.289 (0.005) | -0.174 (0.002) | -0.016 (0.275)
Seminar course | 0.081 (0.241) | 0.061 (0.375) | 0.024 (0.584) | -0.010 (0.495)
Part-time adjunct instructor | -0.334 (0.000) | -0.320 (0.000) | -0.192 (0.000) | -0.049 (0.004)
Full-time adjunct instructor | -0.196 (0.041) | -0.172 (0.067) | -0.102 (0.073) | -0.035 (0.039)
Years of experience at Occidental | 0.011 (0.536) | 0.010 (0.575) | 0.007 (0.507) | 0.003 (0.280)
Years of experience if greater than six | -0.016 (0.290) | -0.015 (0.329) | -0.010 (0.264) | -0.003 (0.323)
New offering by instructor | -0.184 (0.012) | -0.184 (0.009) | -0.088 (0.028) | -0.034 (0.012)
Total enrollment | 0.002 (0.563) | 0.002 (0.478) | 0.003 (0.104) | 0.000 (0.898)
Average student seniority | 0.066 (0.090) | 0.047 (0.215) | -0.023 (0.331) | -0.018 (0.027)
Percent graduate student enrollment | 0.038 (0.925) | -0.116 (0.751) | 0.031 (0.871) | -0.016 (0.814)
Percent of students enrolled for Core requirement | -0.041 (0.003) | -0.025 (0.061) | 0.010 (0.344) | 0.006 (0.400)
Percent of students enrolled for Major requirement | 0.045 (0.000) | 0.018 (0.126) | -0.012 (0.189) | 0.004 (0.444)
Evaluation response rate | 0.708 (0.000) | 0.625 (0.000) | 0.336 (0.000) | 0.073 (0.022)
Average grade awarded | 0.247 (0.002) | 0.228 (0.003) | 0.212 (0.000) | 0.073 (0.000)
Average expected grade | 0.324 (0.000) | 0.286 (0.000) | 0.070 (0.000) | -0.002 (0.669)
Average hours worked outside of class | | 0.010 (0.000) | 0.000 (0.875) | -0.001 (0.162)
Average number of classes missed | | -0.016 (0.000) | 0.003 (0.083) | 0.002 (0.038)
Average rating: discussed course material outside of class | | 0.138 (0.000) | 0.028 (0.000) | -0.003 (0.143)
Average rating:
Course contribution to knowledge | | | 0.421 (0.000) | 0.115 (0.000)
Course contribution to skills | | | 0.374 (0.000) | 0.069 (0.000)
Instructor communicated goals | | | | -0.030 (0.000)
Instructor fulfilled goals | | | | 0.162 (0.000)
Course organization | | | | 0.141 (0.000)
Clear assignments | | | | 0.029 (0.000)
Clear grading criteria | | | | 0.030 (0.000)
Helpful feedback | | | | 0.087 (0.000)
Clear explanations of concepts | | | | 0.182 (0.000)
Motivated intellectual enthusiasm | | | | 0.171 (0.000)
Instructor invited individual meetings | | | | 0.023 (0.000)
Instructor invited questions | | | | 0.015 (0.002)
Instructor asked questions | | | | 0.050 (0.000)
Instructor was responsive to questions | | | | 0.133 (0.000)
The estimated coefficients for all of the variables for self-reported outcomes and ratings of
teaching strength are statistically significant and all but one are positive, as expected. Moreover,
the ranking of the sizes of the estimated relationships between the itemized teaching strengths
and the overall SRI has a commonsense pattern to it, with clarity, intellectual enthusiasm,
fulfilling course goals, organization and responsiveness to questions topping the list. However,
the association between the overall SRI and the additional self-reported information variables is
virtually tautological, so it is hardly surprising that including them swamps the estimated effects
of both the non-demographic variables and the demographic variables included in our original
specification.
IV.D. Applying a Oaxaca decomposition to average SRI differentials by gender and race
The Oaxaca (1973) decomposition was initially utilized by labor economists to explore the
possibility of wage discrimination by gender. The decomposition entails a two-step process in
which (1) a wage-determination regression equation is estimated for a benchmark group, say male workers (M′X̄M), and (2) the average comparison-group wage, in this case for female workers, is predicted by evaluating the benchmark equation at the comparison-group means for the explanatory variables (M′X̄F). The value of the gap between this predicted wage and the benchmark wage constitutes the amount of the gender wage differential that is “explained” by differences in productivity characteristics (“endowments”); any remaining gap is unexplained by productivity differences and may be due to gender discrimination in the labor market.
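A worked numerical sketch of the two-step decomposition, in its original wage setting, may help fix ideas. The data are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
nM, nF = 400, 400
# Synthetic "productivity" characteristic and wages for two groups.
xM = rng.normal(10.0, 2.0, nM)                 # benchmark (male) group
xF = rng.normal(9.0, 2.0, nF)                  # comparison (female) group
wM = 2.0 + 0.5 * xM + rng.normal(0, 1, nM)
wF = 1.5 + 0.5 * xF + rng.normal(0, 1, nF)

# Step 1: estimate the wage equation on the benchmark group only.
XM = np.column_stack([np.ones(nM), xM])
beta_M, *_ = np.linalg.lstsq(XM, wM, rcond=None)

# Step 2: predict the comparison group's mean wage at its own mean
# characteristics, using the benchmark coefficients (M'X̄F in the text).
predicted_F = beta_M @ np.array([1.0, xF.mean()])

total_gap = wM.mean() - wF.mean()
explained = wM.mean() - predicted_F     # due to endowment differences
unexplained = predicted_F - wF.mean()   # residual; candidate discrimination
```

By construction the explained and unexplained portions sum exactly to the total gap, which is the accounting identity underlying the amount and percent columns of Table IV.8.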
Here, we apply a Oaxaca decomposition to SRI differentials by gender and race. We again incorporate the student self-reported information in the same sequence as in Tables IV.7a and IV.7b, so we likewise again estimate a multilevel model of individual student evaluations grouped by class and clustered by instructor. Table IV.8 reports the decomposition results, Panel A for gender and
Panel B for race.
Table IV.8: Blinder-Oaxaca decomposition of overall student rating of instructor (SRI) differentials by gender and race

Predicted SRI when average female/nonwhite instructor characteristics are substituted into the estimated benchmark regression equation, which includes non-demographic control variables plus, successively, subgroup-specific average student self-reported data for behavior, learning-outcome ratings, and teaching-strength ratings.

Specifications: (1) benchmark student demographics; (2) plus self-reported behavior; (3) plus learning-outcome ratings; (4) plus teaching-strength ratings. Cells report Avg. SRI (number of respondents) or amount (percent of total differential).

Panel A: By gender. Benchmark = Male instructors
 | (1) | (2) | (3) | (4)
Benchmark respondents | 5.953 (38,381) | 5.953 (37,741) | 5.961 (36,637) | 5.969 (35,609)
Respondents for Female instructors | 5.887 (33,086) | 5.887 (32,522) | 5.893 (31,471) | 5.903 (30,631)
Total differential versus benchmark | -0.066 | -0.066 | -0.067 | -0.067
Accounted for by differences in…
1. Instructor non-demographic factors | -0.001 (-1.1%) | -0.002 (-2.7%) | 0.001 (1.2%) | -0.001 (-2.2%)
2. Student self-reported information and ratings | | -0.019 (-28.4%) | -0.068 (-100.4%) | -0.044 (-65.6%)
3. Class non-demographic characteristics | 0.000 (-0.7%) | 0.004 (5.5%) | 0.004 (6.4%) | 0.005 (8.0%)
4. Student race-gender class composition | -0.014 (-20.8%) | -0.013 (-19.4%) | -0.006 (-9.0%) | -0.008 (-12.4%)
Predicted average SRI (M′X̄F) | 5.938 | 5.923 | 5.892 | 5.921
Unexplained differential | -0.051 (-77.4%) | -0.036 (-55.0%) | 0.001 (1.7%) | -0.018 (-27.7%)

Panel B: By race. Benchmark = White instructors
 | (1) | (2) | (3) | (4)
Benchmark respondents | 5.953 (48,645) | 5.952 (47,786) | 5.959 (46,399) | 5.969 (45,096)

Panel B1: Respondents for Underrepresented minority instructors | 5.933 (12,886) | 5.935 (12,683) | 5.944 (12,250) | 5.952 (11,922)
Total differential versus benchmark | -0.019 | -0.017 | -0.015 | -0.017
Accounted for by differences in…
1. Instructor non-demographic factors | 0.066 (340.9%) | 0.061 (349.4%) | 0.042 (268.7%) | 0.008 (44.9%)
2. Student self-reported information and ratings | | -0.003 (-16.5%) | -0.032 (-209.1%) | -0.006 (-34.9%)
3. Class non-demographic characteristics | 0.029 (151.6%) | 0.033 (192.0%) | 0.022 (139.4%) | 0.003 (18.1%)
4. Student race-gender class composition | -0.004 (-19.3%) | -0.005 (-28.8%) | 0.007 (42.4%) | -0.002 (-12.3%)
Predicted average SRI (W′X̄U) | 6.044 | 6.039 | 5.996 | 5.971
Unexplained differential | -0.111 (-573.2%) | -0.104 (-596.2%) | -0.053 (-341.5%) | -0.019 (-115.9%)

Panel B2: Respondents for Other instructors | 5.754 (9,936) | 5.754 (9,794) | 5.763 (9,459) | 5.772 (9,222)
Total differential versus benchmark | -0.199 | -0.199 | -0.196 | -0.196
Accounted for by differences in…
1. Instructor non-demographic factors | 0.036 (17.9%) | 0.031 (15.6%) | 0.021 (10.7%) | -0.002 (-1.0%)
2. Student self-reported information and ratings | | -0.017 (-8.3%) | -0.124 (-63.3%) | -0.203 (-103.2%)
3. Class non-demographic characteristics | 0.019 (9.6%) | 0.022 (11.1%) | 0.016 (8.0%) | -0.002 (-0.9%)
4. Student race-gender class composition | 0.015 (7.6%) | 0.013 (6.4%) | 0.012 (6.1%) | 0.008 (4.1%)
Predicted average SRI (W′X̄O) | 6.022 | 6.001 | 5.883 | 5.770
Unexplained differential | -0.269 (-135.0%) | -0.248 (-124.8%) | -0.121 (-61.5%) | 0.002 (1.0%)

Note: Variables include: 1. Instructor non-demographic factors: adjunct status, experience, new course preparation, response rate, GPA and expected grade; 2. Student self-reported information and ratings: (1) behavior: hours, absences, extent of out-of-class course conversation, (2) ratings of learning outcomes, and (3) ratings of itemized teaching strengths; 3. Class non-demographic characteristics: course level and division, enrollment, seminar type, average student seniority, and percentages enrolled for Core and for major requirements; 4. Student race-gender class composition: percentage enrollment of White Females, Underrepresented minority Males, Underrepresented minority Females, Other Males and Other Females.
The Oaxaca decomposition as applied here includes some complexities not encountered in
the analysis of gender wage differentials, so it is instructive to first interpret the pattern of results
reported in Table IV.8. Consider, for example, the second set of columns in Panel A, the student
ratings differential by gender for the specification which includes student self-reported data for
study hours, absences and the extent of out-of-class conversations about course material. Overall,
the average SRI for female instructors is 0.066 points lower than for male instructors. The first
category of explanatory variables, the non-demographic instructor factors such as teaching
experience, is most analogous to productivity characteristics in wage determination models, and
here it accounts for less than three percent of the total ratings differential. Self-reported student
behavior variables account for another 28 percent of the differential and may serve as proxies for
instructor effectiveness in engaging students, although we hasten to note as before that it is
unclear whether the association between the overall SRI and other student self-reported
information reflects causation or codetermination. Non-demographic class characteristics such as
enrollment and course discipline contribute a positive 0.004 ratings points (5.5% of the total
differential), suggesting that Male instructors facing the class characteristics that Female
instructors experience on average would receive a higher rather than lower rating, thereby
widening instead of narrowing the adjusted gender differential. The demographic composition of
enrolled students accounts for 19.4% of the ratings differential, which helps to close the ratings
gap while simultaneously falling into the category of ratings factors that are linked to race and
gender. After controlling for all of these variables, 0.036 ratings points, amounting to 55% of the
unadjusted differential, remains unexplained. Combined with the
student demographic variables, just under 75% of the total differential may be attributable to
gender considerations.
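For readers unfamiliar with the mechanics, the two-fold decomposition used above can be illustrated on synthetic data. The Python sketch below is illustrative only: the variable names and magnitudes are invented and do not come from our dataset. It fits separate regressions for a benchmark and a comparison group, forms the counterfactual prediction that applies benchmark coefficients to the comparison group's mean characteristics (the analogue of the predicted average SRI, M'X̅F), and splits the raw gap into explained and unexplained components.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, y):
    """OLS coefficients via least squares (X includes a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic data: benchmark group M and comparison group F with one
# explanatory characteristic (e.g. instructor experience).
n = 5000
x_m = rng.normal(10.0, 2.0, n)                   # benchmark group characteristic
x_f = rng.normal(9.0, 2.0, n)                    # comparison group has a lower mean
y_m = 5.0 + 0.10 * x_m + rng.normal(0, 0.5, n)
y_f = 4.9 + 0.10 * x_f + rng.normal(0, 0.5, n)   # lower intercept: the "unexplained" part

Xm = np.column_stack([np.ones(n), x_m])
Xf = np.column_stack([np.ones(n), x_f])
beta_m = ols(Xm, y_m)
beta_f = ols(Xf, y_f)

gap = y_f.mean() - y_m.mean()                    # total differential
# Counterfactual: benchmark coefficients applied to comparison-group means
predicted_f = Xf.mean(axis=0) @ beta_m
explained = predicted_f - y_m.mean()             # due to characteristics
unexplained = y_f.mean() - predicted_f           # due to coefficients

print(f"total {gap:.3f} = explained {explained:.3f} + unexplained {unexplained:.3f}")
```

By construction the two components sum exactly to the total differential, which is what permits the percentage attributions reported in Table IV.8.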
The unadjusted and unexplained differentials in Table IV.8 are generally small for both
gender and race. The largest gap is -0.269 ratings points (-0.32 standard deviations) for the
unexplained differential between Other instructors and White instructors, and all of the
deviations for Female or Underrepresented minority instructors are less than half as large. The
non-demographic factors for both instructors and the classes they teach never account for much
of the ratings differentials by either gender or race. In fact, for nearly all of the decompositions
by race, the non-demographic factors widen the adjusted differentials instead of narrowing them
and thereby increase the size of the unexplained differential. This finding is consistent with the
earlier results reported in Section IV.C in which the estimated race-gender ratings differentials
were larger with non-demographic control variables (Table IV.5) than without them (Table
IV.1). Factoring student ratings of learning outcomes and itemized teaching strengths into the
decomposition accounts for large percentages of the observed differentials, but our previous
caveats apply regarding the nature of the association between these disaggregated ratings and the
overall Student Rating of Instructors.
IV.E. Findings from a subsample of multi-section courses taught contemporaneously by the
same instructor
Our total sample of 4,297 classes includes a subsample of 440 multi-section courses
(accounting for 895 classes in the full sample) taught by the same instructor. Nearly all of these
are pairs of sections, though in a few cases the same instructor teaches three at once. Since most of the non-demographic
control variables are constant across these sets of course sections, the subsample affords an
opportunity to focus more precisely on the relationship between ratings of instructors and the
demographic composition of the classes they teach. We have done that here by treating the
subsample as an unbalanced panel dataset for estimation purposes.
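The estimation strategy can be sketched compactly: the within (fixed-effects) transformation demeans each variable across the sections a given instructor teaches, so any instructor- or course-level factor that is constant across sections drops out, leaving only the variation in class composition. The Python example below is a minimal illustration on simulated data; the group sizes, effect size and variable names are hypothetical, not taken from our subsample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unbalanced panel: each multi-section course is a group of 2 or 3 sections
# taught by the same instructor. The course/instructor effect a_i is fixed
# within a group, while class composition x varies across sections.
sizes = rng.choice([2, 3], size=400, p=[0.9, 0.1])
groups = np.repeat(np.arange(sizes.size), sizes)
a = rng.normal(0, 0.4, sizes.size)[groups]       # fixed course/instructor effect
x = rng.normal(0.5, 0.2, groups.size)            # e.g. share of female students
y = a + 0.3 * x + rng.normal(0, 0.1, groups.size)

def within_transform(v, g):
    """Subtract the group mean from each observation (handles unbalanced groups)."""
    means = np.bincount(g, weights=v) / np.bincount(g)
    return v - means[g]

# Demeaning removes a_i entirely; OLS on the demeaned data is the within estimator.
y_dm = within_transform(y, groups)
x_dm = within_transform(x, groups)
beta_fe = (x_dm @ y_dm) / (x_dm @ x_dm)
print(f"within estimate of the class-composition effect: {beta_fe:.3f}")
```

The within estimator recovers the composition coefficient (0.3 here) without ever estimating the hundreds of group effects, which is the logic behind treating the subsample as a panel.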
Table IV.9 reports the results. For ease of comparison, Panel A is a repeat of the full-sample
results from Table IV.6. The panel data estimates appear in Panel B. The impact of sample size is
apparent: none of the five estimated cross-group differentials that are statistically significant in
the full sample remain so in the smaller panel data subsample even though three of the estimates
increase in size. Instead, the only statistically significant finding in the panel data is a positive
estimated differential for the average rating of White Male instructors by Underrepresented
minority Male students. Consistent with the full-sample estimates, there are no statistically
significant estimated same-group differentials in the panel data sample. However, it is worth
noting that statistical significance for the panel data is a tall order with only 164 different
instructors sorted into six demographic groups that range in size from 16 to 55 instructors, with a
maximum of 21 for the four groups of nonwhite instructors.
Table IV.9: Ratings differentials versus (1) sample mean rating for own-groups and (2) own-group
benchmark rating for cross-groups: total sample versus sample of multi-section courses taught by
the same instructor

                           Panel A: All classes           Panel B: Multi-section courses
                           (from Table IV.6)              taught by the same instructor
Unit of observation        4,297 classes                  440 multi-section courses (895 classes)
Standard error adjustment  443 clusters by instructor     164 clusters by instructor
Estimation method          WLS (by Evaluation Count)      XTREG (unbalanced panel data)
Sample mean SRI            5.924                          5.863

Panel A: All classes (from Table IV.6)
Instructor                       Student subgroup
subgroup      WMs      WFs      UMs      UFs      OMs      OFs
WMi         0.285   -0.176    0.585    0.137    0.022    0.165
           (0.089)  (0.386)  (0.780)  (0.959)  (0.499)  (0.084)
WFi        -0.656   -0.036    0.564   -0.083    0.201    0.392
           (0.029)  (0.771)  (0.836)  (0.694)  (0.751)  (0.028)
UMi        -0.712   -0.421    0.458   -0.066    0.573    0.594
           (0.113)  (0.167)  (0.152)  (0.698)  (0.863)  (0.025)
UFi        -0.342   -0.039   -0.008    0.113    0.426   -0.683
           (0.375)  (0.895)  (0.451)  (0.766)  (0.930)  (0.843)
OMi        -0.146   -0.140   -1.273   -1.261    0.367   -0.248
           (0.681)  (0.766)  (0.032)  (0.045)  (0.349)  (0.395)
OFi        -0.478    0.003   -0.115   -0.136    0.227   -0.831
           (0.287)  (0.991)  (0.498)  (0.665)  (0.790)  (0.101)

Panel B: Multi-section courses taught by the same instructor
Instructor                       Student subgroup
subgroup      WMs      WFs      UMs      UFs      OMs      OFs
WMi         0.257   -0.202    1.309    0.068   -0.026    0.850
           (0.191)  (0.525)  (0.006)  (0.874)  (0.972)  (0.276)
WFi        -0.676    0.030    0.050    0.266    0.057    0.839
           (0.056)  (0.897)  (0.932)  (0.612)  (0.942)  (0.278)
UMi         0.311   -0.657    0.164    0.278    0.477    0.248
           (0.331)  (0.101)  (0.531)  (0.615)  (0.664)  (0.789)
UFi         0.255   -0.638    0.424   -0.175   -0.588   -0.105
           (0.459)  (0.101)  (0.469)  (0.546)  (0.436)  (0.895)
OMi         1.043   -0.110   -1.318    0.167   -0.146   -1.333
           (0.109)  (0.891)  (0.443)  (0.902)  (0.807)  (0.204)
OFi        -0.456   -0.491    0.705    0.895    0.529   -0.555
           (0.338)  (0.213)  (0.189)  (0.158)  (0.414)  (0.415)

Values in parentheses correspond to Pr(Ho). Ho for Panel A and diagonals of Panel B: Deviation
from average = 0. Ho for Panel B off-diagonal estimates: Deviation from column benchmark = 0.
The panel-data estimates themselves also differ considerably from their corresponding full-sample estimates. Most of the changes in the estimated cross-group differentials are sizeable,
averaging 0.49 ratings points in absolute value (0.59 standard deviations). Moreover, the pattern
of the estimates reverses: 18 of the 30 estimated cross-group differentials are negative for the full
sample and positive for the panel data. In short, the panel-data estimates here fail to provide
corroborating support for the full-sample results or evidence that class demographic composition
plays a role in the ratings of individual instructors across the classes they teach.
V. Concluding observations
We have utilized in this paper several well-established econometric techniques to explore
potential race-gender differentials in Student Ratings of Instructors for a large sample of student
evaluations from a diverse liberal arts college. Our results follow a ping-pong pattern. For our
full-sample estimates, initial small but statistically significant differentials become statistically
insignificant with appropriate clustering of the data; then a coherent pattern of sizeable, and in
some cases statistically significant, own-group and cross-group differentials emerges when
student class composition is incorporated, only to become considerably less consistent, sizeable,
and statistically significant once non-demographic control variables are included. An Oaxaca
decomposition suggests that, at the aggregate level, non-demographic characteristics of
instructors and classes do not account for the observed ratings differentials by instructor gender
and race, a finding that is consistent with the possibility that the differentials arise from race and
gender considerations per se, although the differentials themselves are small. Panel-data
estimation applied to a subsample of multi-section courses taught by the same instructor yields
results that do not provide evidence of race-gender ratings differentials. It may be the case that
the same institutional commitment to diversity that has facilitated the relatively wide range of
pairings of instructor-student demographics in our dataset has also contributed to sample
selection bias by attracting students who themselves place a relatively high premium on
diversity.
In any event, robust and statistically significant findings related to race or gender differentials
must pass through a challenging gantlet that includes clustering of observations to adjust for
heteroskedasticity, demographic heterogeneity on both sides of the equation, and potential
control variables that risk omitted variable bias if excluded. Dataset challenges alone are
daunting. Our raw data consist of 74,072 student evaluations spanning seven academic years, but
the effective sample size shrivels once the data are suitably clustered into 4,297 classes taught by
443 instructors which, despite our relatively high faculty diversity, subdivide into six
demographic categories in which nonwhite instructors range in number from 26 to 42 overall and
only 9 to 14 when adjunct faculty are excluded. The persistence of discrimination remains an
important social concern, and teasing out the extent to which race and gender differentials reflect
discrimination remains a challenging methodological concern. Prospectively, student ratings of
instructors constitutes a rich source of information to explore these issues, but the results here
illustrate the high hurdles encountered right at the starting point of determining the magnitude
and statistical significance of the differentials themselves. Much larger datasets, particularly
datasets which span multiple institutions, or data which include the race and gender of individual
student evaluators, might well shed additional light on whether the more meaningful finding
here regarding own-group and cross-group ratings differentials is the relatively large size of the
estimates in some specifications or their statistical insignificance in most cases.
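The role of clustering in that gantlet is easily illustrated: when individual evaluations within a class share a common class-level shock, naive standard errors that treat the evaluations as independent can overstate precision severalfold. The Python sketch below compares a naive OLS standard error with a cluster-robust (sandwich) standard error on simulated data; all parameters are illustrative rather than drawn from our sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# Evaluations clustered by class: a shared class-level shock u makes naive
# OLS standard errors overstate precision; cluster-robust SEs correct this.
n_classes, per_class = 200, 20
cls = np.repeat(np.arange(n_classes), per_class)
x = rng.normal(0, 1, n_classes)[cls]             # regressor varies at class level
u = rng.normal(0, 1, n_classes)[cls]             # class-level error component
y = 0.2 * x + u + rng.normal(0, 1, cls.size)     # plus idiosyncratic noise

X = np.column_stack([np.ones(cls.size), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Naive (iid) variance estimate
sigma2 = resid @ resid / (cls.size - X.shape[1])
se_naive = np.sqrt((sigma2 * XtX_inv)[1, 1])

# Cluster-robust sandwich: sum outer products of within-cluster score vectors
meat = np.zeros((2, 2))
for c in range(n_classes):
    idx = cls == c
    s = X[idx].T @ resid[idx]
    meat += np.outer(s, s)
se_cluster = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])

print(f"naive SE {se_naive:.3f} vs cluster-robust SE {se_cluster:.3f}")
```

With this design the cluster-robust standard error is several times the naive one, which is why statistical significance in our full sample hinges on how the 74,072 evaluations are clustered.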
Some pragmatic implications emerge from our analysis. The pattern of student ratings of
instructors in our sample is more aptly characterized as a salad bowl than a melting pot. Several
sizeable estimated differentials for specific student-instructor demographic pairings coexist with
overall ratings differences between instructor demographic groups that are generally not large
enough to play a material role in tenure and promotion decisions.3 The larger disaggregated
differentials may be important, however, in particular teaching situations, such as required
courses with limited instructor options for students to choose from. And for institutions
undergoing an increase in diversity, balanced progress on both sides of the lectern is
advantageous: it gives diverse students the instructor options they may well value while
reducing the potential impact of student homogeneity on the ratings of diverse
instructors. It is advisable for tenure and promotion committees to be mindful of these
considerations, but it appears unnecessary, at least from the findings here, to make systematic
adjustments to average student ratings based on instructor demographics at relatively diverse
institutions.
3 Other Male instructors constitute the major possible exception, although our dataset includes
only nine Asian Male faculty on tenured/tenure-track appointment.
References
1. Algozzine, B., et al. (2004). Student evaluation of college teaching: A practice in search of
principles. College Teaching, 52 (4), 134-141.
2. Anderson, K. H., & Siegfried, J. (1997). Gender differences in rating the teaching of
economics. Eastern Economic Journal, 23 (3), 347-357.
3. Anderson, K. J., & Smith, G. (2005). Students’ preconceptions of professors: Benefits and
barriers according to ethnicity and gender. Hispanic Journal of Behavioral Sciences, 2,
184-201.
4. Arreola, R.A. (2000). Developing a comprehensive faculty evaluations system: A handbook
for college faculty and administrators on designing and operating a comprehensive
faculty evaluations system (2nd ed.). Bolton, MA: Anker.
5. Cashin, W.E. (1995) Student ratings of teaching: The research revisited. IDEA Paper No.
32. Manhattan, KS: Kansas State University Center for Faculty Evaluation &
Development.
6. Centra, J.A., & Gaubatz, N.B. (2000). Is there gender bias in student evaluations of
teaching? Journal of Higher Education, 70 (1), 17-33.
7. Feldman, K.A. (1993). College students’ views of male and female college teachers: Part II:
Evidence from students’ evaluations of their classroom teachers. Research in Higher
Education, 34 (2), 151-211.
8. Gravestock, P., & Gregor-Greenleaf, E. (2008). Student course evaluations: Research,
models, and trends. Toronto: Higher Education Quality Council of Ontario, from
http://www.heqco.ca/SiteCollectionDocuments/Student Course Evaluations.pdf.
9. Hamermesh, D.S., & Parker, A.M. (2005). Beauty in the classroom: Instructors’ pulchritude
and putative pedagogical productivity. Economics of Education Review, 24 (4), 369-376.
10. Hativa, N. (2013). Student Ratings of Instruction: Recognizing Effective Teaching. United
States: Oron Publications.
11. Oaxaca, R. (1973). Male-female wage differentials in urban labor markets. International
Economic Review, 14 (October 1973), 693-709.
12. Saunders, K. T., & Saunders, P. (1999). The influence of instructor gender on learning and
instructor ratings. Atlantic Economic Journal, 27 (4), 460-473.
13. Smith, B.P. (2007). Student ratings of teaching effectiveness: An analysis of end-of-course
faculty evaluations. College Student Journal 41 (4), 788-800.
14. Smith, B.P., & Hawkins, B. (2011). Examining student evaluations of black college faculty:
Does race matter? The Journal of Negro Education 80 (2), 149-162.
15. Smith, G., & Anderson, K.J. (2005). Students’ ratings of professors: The teaching style
contingency for latino/a professors. Journal of Latinos and Education, 4, 115-136.
16. Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for
truth or a witch hunt in student ratings of instruction? In M. Theall, P.C. Abrami & L.A.
Mets (Eds.), The student rating debate: Are they valid? How can we best use them? New
Directions for Institutional Research (Vol. 109, pp. 45-56). San Francisco: Jossey-Bass.