Exploring race and gender differentials in student ratings of instructors: Lessons from a diverse liberal arts college

Robert L. Moore, Hanna Song Spinosa, James D. Whitney
Occidental College
April 2014

Abstract: This paper explores differences in student ratings of instructors by the race and gender of the instructor and also by the race-gender composition of students in each class. Our dataset is the largest and most recent in the literature to date, consisting of 74,072 student course evaluations submitted for 4,297 undergraduate classes taught by 443 instructors over Academic Years 2006–2012 at Occidental College, as well as detailed information on the instructor and the students enrolled in each class. Our paper differs further from previous research spanning a variety of disciplines in several important ways: 1) we explicitly focus on the race as well as the gender of instructors; 2) we examine a college with relatively high levels of race and gender diversity among both students and faculty; 3) we analyze our data using the econometric techniques that distinguish the empirical approach of the economics discipline, controlling for many nondemographic factors that can affect student ratings; and 4) we supplement our core econometric methodology with an Oaxaca decomposition and with a subsample analysis of 440 multi-section courses taught contemporaneously by a single instructor.
Our main findings include the following: 1) overall class-average ratings differentials by instructor race and gender do not appear large enough in general to play a material role in Occidental College's tenure and promotion decisions; 2) for thorough analysis of potential race and gender ratings differentials, it is important to take into account not only the race and gender of class instructors but the demographic composition of the classes they teach as well; 3) several estimates of disaggregated student-instructor pairings (for example, white male student ratings of white female instructors, and so on) are sizeable, but only a few are statistically significant; and 4) credible and robust empirical results rest on a foundation of careful controls that include non-demographic factors that can affect student ratings of instructors, potential heterogeneity of respondents, and clustering of the data by class and instructor.

I. Introduction

Economists have explored a variety of cases in which race and gender differentials raise concerns about discriminatory outcomes, including wages and employment in labor markets, redlining in insurance markets, and fair housing in real estate markets. Similar concerns may arise in academic markets as well. Student ratings of instructors (SRIs) constitute potentially useful data to explore this issue. Abundant research in other disciplines utilizes SRIs to examine ratings differentials by gender, but almost none addresses ratings differentials by race. Only a small amount of similar research appears in the economics literature to date, and what exists is almost exclusively focused on ratings differentials by gender, not race. The perspectives of varying disciplines enrich our collective understanding of important socioeconomic issues, and the distinctive contribution that economists can make derives from the econometric tools that the discipline applies to empirical research.
That is the approach we take in this paper. At a practical level, tenure and promotion decisions depend in part on teaching effectiveness, and student ratings of instructors typically play an important role in the evaluation of teaching effectiveness. So, prospectively, ratings differentials by race and gender can matter for the career prospects of faculty. More broadly, student ratings of instructors may be able to help inform the ongoing social concern with the persistence of race and gender discrimination. Since SRIs are quantitative measures, they are amenable to statistical analysis. And since they are typically anonymous, they constitute a data source of responses that are likely to be relatively free of self-censorship. Drawn from a relatively young population, they might also help provide a leading indicator of the future direction of social attitudes. The relatively high level of race and gender diversity of both students and instructors at Occidental College allows for a correspondingly wide range of demographic configurations of instructors and class composition, which facilitates statistical estimation of ratings differentials by the race and gender of instructors in combination with the demographic make-up of the classes they teach. The extent to which race and gender differentials, even after controlling for non-demographic factors that can also affect outcomes, correspond to discrimination remains a challenging empirical issue. The differentials may reflect actual learning differences (Hativa, 2013), such as role-model effects, rather than discrimination per se. Or they may reflect student reaction to differential treatment by their instructors rather than bias on the students' own part. Nonetheless, the first step toward teasing out possible discrimination entails the estimation of the differentials themselves, and that is our aim here. The key conclusions that emerge from analysis of our particular dataset are mixed.
In only a few cases are the empirical estimates of race and gender ratings differentials statistically significant, although in several cases of disaggregated student-instructor pairings (for example, white male student ratings of white female instructors, and so on) these estimates are sizeable. Moreover, demographic enrollment patterns tend to dilute the impact of these disaggregated estimates on the overall class-average ratings of instructors. The end result is that class-average student ratings in our dataset do not differ enough by instructor race and gender to warrant systematic ratings adjustments for tenure and promotion decisions, but they do warrant a general attentiveness to particular teaching situations in which instructor and student demographics might matter. Our clearest findings are cautionary observations regarding the challenges that research into the issues addressed in this paper must surmount in order to generate empirical results that are credible, robust and statistically significant. We briefly review the literature most directly related to our own research in Section II. We describe our data and methodology in Section III, and our key empirical results in Section IV. We conclude in Section V by highlighting our most important findings and offering some suggestions for future research.

II. Literature review

There is a vast literature on student ratings of instructors and how such ratings might, or might not, relate to teaching effectiveness. Hativa (2013), in her recent book, Student Ratings of Instruction: Recognizing Effective Teaching, lists no fewer than 139 references related to this topic. Included in this list are well over 40 citations concerning whether or not gender biases student ratings of instructors (SRIs).
On the other hand, we found very few published studies on differences in student ratings of instructors by instructor race/ethnicity, an issue that our data and research methods allow us to explore in significant detail. Indeed, this issue is not even mentioned in the chapter of Hativa's book that explores how various factors beyond the instructor's control can affect student ratings of instructors, such as class size, academic discipline, gender of the instructor, etc. Our examination of the 139 references cited in Hativa revealed that only one (Hamermesh and Parker, 2005) was even tangentially related to the issue of student ratings of white versus minority instructors. Our own literature search uncovered only five published studies related to the issue of student ratings differentials by race of the instructor: Hamermesh and Parker (2005) (the most closely related prior study for our own work on this topic), Smith (2007), Smith and Hawkins (2011), Anderson and Smith (2005), and Smith and Anderson (2005), with only the first three using empirical evidence from actual courses.1 Two of these three studies, Smith (2007) and Smith and Hawkins (2011), were quite similar to each other in that they appear to use a very similar data set of student evaluations from the College of Education at a research institution in the Southern U.S. Both compared the average student ratings on two "global" items (overall value of course and overall teaching ability) for White, Black, and "Other" instructors, as well as the average ratings on 26 multidimensional items which address specific topics or a single aspect of instruction. One of the studies was based on 13,702 undergraduate student evaluation forms over a three-year period for 190 tenure-track faculty, 83% of whom were classified as White, 12% as Black, and 5% as "Other" (Asian, Latino, or Native American).
The other study included a student sample about double that size, included graduate courses as well over a three-year period, 2001-2004, and apparently covered the same group of 190 faculty. For both studies, the authors concluded that "Black faculty received lower average ratings than White faculty and faculty identified as 'Other' for both the multidimensional and the global items." The average ratings for the multidimensional items were closer for Black and White faculty than were those for the global items. More specifically, in the study of undergraduates alone (Smith and Hawkins, 2011), the average rating for overall teaching ability on a five-point descending scale was 4.08 for White faculty, 3.44 for Black faculty, and 4.22 for Other racial groups. In the study that included both undergraduate and graduate classes (Smith, 2007), the means were 4.25 for White faculty, 3.65 for Black faculty, and 4.16 for faculty from Other racial groups. The authors indicated that these differences in mean ratings were statistically significant at the 5% level for Black vs. White faculty. The authors argued that "the lower student ratings on the global items…were especially troublesome because these ratings have the power to affect faculty merit increases and careers" (Smith and Hawkins, 2011: 159).

1 The other two studies examined the influence of professor and student characteristics on students' perceptions of college professors based on a hypothetical syllabus for a social science course on "Race, Gender, and Inequality." The syllabus was constructed to vary by teaching style, professor ethnicity and professor gender and used fictional names to represent Latino versus White professors, e.g., Lopez vs. Saunders. Students of different ethnicities and genders were then asked to rate the "instructors" of these hypothetical courses based on the syllabus.
However, neither of these studies included (1) control variables for factors other than race of instructor that might influence student ratings or (2) statistical adjustments that apply to observations drawn from recurring survey data. Hamermesh and Parker (2005) addressed both of these issues in their study, incorporating instructor race into a multivariate regression analysis framework. The dataset covered 463 courses taught by 94 instructors at the University of Texas at Austin during the 2000-2002 academic years, with class-average ratings of instructors drawn from over 16,000 total student evaluations. Only about 10% (numbering 9 or 10) of these instructors were classified as "minority." Other than minority status, control variables for teacher characteristics included the instructor's gender, whether they were on tenure track, and whether they were educated in a non-English speaking country, in addition to the key instructor variable of interest in their particular study, namely the instructor's composite beauty rating by students. Control variables for course characteristics included class size and whether the course was upper or lower division. The regression results indicated that, holding everything else the same, minority faculty were rated lower than white faculty at the University of Texas at Austin over this period of time, on the order of 0.25 points on a five-point scale (amounting to about 0.5 standard deviations).

As for the previous research on gender and student ratings of instructors, Hativa first summarizes a meta-analysis of approximately 36 studies conducted by Feldman (1993) by indicating that the majority of studies reported no significant differences in student ratings of instructors by gender of instructor.
She also notes that "most other reviews of studies of gender-SRI relationships have also concluded that these ratings have no strong or regular pattern of gender-based bias (Algozzine et al., 2004; Arreola, 2000; Cashin, 1995; Feldman, 1992; Gravestock & Gregor-Greenleaf, 2008; Theall & Franklin, 2001)" (Hativa, 2013: 81). Many of these studies use small samples and have few controls for other factors that have been shown to affect student ratings. An exception is that of Centra and Gaubatz (2000), which utilizes data from the Student Instructional Report II developed by the Educational Testing Service (ETS), covering 741 classes in eight major discipline groups across about 20 different institutions that use this Student Instructional Report. The dataset is one of the few that includes the gender of the student who completes each evaluation form. The authors apply multivariate analysis of variance and conclude in their article, "Is There Gender Bias in Student Evaluations of Teaching?":

The results reflect some same gender preferences, particularly in female students rating female teachers. But the differences in ratings, though statistically significant, are not large and should not make much difference in personnel decisions. Moreover the higher evaluations received by female teachers from females, and in some instances from males as well (Natural Sciences in particular), could well be due to differences in teaching styles. Women in this study were more likely than men to use discussion rather than a lecture method, and as a group they appear to be a little more nurturing to students, as also reflected in certain scales in this study (p. 32).

Hamermesh and Parker (2005) included control variables for instructor gender as well as race in their multivariate regression analysis.
They found that, holding the other variables constant, female faculty were rated lower than males by approximately 0.24 ratings points (about 0.5 standard deviations), a statistically significant gap at the 5% level. The authors noted that this result departed from the consensus in the literature on this question, i.e., that there is no statistically significant relationship between instructor gender and student ratings of instructors.

Our own search of the literature uncovered two other articles by economists, Anderson and Siegfried (1997) and Saunders and Saunders (1999), that examined samples of Principles of Economics courses and focused on the interaction between student ratings of instructors and the gender of instructors and students. Both studies applied econometric analysis and utilized the Test of Understanding of College Economics (TUCE III) data set. Saunders and Saunders (1999) also examined data from Indiana University Principles of Economics sections taught by associate instructors. Both data sets include some instructor characteristics in addition to gender, demographic and other information for the individual students completing the ratings, and exam-based measures of student learning. The authors find statistically significant evidence of same-gender preference only for the Indiana University data set for Principles of Microeconomics courses. The finding is not consistent over time and does not emerge from their analysis of Indiana University Macroeconomics courses or from the TUCE III samples of 20 micro and 19 macro classes. Anderson and Siegfried (1997), using 1990 TUCE III data for 87 Principles of Macroeconomics and 80 Principles of Microeconomics classes at 53 institutions, conclude that when compared to student learning, the evidence we summarize from student ratings…reveals no evidence of student bias against female instructors.
If anything, there is some evidence that in micro students rate female instructors higher than male instructors while learning similar amounts from each and, in macro, students rate male and female instructors similarly in spite of learning less in the classes [taught by] women (pp. 355-6).

III. Data and Methodology

Our dataset consists of all student evaluations that include an overall student rating of instructor submitted for Occidental College full-credit classes (counting for 4 or more units) with enrollments above 5 students during the seven academic years from 2006 to 2012. The dataset totals 74,072 evaluations submitted for 4,297 classes taught by 443 instructors. Students fill out the individual course evaluations anonymously, so the form lacks information regarding the race and gender of individual respondents. However, information from the College's Registrar's Office enabled us to calculate the overall race and gender composition of the students enrolled in each class. For each instructor, we added information provided by Occidental's Office of Human Resources regarding their race and gender, whether they were on regular (tenured/tenure-track) appointment or were part-time or full-time adjuncts, and their years of experience at Occidental. We feel Occidental College is particularly well suited for a case study of race and gender differentials in student ratings of instructors because of its relatively high level of diversity for both faculty and students. Table III.1 compares Occidental to other national liberal arts colleges in terms of the race and gender composition of its full-time faculty and students.
Based on Herfindahl indexes constructed from each college's instructor race-gender employment shares and student race-gender enrollment shares, Occidental ranks 14th for faculty diversity and 8th for student diversity among US News national liberal arts colleges.2 Occidental's comparatively high diversity in turn generates a comparatively wide range of demographic variation across the classes in our dataset.

Table III.1: Race and gender composition of full-time faculty and students: Occidental College and average values for US News national liberal arts colleges, Academic Year 2009-10

Percentage composition of full-time faculty:
                                      Occidental College        US News average           Occidental rank
                                      Male   Female   Total     Male   Female   Total     Male  Female  Total
American Indian / Alaska native        0.0%    0.0%    0.0%      0.1%    0.1%    0.2%       42      42     70
Asian / Hawaiian / Pacific Islander    4.1%    7.8%   11.9%      2.4%    2.4%    4.8%       37       9     13
Black / African American               4.6%    2.0%    6.6%      3.0%    3.4%    6.4%       26      83     35
Hispanic / Latino                      4.3%    7.4%   11.7%      1.3%    1.5%    2.8%       12       4      4
White non-Hispanic                    39.7%   30.1%   69.8%     48.3%   37.6%   85.9%      212     215    239
Total                                 52.7%   47.3%  100.0%     55.1%   45.0%  100.1%      169      91      -

Percentage composition of full-time students:
American Indian / Alaska native        0.6%    0.6%    1.2%      0.3%    0.4%    0.7%       27      37     32
Asian / Hawaiian / Pacific Islander    6.9%    9.8%   16.7%      1.7%    2.8%    4.5%        7      12     10
Black / African American               2.7%    3.2%    5.9%      4.8%    7.5%   12.3%      114     108    124
Hispanic / Latino                      5.8%    8.1%   13.9%      2.1%    3.1%    5.3%       19      22     15
White non-Hispanic                    27.9%   34.3%   62.3%     33.2%   44.0%   77.1%      205     217    228
Total                                 43.9%   56.1%  100.0%     42.1%   57.9%  100.0%      133     131      -

US News sample: 259 national liberal arts colleges with instructor data and 263 with student data.

2 The Herfindahl indexes are calculated as the sum of the squares of the individual race-gender shares of employment for full-time faculty and of enrollment for full-time students. The data come from the IPEDS Data Center of the National Center for Education Statistics <http://nces.ed.gov/ipeds/datacenter/>.
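The index construction in footnote 2 is simple enough to sketch directly. The snippet below (ours, not the authors' code) computes a Herfindahl index from the Occidental faculty shares in Table III.1; lower values indicate greater diversity:

```python
# Herfindahl index as defined in footnote 2: the sum of squared race-gender
# shares (after normalizing to fractions). Lower values mean greater diversity.
def herfindahl(shares_pct):
    total = sum(shares_pct)
    return sum((s / total) ** 2 for s in shares_pct)

# Occidental full-time faculty shares from Table III.1 (%), by race,
# male column then female column:
oxy_faculty = [0.0, 4.1, 4.6, 4.3, 39.7,   # Am. Indian, Asian/PI, Black, Latino, White
               0.0, 7.8, 2.0, 7.4, 30.1]
print(round(herfindahl(oxy_faculty), 3))   # 0.266
```

A perfectly even split across k race-gender cells would yield 1/k, so ranking colleges by ascending index rewards evenly spread composition.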
In terms of overall methodology, like Smith (2007) and Smith and Hawkins (2011), we focus explicitly on how student ratings of instructors vary by race of instructor, and we extend the focus to gender as well. Like Centra and Gaubatz (2000), we take into account the demographics of students as well as instructors. And, like Hamermesh and Parker (2005), whose methodology most closely matches the approach we take, we control for factors other than demographics that might account for differences in student ratings of instructors. Later in our paper, we borrow from the labor economics literature on the sources of earnings differentials by race and gender (Oaxaca, 1973) by undertaking Oaxaca decompositions of student ratings differentials to help further explore our main results.

The basic structure of the regression equations we estimate has the form

(1) Qn = α + βXn + γZn + εn.

The subscript n denotes a sample observation, which by default is a class, as in Hamermesh and Parker (2005), but in some specified cases is an individual student evaluation. The dependent variable Q is the student rating of instructor (SRI), either the class average or an individual student rating as appropriate. More specifically, the SRI corresponds to the student rating on a seven-point descending scale in response to a course evaluation statement that reads, "Overall, the instruction for this course was excellent." The right-hand side expression βXn denotes a vector summation of demographic variables for each observation (Xn) multiplied by their corresponding estimated coefficients (β). An analogous interpretation applies to the term γZn when non-demographic control variables (Zn) are included in the equation. α denotes a constant and εn a random error term for observation n. In all of the equations we estimate, our key focus is on the variables X that directly relate to race-gender ratings differentials.
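To make the construction of the demographic terms βXn concrete, here is a minimal sketch (group labels and variable names are ours, not the authors' code): five instructor race-gender dummies against a White Male benchmark, and, when student demographics enter, interactions of each instructor dummy with student-group enrollment shares, dropping the White-Male-student by White-Male-instructor pairing:

```python
# Hypothetical sketch of the demographic regressors X_n in equation (1).
import itertools

GROUPS = ["WM", "WF", "UM", "UF", "OM", "OF"]   # race-gender cells

def instructor_dummies(instr_group):
    """Five dummies; the White Male instructor cell is the omitted benchmark."""
    return {f"D_{g}i": float(instr_group == g) for g in GROUPS if g != "WM"}

def interaction_terms(instr_group, pct_students):
    """35 interactions of instructor dummies with student enrollment shares,
    omitting the White-Male-student x White-Male-instructor pairing."""
    terms = {}
    for gi, gs in itertools.product(GROUPS, GROUPS):
        if (gi, gs) == ("WM", "WM"):
            continue                    # excluded benchmark pairing
        terms[f"{gs}s_x_{gi}i"] = float(instr_group == gi) * pct_students[gs]
    return terms

# one illustrative class taught by a White Female instructor:
pct = {"WM": 0.25, "WF": 0.35, "UM": 0.08, "UF": 0.12, "OM": 0.08, "OF": 0.12}
row = interaction_terms("WF", pct)
print(len(instructor_dummies("WF")), len(row))   # 5 35
```

Only the terms tied to the class's own instructor group are nonzero, which is what lets a single equation estimate all thirty-five pairings at once.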
We aggregate race into three categories: White (W) for white Caucasians, Underrepresented minority (U) for African-Americans, Latinos and Native Americans, and Other (O) for those classified as Asian-Americans, Other or (race) Not Reported (Asian-Americans constitute 78% of this category). We denote Σ_RG = Σ_(R=W,U,O) Σ_(G=M,F) to represent the summation over the alternative demographic combinations of Race (R) and Gender (G). For equations that incorporate demographic information for instructors but not students, we estimate five coefficients for instructor (i) race and gender dummy variables D_RGi measured against the excluded benchmark case of White Male instructors, so βXn takes the form Σ_RGi β_RGi·(D_RGi)n. When we take into account student demographics, we do so for each class by interacting the instructor's demographic dummy variable with the percentage enrollment in the instructor's class of each student (s) demographic group (Pct_RGs)n, so βXn then becomes a thirty-five-term expression, Σ_RGi Σ_RGs β_RGs,RGi·(D_RGi)n·(Pct_RGs)n, excluding for estimation purposes the thirty-sixth possible pairing of White Male students and White Male instructors.

IV. Results

IV.A. Student Rating of Instructor (SRI) differentials by instructor race and gender

As reported in Table IV.1, the pattern of average student ratings in the Occidental College dataset dovetails with previous findings: relative to White Male instructors, other race-gender instructor groups receive lower student ratings.

Table IV.1: Dependent variable: Class-average Student Rating of Instructor (SRI)

Unit of observation:         Individual evaluation    Class average
Estimation method:           OLS                      WLS (by evaluation count)
Standard error adjustment:   None                     443 clusters by instructor
Number of observations:      74,072                   4,297

Independent variables                                  Coef.   Pr(B=0)    Coef.   Pr(B=0)
Constant (White Male instructor (WMi))                 5.998    0.000     5.998    0.000
Differential versus WMi for:
  White Female instructor (WFi)                       -0.100    0.000    -0.100    0.294
  Underrepresented minority Male instructor (UMi)     -0.044    0.019    -0.044    0.747
  Underrepresented minority Female instructor (UFi)   -0.085    0.000    -0.085    0.560
  Other Male instructor (OMi)                         -0.330    0.000    -0.330    0.009
  Other Female instructor (OFi)                       -0.186    0.000    -0.186    0.268
Note: Total sample class-average SRI, weighted by number of respondents: mean = 5.924, standard deviation = 0.837

However, most of the ratings differentials are comparatively small in our sample. The differential for White Female instructors is -0.10 ratings points, amounting to -0.12 standard deviations of the respondent-weighted class-average SRI. The differentials are even smaller for Underrepresented minority faculty. Only for Other Male (primarily Asian) instructors does the differential (-0.33 ratings points, -0.39 standard deviations) approach the neighborhood of the effect sizes reported by Smith (2007), Smith and Hawkins (2011), Centra and Gaubatz (2000) and Hamermesh and Parker (2005). Moreover, nearly all of the estimated differentials become statistically insignificant after adjusting the error terms for clustering of the data. Our sample consists of over 74,000 individual student course evaluations, but they are not independent observations. As survey data drawn from 4,297 classes taught by 443 instructors, it is crucial to adjust the standard errors for the resulting multilevel clustering of the data. The two right-hand columns of Table IV.1 report the adjusted results, achieved by first collapsing the data to 4,297 observations of respondent-weighted class-average ratings and then clustering the class averages by instructor. Only the average differential for Other Male instructors remains statistically significant.
So overall, the starting point for our empirical analysis is a dataset with smaller and predominantly statistically insignificant race-gender ratings differentials compared to the findings reported in past studies.

IV.B. Estimating SRI differentials when student demographics vary across classes

One reason for the small ratings differentials in our data could be that course enrollments by a diverse student body might adhere to a pattern that offsets and thereby masks larger underlying differentials, making it important to consider instructor and student demographics jointly. Saunders and Saunders (1999: 467), for example, found in their TUCE III dataset that students were overrepresented by gender in Principles of Economics classes taught by same-gender instructors. The students in our data are consistently overrepresented in courses taught by instructors of their own race and gender. Panel A of Table IV.2 reports the sample frequency of average course enrollments disaggregated by the race and gender of both instructors and students. The overall frequencies are recalibrated in Panel B to show the average race-gender composition of classes taught by each faculty demographic group and to facilitate comparisons to the overall average class composition, reported in the bottom row of the panel. Own-group pairings appear along the diagonal, so, for example, the second-row, second-column diagonal entry reports that White Female students (WFs) on average account for 34.7% of the enrollments in classes taught by White Female instructors (WFi), compared to 30.8% of the enrollments in all classes. In every case, students are overrepresented in classes taught by instructors that match their own race and gender.
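The Panel A to Panel B recalibration is just a row normalization: divide each instructor group's enrollment frequencies by that group's share of total enrollments. A sketch with the first two rows of Table IV.2 (the published inputs are rounded, so results can differ from Panel B by about 0.1 percentage point):

```python
# Sketch of the Panel A -> Panel B recalibration: divide each instructor
# group's row of enrollment frequencies by its row total (the group's share
# of all enrollments), giving that group's average class composition.
import pandas as pd

students = ["WMs", "WFs", "UMs", "UFs", "OMs", "OFs"]
panel_a = pd.DataFrame(
    [[11.7, 12.0, 3.3, 4.2, 3.8, 4.8],    # WMi row of Table IV.2, Panel A (%)
     [ 7.0,  9.9, 2.4, 3.6, 2.1, 3.6]],   # WFi row
    index=["WMi", "WFi"], columns=students)

panel_b = panel_a.div(panel_a.sum(axis=1), axis=0) * 100
print(panel_b.round(1))   # e.g. WFi x WFs ~ 34.6%, vs 34.7% published
```

Each recalibrated row sums to 100% by construction, so rows are directly comparable to the overall average class composition.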
Among the cross-group (off-diagonal) cases of overrepresented enrollments, Underrepresented minority students of both genders (UMs and UFs) are overrepresented in classes taught by Underrepresented minority instructors (UMi and UFi), and the same is true for Other (primarily Asian) students as well. Five of the six remaining cases of overrepresentation are gender-matched: both White Male (WMs) and Underrepresented minority Male students enrolled in classes taught by Other Male instructors (OMi), Other Male students (OMs) in classes taught by White Male instructors (WMi), and nonwhite female students (UFs and OFs) in classes taught by White Female instructors. Only one of the thirty-six group pairings constitutes overrepresentation that is unmatched for both race and gender: Underrepresented minority Male students enrolled in classes taught by Other Female instructors (OFi).

Table IV.2: Average class composition of students by gender and ethnicity, overall and by instructor gender-ethnicity subgroup

Panel A: Percentage distribution of course enrollments, total sample
                           Student subgroup                               Instructor share
Instructor subgroup     WMs     WFs     UMs     UFs     OMs     OFs      of enrollments
WMi                    11.7%   12.0%    3.3%    4.2%    3.8%    4.8%         39.7%
WFi                     7.0%    9.9%    2.4%    3.6%    2.1%    3.6%         28.5%
UMi                     2.0%    2.4%    0.9%    1.4%    0.7%    1.0%          8.5%
UFi                     2.2%    2.9%    1.0%    1.8%    0.6%    1.0%          9.5%
OMi                     1.5%    1.4%    0.5%    0.6%    0.7%    0.9%          5.6%
OFi                     2.1%    2.2%    0.7%    0.9%    1.1%    1.3%          8.3%
Total % of enrollments 26.5%   30.8%    8.8%   12.4%    8.9%   12.5%

Panel B: Average class race-gender composition by instructor subgroup
                           Student subgroup
Instructor subgroup     WMs     WFs     UMs     UFs     OMs     OFs      Total
WMi                    29.5%   30.3%    8.3%   10.5%    9.5%   12.0%    100.0%
WFi                    24.4%   34.7%    8.3%   12.5%    7.5%   12.7%    100.0%
UMi                    23.8%   28.5%   10.6%   17.0%    8.3%   11.8%    100.0%
UFi                    23.4%   30.5%   10.6%   19.0%    5.8%   10.7%    100.0%
OMi                    27.0%   25.1%    9.2%   10.3%   13.0%   15.4%    100.0%
OFi                    25.2%   26.8%    8.9%   10.7%   13.0%   15.5%    100.0%
Average % of
enrollments            26.5%   30.8%    8.8%   12.4%    8.9%   12.5%

To explore the relationship between SRI
differentials and class composition by race and gender, we incorporated student race-gender class enrollment data into our regression equation. We then used the regression results to calculate estimated class-average SRIs for each of the thirty-six possible race-gender pairings of instructors and students in order to make comparisons of differentials, as reported in Table IV.3.

Table IV.3: Estimated average student rating of instructors (SRI) by race-gender subgroups of instructors and students

Panel A: Each cell reports each subgroup's estimated class-average SRI differential versus the overall sample average, with the instructor subgroup's enrollment-weighted* average differential in the right-hand column.

                             Student subgroup                                        Enrollment
Instructor subgroup      WMs      WFs      UMs      UFs      OMs      OFs      wtd. avg.*
WMi (39.7%**)           0.418   -0.035    0.313   -0.053   -0.615   -0.007      0.074
                       (0.019)  (0.851)  (0.337)  (0.891)  (0.122)  (0.982)    (0.120)
WFi (28.5%**)          -0.561    0.282    0.380   -0.184   -0.219    0.168     -0.026
                       (0.017)  (0.033)  (0.391)  (0.634)  (0.620)  (0.482)    (0.667)
UMi (8.5%**)           -0.469   -0.375    0.923    0.745    0.100    0.131      0.030
                       (0.265)  (0.319)  (0.008)  (0.003)  (0.934)  (0.710)    (0.791)
UFi (9.5%**)           -0.337    0.021    0.184    0.606    0.430   -0.927     -0.011
                       (0.365)  (0.938)  (0.764)  (0.142)  (0.521)  (0.150)    (0.926)
OMi (5.6%**)            0.138   -0.299   -1.014   -1.182    0.305   -0.284     -0.256
                       (0.749)  (0.541)  (0.110)  (0.003)  (0.446)  (0.608)    (0.012)
OFi (8.3%**)           -0.166    0.253   -0.003   -0.355    0.200   -0.813     -0.112
                       (0.749)  (0.409)  (0.997)  (0.469)  (0.624)  (0.113)    (0.446)
Instructor-subgroup-weighted** average differential of each student subgroup from the overall sample average:
                       -0.072    0.041    0.272   -0.048   -0.224   -0.114

Panel B: The diagonal (own-group) entries from Panel A serve as column benchmarks, and the other cells in each column report differentials versus the column benchmark.

                             Student subgroup
Instructor subgroup      WMs      WFs      UMs      UFs      OMs      OFs
WMi                     0.418   -0.316   -0.610   -0.659   -0.920    0.806
                       (0.019)  (0.170)  (0.195)  (0.247)  (0.104)  (0.181)
WFi                    -0.979    0.282   -0.543   -0.790   -0.523    0.981
                       (0.001)  (0.033)  (0.338)  (0.166)  (0.376)  (0.082)
UMi                    -0.888   -0.656    0.923    0.139   -0.204    0.944
                       (0.053)  (0.099)  (0.008)  (0.772)  (0.874)  (0.128)
UFi                    -0.755   -0.260   -0.739    0.606    0.126   -0.114
                       (0.067)  (0.387)  (0.292)  (0.142)  (0.872)  (0.890)
OMi                    -0.280   -0.581   -1.937   -1.788    0.305    0.529
                       (0.548)  (0.251)  (0.007)  (0.002)  (0.446)  (0.482)
OFi                    -0.585   -0.028   -0.926   -0.961   -0.104   -0.813
                       (0.292)  (0.932)  (0.352)  (0.132)  (0.855)  (0.113)
Instructor-subgroup-weighted** average differential of each student subgroup from its respective benchmark:
                       -0.813   -0.337   -0.712   -0.722   -0.560    0.761

*Derived from the enrollment data in Table IV.2, Panel A.
**Weights: each instructor subgroup's share of total enrolled students in the dataset.
Least squares estimation of 4,297 observations of class-average ratings weighted by number of respondents and clustered by 443 instructors. Values in parentheses correspond to Pr(Ho). Ho for Panel A and the diagonals of Panel B: deviation from average = 0. Ho for Panel B off-diagonal estimates: deviation from column benchmark = 0. Note: Overall sample class-average SRI, weighted by number of respondents: mean = 5.924, standard deviation = 0.837.

Panel A of Table IV.3 reports estimated class-average differentials for each student-instructor pairing relative to the overall sample average SRI. The diagonal entries report own-group race-gender pairings. For example, the first diagonal entry indicates that White Male students rate White Male instructors 0.418 ratings points higher than the full-sample average rating of 5.924. Except for Other Females, all of the estimated own-group differentials are positive and sizeable, averaging 0.51 ratings points (0.61 standard deviations). The estimates are statistically significant at probabilities under five percent for Whites (WM and WF) and for Underrepresented minority Males (UM). Ratings differentials relative to the overall sample mean are driven by potential ratings differentials both between student groups and within them. Some race-gender student groups may "grade harder" than others across the board.
That can certainly affect average instructor SRIs when enrollment patterns vary by instructor group, but it can also coexist with an absence of race-gender ratings differentials when measured within the context of the "grading standards" that different student groups might apply. The bottom row of data in Panel A of Table IV.3 reports the fixed-weight average SRI differentials for each student group, with the dataset share of total course enrollments for each instructor group serving as weights (the enrollment weights reported in the table). The average differentials across the student groups span 0.50 ratings points (0.59 standard deviations), large enough to play an important role in the estimated disaggregated ratings. Panel B of Table IV.3 carries over the diagonal entries from Panel A but treats each one as a benchmark differential for its respective student group: specifically, a benchmark corresponding to how highly the students in that group rate instructors of their own race and gender. The off-diagonal entries in each column are re-calibrated as SRI differentials relative to the column benchmark, so they provide estimated averages of how each student group rates instructors of a differing race and gender (cross-group benchmarked differentials) compared to how they rate instructors of their own race and gender. For example, the -0.316 entry at the top of the second column constitutes an estimate of how much lower on average White Female students rate White Male instructors compared to White Female instructors. In short, the off-diagonal entries measure estimated ratings differentials within each student demographic group. Other Female students remain an outlier in Panel B, but a consistent pattern spans the other student groups.
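The re-benchmarking that produces Panel B from Panel A is simple arithmetic. The following sketch reproduces it from the Panel A point estimates published in Table IV.3; note that the printed table repeats each benchmark on the diagonal, whereas here the diagonal nets to zero by construction, and small discrepancies versus printed entries (e.g. -0.317 versus -0.316) reflect rounding of the published inputs:

```python
# Panel A point estimates from Table IV.3: class-average SRI differentials
# versus the overall sample mean. Rows are instructor subgroups; columns are
# student subgroups in the order WMs, WFs, UMs, UFs, OMs, OFs.
panel_a = {
    "WMi": [0.418, -0.035, 0.313, -0.053, -0.615, -0.007],
    "WFi": [-0.561, 0.282, 0.380, -0.184, -0.219, 0.168],
    "UMi": [-0.469, -0.375, 0.923, 0.745, 0.100, 0.131],
    "UFi": [-0.337, 0.021, 0.184, 0.606, 0.430, -0.927],
    "OMi": [0.138, -0.299, -1.014, -1.182, 0.305, -0.284],
    "OFi": [-0.166, 0.253, -0.003, -0.355, 0.200, -0.813],
}
instructors = list(panel_a)

# Each student column's benchmark is that group's own-group (diagonal) entry.
bench = [panel_a[i][k] for k, i in enumerate(instructors)]

# Panel B: subtract the column benchmark from every entry in the column.
panel_b = {
    i: [round(v - b, 3) for v, b in zip(row, bench)]
    for i, row in panel_a.items()
}
# e.g. panel_b["WFi"][0] = -0.561 - 0.418 = -0.979: White Male students'
# rating of White Female instructors relative to their own-group benchmark.
```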
Estimated cross-group ratings are lower than own-group ratings in 23 of 25 cases; the two exceptions are positive differentials for Underrepresented minority Male instructors by Underrepresented minority Female students and for Underrepresented minority Female instructors by Other Male students. The sizes of the estimated differentials are typically large compared to previously reported findings, averaging 0.63 ratings points (0.75 standard deviations), although only a few of the estimates are statistically significant, namely, lower ratings of White Female instructors by White Male students and lower ratings of Other Male instructors by Underrepresented minority Male and Female students. "Large but insignificant" may be the oxymoron that best summarizes the estimates reported in Table IV.3, but the results nonetheless highlight the importance of accounting for both instructor and student demographics when estimating race-gender differentials in student ratings of instructors for samples in which race and gender diversity characterizes both students and instructors. For expository purposes, it is convenient to refer to group-specific differentials as if they correspond to the student ratings per se, but such a connection is unavoidably speculative here, since our dataset does not include the actual race-gender identity of individual respondents. The empirical outlier group of Other Female students, with its large negative differential (-0.81 ratings points) for ratings of Other Female instructors and higher ratings for all other instructor demographic groups, provides an illustrative case in point. Based on our data and findings, we cannot rule out the possibility that Other Female students rate instructors in this fashion, but neither can we confirm it, and it appears inconsistent with the observed pattern of overrepresentation of Other Female students in classes taught by Other Female instructors.
Our results in fact indicate only that the classes taught by Other Female instructors in our dataset are associated with lower ratings as the proportion of Other Female students in those classes rises.

IV.C. Controlling for non-demographic factors that are likely to be associated with SRIs

Although our dataset does not include the race and gender of individual student respondents, it does include respondent-specific information about many other non-demographic factors that are likely to be associated with student ratings of instructors, which gives us the opportunity to reduce the risk of omitted variable bias. Our preferred specification incorporates these other factors, similar to the approach taken by Hamermesh and Parker (2005). In their study of the relationship between student ratings and the perceived beauty of instructors, Hamermesh and Parker (2005) included two additional dummy control variables related to classes (Lower division, One-credit course) and four related to instructors (Female, Minority, Non-native English speaker, Tenure track). We incorporated a similar but more extensive set of class and instructor variables, listed in Table IV.4.
Table IV.4: Descriptive data for non-demographic control variables

Variable (sample frequency):
Course level: 100-level (dummy) 41%; 200-300 level (excluded) 56%; 400-level (dummy) 3%
Course classification: Cultural Studies Program (dummy) 8%; Arts and Humanities (dummy) 35%; Science (dummy) 26%; Social Sciences (excluded) 30%
Seminar course (dummy): 14%
Classification of instructor: Part-time adjunct (dummy) 17%; Full-time adjunct (dummy) 21%; Tenure/tenure track (excluded) 62%
New offering by instructor (dummy): 18%

Variable (sample average):
Instructor years of experience at Occidental: 11.3
Years of experience if greater than six: 18.8
Total class enrollment: 23.0
Average student seniority (1 = Frosh, 4 = Senior): 2.4
Percent graduate student enrollment: 0.4%
Percent enrolled for Core requirement: 30%
Percent enrolled for Major requirement: 43%
Evaluation response rate: 90%
Average grade awarded (A = 4.00): 3.29
Average expected grade (A = 4.00): 3.41

Following Hamermesh and Parker (2005), we first estimated the ratings equation with the instructor race-gender dummy variables but not the variables for student demographics. Table IV.5 reports the results. Among the non-demographic variables, the most sizeable and statistically significant include positive estimated coefficients for the response rate and the expected grade, and negative estimated coefficients for science courses, courses taught by adjunct instructors, new course offerings, and the percentage of students enrolled to fulfill Core or major-field requirements. The estimated race-gender differentials versus White Males average -0.26 standard deviations, 45% larger than the estimates without controls reported in Table IV.1, and the revised estimates are universally larger and more statistically significant. However, the average size effect measured in standard deviations is still only about half as large as Hamermesh and Parker (2005) found for their sample.
And, when clustering specifically by instructor, as reported in the middle two columns of Table IV.5, the estimated differentials for our sample are statistically significant only for Other instructors.

Table IV.5: Dependent variable: Class-average overall student rating of instruction (4,297 observations)

Specifications, left to right: (1) instructor demographics only, robust standard errors; (2) instructor demographics only, standard errors clustered by instructor (443 clusters); (3) instructor plus student demographics, clustered by instructor (443 clusters). Each cell reports Coef. (Pr(Ho)).

Independent variable                        (1) Robust        (2) Clustered     (3) + Student demog.
Constant                                   3.579 (0.000)     3.579 (0.000)     3.738 (0.000)
100-level course                          -0.082 (0.035)    -0.082 (0.235)    -0.094 (0.163)
400-level course                          -0.103 (0.160)    -0.103 (0.272)    -0.098 (0.280)
Cultural Studies Program course           -0.116 (0.193)    -0.116 (0.491)    -0.107 (0.509)
Arts and Humanities course                 0.060 (0.094)     0.060 (0.537)     0.077 (0.421)
Science course                            -0.223 (0.000)    -0.223 (0.040)    -0.219 (0.046)
Seminar course                             0.122 (0.009)     0.122 (0.084)     0.125 (0.066)
Part-time adjunct instructor              -0.357 (0.000)    -0.357 (0.000)    -0.355 (0.000)
Full-time adjunct instructor              -0.201 (0.000)    -0.201 (0.041)    -0.216 (0.028)
Years of experience at Occidental          0.018 (0.043)     0.018 (0.310)     0.016 (0.371)
Years of experience if greater than six   -0.022 (0.006)    -0.022 (0.148)    -0.020 (0.184)
New offering by instructor                -0.221 (0.000)    -0.221 (0.004)    -0.218 (0.003)
Total enrollment                           0.004 (0.041)     0.004 (0.285)     0.004 (0.219)
Average student seniority                  0.040 (0.144)     0.040 (0.350)     0.035 (0.396)
Percent graduate student enrollment       -0.210 (0.440)    -0.210 (0.601)    -0.218 (0.604)
Percent enrolled for Core requirement     -0.430 (0.000)    -0.430 (0.005)    -0.425 (0.004)
Percent enrolled for Major requirement    -0.279 (0.000)    -0.279 (0.005)    -0.276 (0.005)
Evaluation response rate                   0.787 (0.000)     0.787 (0.000)     0.770 (0.000)
Average grade awarded                      0.012 (0.852)     0.012 (0.899)     0.022 (0.817)
Average expected grade                     0.588 (0.000)     0.588 (0.000)     0.590 (0.000)
Differential versus WMi for:
  White Female instructor (WFi)           -0.107 (0.000)    -0.107 (0.192)    (race-gender
  Underrepresented minority Male (UMi)    -0.179 (0.000)    -0.179 (0.122)    differentials
  Underrepresented minority Female (UFi)  -0.167 (0.000)    -0.167 (0.186)    reported in
  Other Male instructor (OMi)             -0.349 (0.000)    -0.349 (0.005)    Table IV.6)
  Other Female instructor (OFi)           -0.284 (0.000)    -0.284 (0.048)
Note: Overall class-average rating, weighted by number of respondents: mean = 5.924, standard deviation = 0.837.

Replacing the instructor race-gender dummy variables with the full set of student demographic variables has virtually no effect on the estimated sizes and significance levels of the non-demographic control variables, reported in the last column of Table IV.5. The estimated race-gender differentials with the non-demographic control variables included are reported in Table IV.6. Again excluding the outlier group of Other Female students and focusing first on Panel A, the most striking effect is a sharp reduction in the sizes and statistical significance of the diagonal entries that correspond to the own-group differentials versus the overall sample average SRI. These estimated differentials now average only 0.24 ratings points (0.28 standard deviations), compared to 0.51 ratings points when non-demographic variables were excluded, as reported in Table IV.3. None of the revised own-group estimates is statistically significant at probabilities under five percent, compared to three such instances before. For the off-diagonal cross-group estimates, the average of the predominantly negative estimated differentials likewise shows a smaller deviation from the overall mean SRI, rising 0.04 ratings points from -0.12 to -0.08 with the addition of the non-demographic control variables.
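Operationally, the class-average regressions reported in Table IV.5 are weighted least squares, with weights equal to the number of respondents and standard errors clustered by instructor. A minimal numpy sketch of that estimator is below, assuming hypothetical arrays `y` (class-average SRI), `X` (regressors including a constant), `w` (respondent counts), and `groups` (instructor identifiers); this is an illustrative CR0 sandwich estimator, not the authors' code:

```python
import numpy as np

def wls_cluster(y, X, w, groups):
    """Weighted least squares with one-way cluster-robust (CR0) standard errors."""
    sw = np.sqrt(w)
    Xw, yw = X * sw[:, None], y * sw      # rescale so WLS becomes OLS
    bread = np.linalg.inv(Xw.T @ Xw)
    beta = bread @ (Xw.T @ yw)
    u = yw - Xw @ beta                    # weighted residuals
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):           # sum score outer products by cluster
        m = groups == g
        s = Xw[m].T @ u[m]
        meat += np.outer(s, s)
    cov = bread @ meat @ bread            # sandwich covariance
    return beta, np.sqrt(np.diag(cov))
```

With clustering, the coefficient estimates are unchanged relative to ordinary weighted least squares; only the standard errors (and hence the Pr(Ho) values) move, which is why the point estimates in the first and middle columns of Table IV.5 coincide.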
Table IV.6: Estimated average student rating of instructors (SRI) by race-gender subgroups of instructors and students, controlling for other factors that affect SRIs

Panel A: Each cell reports each subgroup pairing's estimated class-average SRI differential versus the overall sample average (Pr(Ho) in parentheses)

Instructor  Enrollment
subgroup    weight**        WMs             WFs             UMs             UFs             OMs             OFs
WMi          39.7%      0.285 (0.089)  -0.212 (0.181)   0.585 (0.061)   0.137 (0.598)   0.022 (0.949)   0.165 (0.559)
WFi          28.5%     -0.371 (0.102)  -0.036 (0.771)   0.564 (0.160)  -0.083 (0.803)   0.201 (0.589)   0.392 (0.086)
UMi           8.5%     -0.427 (0.302)  -0.457 (0.118)   0.458 (0.152)  -0.066 (0.796)   0.573 (0.608)   0.594 (0.121)
UFi           9.5%     -0.057 (0.870)  -0.075 (0.783)  -0.008 (0.989)   0.113 (0.766)   0.426 (0.441)  -0.683 (0.203)
OMi           5.6%      0.139 (0.660)  -0.175 (0.693)  -1.273 (0.084)  -1.261 (0.025)   0.367 (0.349)  -0.248 (0.596)
OFi           8.3%     -0.193 (0.637)  -0.032 (0.904)  -0.115 (0.883)  -0.136 (0.746)   0.227 (0.524)  -0.831 (0.101)
Instructor subgroup weighted** average differential of each student subgroup from the overall sample average:
                       -0.042          -0.153           0.351          -0.046           0.194           0.081

Panel B: The diagonal entries serve as column benchmarks, and the other cells in each column report differentials versus the column benchmark (Pr(Ho) in parentheses)

Instructor  Enrollment
subgroup    wtd. avg.*      WMs             WFs             UMs             UFs             OMs             OFs
WMi      0.105 (0.010)  0.285 (0.089)  -0.176 (0.386)   0.127 (0.780)   0.024 (0.959)  -0.346 (0.499)   0.997 (0.084)
WFi     -0.002 (0.972) -0.656 (0.029)  -0.036 (0.771)   0.105 (0.836)  -0.196 (0.694)  -0.167 (0.751)   1.223 (0.028)
UMi     -0.077 (0.423) -0.712 (0.113)  -0.421 (0.167)   0.458 (0.152)  -0.180 (0.698)   0.205 (0.863)   1.426 (0.025)
UFi     -0.063 (0.560) -0.342 (0.375)  -0.039 (0.895)  -0.466 (0.451)   0.113 (0.766)   0.059 (0.930)   0.148 (0.843)
OMi     -0.243 (0.013) -0.146 (0.681)  -0.140 (0.766)  -1.731 (0.032)  -1.374 (0.045)   0.367 (0.349)   0.584 (0.395)
OFi     -0.181 (0.152) -0.478 (0.287)   0.003 (0.991)  -0.573 (0.498)  -0.250 (0.665)  -0.140 (0.790)  -0.831 (0.101)
Instructor subgroup weighted** average differential of each student subgroup from its respective benchmark:
                       -0.543          -0.164          -0.117          -0.176          -0.184           0.994

*derived from enrollment data in Table IV.2 Panel A. **share of total enrolled students in dataset.
Least squares estimation of 4,297 observations of class-average ratings weighted by number of respondents and clustered by 443 instructors. Values in parentheses correspond to Pr(Ho). Ho for Panel A and the diagonals of Panel B: deviation from average = 0. Ho for Panel B off-diagonal estimates: deviation from column benchmark = 0.
Note: Overall sample class-average SRI, weighted by number of respondents: mean = 5.924, standard deviation = 0.837.

The fall in the estimated differentials for own-groups and the rise in the estimated differentials for cross-groups together serve to substantially reduce the size and statistical significance of the estimated cross-group benchmarked differentials reported in Panel B of Table IV.6. The average size of the estimated ratings differentials within student groups relative to their respective own-group benchmark falls by nearly half, from 0.63 ratings points in Table IV.3 to 0.32 ratings points (0.38 standard deviations), once non-demographic explanatory variables are included.
The pattern of results is less consistent as well, with six instances of positive differentials for cross-group ratings in Table IV.6 compared to only two in Table IV.3. Both with and without non-demographic control variables, the negative estimated differentials for White Male students rating White Female instructors and for Underrepresented minority students rating Other Male instructors are statistically significant at probabilities under five percent. For the outlier group of Other Female students, adding non-demographic control variables results in positive estimated differentials that are statistically significant at slightly less than a three percent probability for ratings of White Female and Underrepresented minority Male instructors. Chart IV.1 illustrates the impact of the addition of the non-demographic control variables on the estimated race-gender differentials. The bold dashed horizontal line indicates the overall sample average rating (5.92). The thicker solid arrows show how the estimated own-group ratings change. The thinner solid arrows show how the estimated cross-group ratings change, and they constitute the average of the dotted-line arrows that correspond to the disaggregated cross-group ratings. Except for the outlier case of Other Female students, the arrows indicate substantial convergence of the own-group and cross-group ratings when non-demographic factors are included as control variables. For the first four groups of White and Underrepresented minority students, the driving force is the large decline of the own-group ratings as the control variables move them closer to the overall sample average. For Other Male and Other Female students, the own-group ratings barely change, but the cross-group ratings rise substantially. This reversed pattern nonetheless promotes convergence for Other Male students, in line with the other student demographic groups.
Only for Other Female students do the own-group and cross-group ratings exhibit divergence, as that is the only student group for which the initial own-group rating is below its cross-group average. To paraphrase from previous research, the most pronounced impact of the addition of non-demographic control variables is to reduce estimates of student "same race-gender preferences" (Centra and Gaubatz, 2000: 32). It is in this respect that omitted variable bias matters most here, and it matters enough to expose the fragility of results for specifications that fail to control for non-demographic factors.

Chart IV.1: Estimated race-gender SRI differentials before and after the addition of non-demographic control variables. [Chart: average estimated SRI (vertical axis, 4.500 to 7.500) plotted for each student demographic group (horizontal axis; 1 = ratings before and 2 = ratings after including non-demographic control variables), with series for WMi, WFi, UMi, UFi, OMi and OFi instructors and for the cross-group average.]

The student evaluation forms provide additional self-reported information, such as the number of classes missed, self-assessments of the extent of knowledge and skills learned, and ratings of specific teaching strengths. However, considerable uncertainty applies to the questions of whether to include this self-reported information and, if so, how to interpret the resulting estimates. For example, if in fact gender per se plays no role in student ratings of instructors, but course organization differs by gender, then failure to include ratings of course organization results in a false impression that gender matters. If, instead, gender does matter, and to the same extent in student ratings of both overall instruction and organization, then including student ratings of organization masks the role of gender in student ratings.
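The first scenario can be made concrete with a deliberately artificial simulation (all numbers invented for illustration): when gender affects ratings only through course organization, omitting the organization rating creates a false impression that gender matters, and controlling for it absorbs the entire estimated "gender" effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
female = rng.integers(0, 2, size=n).astype(float)
# Gender per se plays no role, but organization differs by gender (invented
# effect sizes), and organization drives the overall rating.
organization = 0.5 * female + rng.normal(size=n)
rating = 5.5 + 0.8 * organization + rng.normal(size=n)

def ols_coefs(y, *cols):
    """OLS coefficients with an added constant; returns [const, slopes...]."""
    X = np.column_stack((np.ones(len(y)),) + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_omitting = ols_coefs(rating, female)[1]                   # ~0.8*0.5 = 0.4: gender appears to matter
b_controlling = ols_coefs(rating, female, organization)[1]  # ~0: effect vanishes once organization is included
```

Under the second scenario, where gender genuinely affects both overall and organization ratings, the same control would instead mask a real gender effect, which is why the interpretation of such specifications is ambiguous.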
In an effort to utilize the self-reported information while being mindful of these uncertainties, we ran estimates that progressively introduced self-reported information in three steps:

1. Self-reported behavior: (1) hours worked outside class, (2) number of classes missed, and (3) extent of conversation about course material outside of class.
2. Self-reported learning outcomes: (1) contribution of the course to knowledge, (2) contribution of the course to skills, and (3) expected course grade.
3. Ratings of instructor effectiveness in specific areas: (1) communicating course goals, (2) fulfilling course goals, (3) being organized, (4) giving clear assignments, (5) giving clear grading criteria, (6) giving helpful feedback, (7) being clear, (8) stimulating intellectual enthusiasm, (9) inviting students to confer outside class, (10) inviting questions, (11) asking questions, and (12) responding well to questions.

Since these data items vary across individual student evaluations, we used each evaluation rather than class averages as our unit of observation and estimated the equations as a multilevel model grouped by class and clustered by instructor. Table IV.7a reports the results for the ratings differentials. Adding the self-reported student behavior variables has only a modest effect on the estimated differentials: a less than 7 percent median decline in the estimated size of the differentials. Self-reported learning outcomes have a much larger effect, prompting a median decline of another 46 percent, and the ratings of specific teaching strengths have an even larger effect, resulting in a further median decline of 56 percent. Taken together, the addition of all of the self-reported variables is associated with an 82 percent median decline in the size of the estimated differentials.
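The median-decline comparisons are straightforward to compute from any column pair of Table IV.7a. As an arithmetic sketch, using only the six White Male student differentials under the first two specifications (the 7-percent figure quoted above is the median taken over all thirty-six pairings, not just these six, so the numbers differ):

```python
import statistics

# White Male student differentials from Table IV.7a under the base
# specification and the + behavior step (instructor order: WMi, WFi, UMi,
# UFi, OMi, OFi).
base     = [0.274, -0.572, -0.759, -0.441, -0.031, -0.494]
behavior = [0.274, -0.547, -0.803, -0.344, -0.020, -0.490]

def median_decline(prev, curr):
    """Median percent decline in the absolute size of the differentials."""
    return statistics.median(100 * (1 - abs(c) / abs(p))
                             for p, c in zip(prev, curr))

step_one = median_decline(base, behavior)  # modest single-digit decline
```

Negative entries (differentials that grew in absolute size, such as the UMi column here) pull the median down, which is why the overall step-one effect is small.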
A small number of estimated differentials remain statistically significant, but the sizes are uniformly small, averaging less than 0.05 ratings points for both the own-group differentials versus the sample mean and the cross-group differentials versus their respective benchmarks.

Table IV.7a: Ratings differentials versus (1) the sample mean rating for own-groups and (2) the own-group rating for cross-groups: the impact of including self-reported information

Specifications, left to right: (1) includes student demographics and the previous list of non-demographic control variables; (2) plus self-reported student behavior (hours, absences, conversations); (3) plus self-reported learning outcomes (knowledge, skills); (4) plus student ratings of teaching strengths. Each cell reports Diff. (Pr(Ho)).

                             (1)              (2)              (3)              (4)
Observations              71,467           70,263           68,108           66,240
Sample mean SRI            5.922            5.922            5.929            5.938

White Male students evaluating…
  WMi    0.274 (0.077)    0.274 (0.089)    0.137 (0.124)    0.094 (0.002)
  WFi   -0.572 (0.028)   -0.547 (0.032)   -0.280 (0.062)   -0.109 (0.040)
  UMi   -0.759 (0.079)   -0.803 (0.058)   -0.557 (0.039)   -0.193 (0.044)
  UFi   -0.441 (0.250)   -0.344 (0.356)   -0.099 (0.656)   -0.041 (0.614)
  OMi   -0.031 (0.931)   -0.020 (0.956)    0.011 (0.963)    0.107 (0.220)
  OFi   -0.494 (0.245)   -0.490 (0.240)   -0.298 (0.218)   -0.041 (0.573)
White Female students evaluating…
  WMi    0.008 (0.967)    0.010 (0.960)    0.001 (0.994)    0.007 (0.878)
  WFi   -0.048 (0.694)   -0.044 (0.715)   -0.092 (0.194)   -0.073 (0.013)
  UMi   -0.377 (0.225)   -0.389 (0.187)   -0.108 (0.578)   -0.017 (0.820)
  UFi    0.109 (0.723)    0.120 (0.684)    0.079 (0.651)   -0.018 (0.785)
  OMi   -0.328 (0.449)   -0.296 (0.504)   -0.269 (0.198)   -0.050 (0.438)
  OFi    0.043 (0.889)    0.024 (0.936)   -0.019 (0.906)    0.032 (0.658)
Underrepresented minority Male students evaluating…
  WMi   -0.163 (0.688)   -0.169 (0.683)    0.134 (0.643)    0.063 (0.591)
  WFi   -0.286 (0.596)   -0.131 (0.803)    0.031 (0.918)    0.028 (0.812)
  UMi    0.668 (0.022)    0.581 (0.048)    0.247 (0.261)    0.076 (0.408)
  UFi   -0.756 (0.212)   -0.679 (0.255)   -0.235 (0.490)   -0.084 (0.601)
  OMi   -1.539 (0.043)   -1.261 (0.088)   -0.220 (0.628)    0.036 (0.840)
  OFi   -0.624 (0.451)   -0.635 (0.432)   -0.036 (0.917)   -0.075 (0.614)
Underrepresented minority Female students evaluating…
  WMi   -0.093 (0.836)   -0.023 (0.958)   -0.018 (0.949)   -0.009 (0.918)
  WFi   -0.265 (0.613)   -0.301 (0.556)   -0.113 (0.704)    0.000 (0.997)
  UMi    0.093 (0.835)    0.120 (0.784)    0.095 (0.757)    0.112 (0.235)
  UFi    0.081 (0.827)    0.039 (0.911)    0.072 (0.753)   -0.029 (0.680)
  OMi   -1.467 (0.012)   -1.369 (0.018)   -0.872 (0.018)   -0.241 (0.103)
  OFi   -0.096 (0.856)   -0.013 (0.979)    0.078 (0.810)    0.010 (0.955)
Other Male students evaluating…
  WMi   -0.193 (0.665)   -0.240 (0.575)   -0.218 (0.445)   -0.114 (0.363)
  WFi   -0.053 (0.910)   -0.045 (0.919)   -0.026 (0.926)   -0.158 (0.199)
  UMi    0.014 (0.990)    0.086 (0.930)    0.028 (0.966)   -0.056 (0.804)
  UFi    0.095 (0.872)   -0.002 (0.997)   -0.123 (0.750)   -0.235 (0.123)
  OMi    0.244 (0.469)    0.248 (0.432)    0.166 (0.440)    0.218 (0.033)
  OFi    0.030 (0.947)    0.056 (0.894)   -0.048 (0.853)   -0.135 (0.279)
Other Female students evaluating…
  WMi    0.936 (0.098)    0.860 (0.141)    0.359 (0.210)    0.001 (0.992)
  WFi    1.494 (0.006)    1.415 (0.013)    0.572 (0.036)    0.004 (0.974)
  UMi    1.329 (0.051)    1.237 (0.074)    0.345 (0.379)   -0.101 (0.530)
  UFi    0.281 (0.684)    0.281 (0.688)    0.152 (0.667)    0.071 (0.662)
  OMi    0.724 (0.265)    0.613 (0.369)    0.155 (0.655)   -0.137 (0.315)
  OFi   -0.989 (0.054)   -0.921 (0.086)   -0.305 (0.217)   -0.023 (0.817)

However, it is evident from Table IV.7b that the same impact pattern pertains to the estimated coefficients for our initial set of non-demographic control variables as well. A median decline of 8 percent results from the addition of the behavior variables, another 41 percent from the learning-outcomes variables, and another 65 percent from the ratings of teaching strengths. The overall median decline totals 85 percent.

Table IV.7b: The impact of including self-reported information on estimated coefficients for non-demographic variables

Each cell reports Coef. (Pr(Ho)). Specifications, left to right: demographics only; plus self-reported behavior; plus learning outcomes; plus ratings of teaching strengths.

Included variable                         Demographics      + behavior        + outcomes        + ratings
100-level course                         -0.087 (0.150)   -0.084 (0.150)   -0.052 (0.135)    0.006 (0.609)
400-level course                         -0.162 (0.051)   -0.230 (0.007)   -0.059 (0.232)   -0.021 (0.233)
Cultural Studies Program course          -0.232 (0.047)   -0.263 (0.024)   -0.180 (0.016)   -0.027 (0.297)
Arts and Humanities course                0.051 (0.587)    0.066 (0.468)    0.024 (0.620)    0.033 (0.029)
Science course                           -0.285 (0.008)   -0.289 (0.005)   -0.174 (0.002)   -0.016 (0.275)
Seminar course                            0.081 (0.241)    0.061 (0.375)    0.024 (0.584)   -0.010 (0.495)
Part-time adjunct instructor             -0.334 (0.000)   -0.320 (0.000)   -0.192 (0.000)   -0.049 (0.004)
Full-time adjunct instructor             -0.196 (0.041)   -0.172 (0.067)   -0.102 (0.073)   -0.035 (0.039)
Years of experience at Occidental         0.011 (0.536)    0.010 (0.575)    0.007 (0.507)    0.003 (0.280)
Years of experience if greater than six  -0.016 (0.290)   -0.015 (0.329)   -0.010 (0.264)   -0.003 (0.323)
New offering by instructor               -0.184 (0.012)   -0.184 (0.009)   -0.088 (0.028)   -0.034 (0.012)
Total enrollment                          0.002 (0.563)    0.002 (0.478)    0.003 (0.104)    0.000 (0.898)
Average student seniority                 0.066 (0.090)    0.047 (0.215)   -0.023 (0.331)   -0.018 (0.027)
Percent graduate student enrollment       0.038 (0.925)   -0.116 (0.751)    0.031 (0.871)   -0.016 (0.814)
Percent enrolled for Core requirement    -0.041 (0.003)   -0.025 (0.061)    0.010 (0.344)    0.006 (0.400)
Percent enrolled for Major requirement    0.045 (0.000)    0.018 (0.126)   -0.012 (0.189)    0.004 (0.444)
Evaluation response rate                  0.708 (0.000)    0.625 (0.000)    0.336 (0.000)    0.073 (0.022)
Average grade awarded                     0.247 (0.002)    0.228 (0.003)    0.212 (0.000)    0.073 (0.000)
Average expected grade                    0.324 (0.000)    0.286 (0.000)    0.070 (0.000)   -0.002 (0.669)
Average hours worked outside of class                      0.010 (0.000)    0.000 (0.875)   -0.001 (0.162)
Average number of classes missed                          -0.016 (0.000)    0.003 (0.083)    0.002 (0.038)
Average rating: discussed course material outside class    0.138 (0.000)    0.028 (0.000)   -0.003 (0.143)
Course contribution to knowledge                                            0.421 (0.000)    0.115 (0.000)
Course contribution to skills                                               0.374 (0.000)    0.069 (0.000)
Instructor communicated goals                                                               -0.030 (0.000)
Instructor fulfilled goals                                                                   0.162 (0.000)
Course organization                                                                          0.141 (0.000)
Clear assignments                                                                            0.029 (0.000)
Clear grading criteria                                                                       0.030 (0.000)
Helpful feedback                                                                             0.087 (0.000)
Clear explanations of concepts                                                               0.182 (0.000)
Motivated intellectual enthusiasm                                                            0.171 (0.000)
Instructor invited individual meetings                                                       0.023 (0.000)
Instructor invited questions                                                                 0.015 (0.002)
Instructor asked questions                                                                   0.050 (0.000)
Instructor was responsive to questions                                                       0.133 (0.000)

The estimated coefficients for all of the variables for self-reported outcomes and ratings of teaching strengths are statistically significant, and all but one are positive, as expected. Moreover, the ranking of the sizes of the estimated relationships between the itemized teaching strengths and the overall SRI has a commonsense pattern to it, with clarity, intellectual enthusiasm, fulfilling course goals, organization and responsiveness to questions topping the list. However, the association between the overall SRI and the additional self-reported information variables is virtually tautological, so it is hardly surprising that including them swamps the estimated effects of both the non-demographic variables and the demographic variables included in our original specification.

IV.D.
Applying a Oaxaca decomposition to average SRI differentials by gender and race

The Oaxaca (1973) decomposition was initially utilized by labor economists to explore the possibility of wage discrimination by gender. The decomposition entails a two-step process in which (1) a wage-determination regression equation is estimated for a benchmark group, say male workers, yielding the coefficient vector M, and (2) the average comparison-group wage, in this case for female workers, is predicted by evaluating the benchmark equation at the comparison-group means of the explanatory variables (M'X̄F). The gap between this predicted wage and the average benchmark wage constitutes the amount of the gender wage differential that is "explained" by differences in productivity characteristics ("endowments"); any remaining gap is unexplained by productivity differences and may be due to gender discrimination in the labor market. Here, we apply a Oaxaca decomposition to SRI differentials by gender and race. We again incorporate student self-reported information in the same sequence as in Section IV.C, so we likewise again estimate a multilevel model of individual student evaluations grouped by class and clustered by instructor. Table IV.8 reports the decomposition results, Panel A for gender and Panel B for race. Table IV.8:
Blinder-Oaxaca decomposition of overall student rating of instructor (SRI) differentials by gender and race

Each column reports the decomposition when average Female / nonwhite instructor characteristics are substituted into the estimated benchmark regression equation, which includes non-demographic control variables plus, moving left to right: (1) benchmark student demographics; (2) plus subgroup-specific average student self-reported behavior; (3) plus learning-outcome ratings; (4) plus teaching-strength ratings.

Panel A: By gender. Benchmark = Male instructors
                                             (1)               (2) +Behavior     (3) +Outcomes     (4) +Strengths
Benchmark respondents (Avg. SRI / N)     5.953 / 38,381    5.953 / 37,741    5.961 / 36,637    5.969 / 35,609
Respondents for Female instructors       5.887 / 33,086    5.887 / 32,522    5.893 / 31,471    5.903 / 30,631
Total differential versus Benchmark         -0.066            -0.066            -0.067            -0.067
Accounted for by differences in… (Amount / Percent)
1. Instructor non-demographic factors    -0.001 / -1.1%    -0.002 / -2.7%     0.001 / 1.2%     -0.001 / -2.2%
2. Student self-reported info and ratings       —          -0.019 / -28.4%   -0.068 / -100.4%  -0.044 / -65.6%
3. Class non-demographic characteristics  0.000 / -0.7%     0.004 / 5.5%      0.004 / 6.4%      0.005 / 8.0%
4. Student race-gender class composition -0.014 / -20.8%   -0.013 / -19.4%   -0.006 / -9.0%    -0.008 / -12.4%
Predicted average SRI (M'X̄F)              5.938             5.923             5.892             5.921
Unexplained differential                 -0.051 / -77.4%   -0.036 / -55.0%    0.001 / 1.7%     -0.018 / -27.7%

Panel B: By race. Benchmark = White instructors
Benchmark respondents (Avg. SRI / N)     5.953 / 48,645    5.952 / 47,786    5.959 / 46,399    5.969 / 45,096

Panel B1: Respondents for Underrepresented minority instructors
Respondents (Avg. SRI / N)               5.933 / 12,886    5.935 / 12,683    5.944 / 12,250    5.952 / 11,922
Total differential versus Benchmark         -0.019            -0.017            -0.015            -0.017
1. Instructor non-demographic factors     0.066 / 340.9%    0.061 / 349.4%    0.042 / 268.7%    0.008 / 44.9%
2. Student self-reported info and ratings       —          -0.003 / -16.5%   -0.032 / -209.1%  -0.006 / -34.9%
3. Class non-demographic characteristics  0.029 / 151.6%    0.033 / 192.0%    0.022 / 139.4%    0.003 / 18.1%
4. Student race-gender class composition -0.004 / -19.3%   -0.005 / -28.8%    0.007 / 42.4%    -0.002 / -12.3%
Predicted average SRI (W'X̄U)              6.044             6.039             5.996             5.971
Unexplained differential                 -0.111 / -573.2%  -0.104 / -596.2%  -0.053 / -341.5%  -0.019 / -115.9%

Panel B2: Respondents for Other instructors
Respondents (Avg. SRI / N)               5.754 / 9,936     5.754 / 9,794     5.763 / 9,459     5.772 / 9,222
Total differential versus Benchmark         -0.199            -0.199            -0.196            -0.196
1. Instructor non-demographic factors     0.036 / 17.9%     0.031 / 15.6%     0.021 / 10.7%    -0.002 / -1.0%
2. Student self-reported info and ratings       —          -0.017 / -8.3%    -0.124 / -63.3%   -0.203 / -103.2%
3. Class non-demographic characteristics  0.019 / 9.6%      0.022 / 11.1%     0.016 / 8.0%     -0.002 / -0.9%
4. Student race-gender class composition  0.015 / 7.6%      0.013 / 6.4%      0.012 / 6.1%      0.008 / 4.1%
Predicted average SRI (W'X̄O)              6.022             6.001             5.883             5.770
Unexplained differential                 -0.269 / -135.0%  -0.248 / -124.8%  -0.121 / -61.5%    0.002 / 1.0%

Note: Variables include: 1. Instructor non-demographic factors: adjunct status, experience, new course preparation, response rate, GPA and expected grade; 2. Student self-reported information and ratings: (1) behavior: hours, absences, extent of out-of-class course conversation; (2) ratings of learning outcomes; and (3) ratings of itemized teaching strengths; 3. Class non-demographic characteristics: course level and division, enrollment, seminar status, average student seniority, and percentages enrolled for Core and for Major requirements; 4. Student race-gender class composition: percentage enrollment of White Females, Underrepresented minority Males, Underrepresented minority Females, Other Males and Other Females.
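The two-step procedure described at the start of this section reduces to a few lines of linear algebra. The sketch below is an illustrative two-fold decomposition on invented data, not the authors' estimation code; in the paper's notation, the explained portion is M'X̄F minus the benchmark mean, and the remainder is the unexplained differential:

```python
import numpy as np

def oaxaca(y_bench, X_bench, y_comp, X_comp):
    """Two-fold Blinder-Oaxaca decomposition using benchmark-group coefficients.

    Returns (explained, unexplained), which sum to the raw mean gap
    y_comp.mean() - y_bench.mean().  X matrices must include a constant column.
    """
    M = np.linalg.lstsq(X_bench, y_bench, rcond=None)[0]   # step (1): benchmark OLS
    gap = y_comp.mean() - y_bench.mean()
    # Step (2): evaluate the benchmark equation at comparison-group means.
    explained = M @ (X_comp.mean(axis=0) - X_bench.mean(axis=0))
    return explained, gap - explained
```

When the two groups share the same outcome-determination process and differ only in characteristics, the unexplained component is zero; conversely, a sizeable unexplained component is the portion of the gap the included variables cannot account for.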
Benchmark student demographics Avg.SRI Sample 5.953 38,381 5.887 33,086 -0.066 Amount Percent -0.001 -1.1% The Oaxaca decomposition as applied here includes some complexities not encountered in the analysis of gender wage differentials, so it is instructive to first interpret the pattern of results reported in Table IV.8. Consider, for example, the second set of columns in Panel A, the student ratings differential by gender for the specification which includes student self-reported data for study hours, absences and the extent of out-of-class conversations about course material. Overall, the average SRI for female instructors is 0.066 points lower than for male instructors. The first category of explanatory variables, the non-demographic instructor factors such as teaching experience, is most analogous to productivity characteristics in wage determination models, and here it accounts for less three percent of the total ratings differential. Self-reported student behavior variables account for another 28 percent of the differential and may serve as proxies for Page 29 of 35 instructor effectiveness in engaging students, although we hasten to note as before that it is unclear whether the association between the overall SRI and other student self-reported information reflects causation or codetermination. Non-demographic class characteristics such as enrollment and course discipline have a positive value of .004 ratings points (5.5% of the total differential) which suggests that Male instructors facing class characteristics that match what Female instructors experience on average would receive a higher rather than lower rating, thereby widening instead of narrowing the adjusted gender differential. The demographic composition of enrolled students accounts for 19.4% of the ratings differential which helps to close the ratings gap while simultaneously falling into the category of ratings factors that are linked to race and gender. 
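The mechanics of the twofold decomposition discussed above can be sketched with a small simulation. Everything below is hypothetical illustration, not the paper's data or code: two group-specific OLS regressions are fit, and the raw mean gap splits exactly into a part explained by differences in average characteristics (evaluated at the benchmark group's coefficients) and an unexplained part due to coefficient differences.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, y):
    # OLS coefficients; X already contains an intercept column
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical data: an SRI-like outcome y and k characteristics X for a
# benchmark group (m) and a comparison group (f); all values are illustrative
n_m, n_f, k = 500, 400, 3
X_m = np.column_stack([np.ones(n_m), rng.normal(1.0, 1.0, (n_m, k))])
X_f = np.column_stack([np.ones(n_f), rng.normal(0.8, 1.0, (n_f, k))])
beta = np.array([5.5, 0.10, 0.05, -0.02])
y_m = X_m @ beta + rng.normal(0, 0.3, n_m)
y_f = X_f @ (beta - 0.02) + rng.normal(0, 0.3, n_f)  # slightly different coefficients

b_m, b_f = ols(X_m, y_m), ols(X_f, y_f)
xbar_m, xbar_f = X_m.mean(axis=0), X_f.mean(axis=0)

total = y_m.mean() - y_f.mean()
explained = b_m @ (xbar_m - xbar_f)   # characteristics gap at benchmark coefficients
unexplained = (b_m - b_f) @ xbar_f    # coefficient gap at comparison-group means

# With an intercept, OLS fits each group's mean exactly, so the parts sum to the total
assert np.isclose(total, explained + unexplained)
```

In this notation, the table's "Predicted average SRI" entries correspond to `b_m @ xbar_f`. The split is sensitive to the choice of benchmark coefficients, which is why each panel of the table states its benchmark group explicitly.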
After controlling for all four categories of variables, 0.036 ratings points, amounting to 55% of the unadjusted differential, remains unexplained. Combined with the student demographic variables, just under 75% of the total differential may be attributable to gender considerations.

The unadjusted and unexplained differentials in Table IV.8 are generally small for both gender and race. The largest gap is -0.269 ratings points (-0.32 standard deviations), the unexplained differential between Other instructors and White instructors; all of the deviations for Female or Underrepresented minority instructors are less than half as large. The non-demographic factors for both instructors and the classes they teach never account for much of the ratings differentials by either gender or race. In fact, for nearly all of the decompositions by race, the non-demographic factors widen rather than narrow the adjusted differentials and thereby increase the size of the unexplained differential. This finding is consistent with the earlier results reported in Section IV.C, in which the estimated race-gender ratings differentials were larger with non-demographic control variables (Table IV.5) than without them (Table IV.1). Factoring student ratings of learning outcomes and itemized teaching strengths into the decomposition accounts for large percentages of the observed differentials, but our previous caveats apply regarding the nature of the association between these disaggregated ratings and the overall SRI.

IV.E. Findings from a subsample of multi-section courses taught contemporaneously by the same instructor

Our total sample of 4,297 classes includes a subsample of 440 multi-section courses (accounting for 895 classes in the full sample) taught by the same instructor. Nearly all of these courses comprise two sections; a few comprise three.
Since most of the non-demographic control variables are constant across these sets of course sections, the subsample affords an opportunity to focus more precisely on the relationship between ratings of instructors and the demographic composition of the classes they teach. We have done that here by treating the subsample as an unbalanced panel dataset for estimation purposes. Table IV.9 reports the results. For ease of comparison, Panel A is a repeat of the full-sample results from Table IV.6. The panel data estimates appear in Panel B. The impact of sample size is apparent: none of the five estimated cross-group differentials that are statistically significant in the full sample remain so in the smaller panel data subsample even though three of the estimates increase in size. Instead, the only statistically significant finding in the panel data is a positive estimated differential for the average rating of White Male instructors by Underrepresented minority Male students. Consistent with the full-sample estimates, there are no statistically significant estimated same-group differentials in the panel data sample. However, it is worth noting that statistical significance for the panel data is a tall order with only 164 different instructors sorted into six demographic groups that range in size from 16 to 55 instructors, with a maximum of 21 for the four groups of nonwhite instructors. 
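The panel estimates reported below come from Stata's XTREG applied to the unbalanced panel. As a rough sketch of the underlying logic, with entirely hypothetical data and a single illustrative regressor, the within (fixed-effects) transformation demeans each variable by instructor, sweeping out everything that is constant across an instructor's course sections so that only within-instructor variation in class composition identifies the coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical subsample: 50 instructors, each teaching 2 sections of a course
n_instr, sections = 50, 2
instr = np.repeat(np.arange(n_instr), sections)   # instructor id for each class
alpha = rng.normal(6.0, 0.3, n_instr)             # instructor fixed effects
pct_wf = rng.uniform(0.0, 0.6, instr.size)        # illustrative regressor: a
                                                  # demographic enrollment share
sri = alpha[instr] + 0.15 * pct_wf + rng.normal(0, 0.02, instr.size)

def within(v, groups):
    # Demean by instructor: removes any factor constant across an
    # instructor's sections (experience, course prep, and so on)
    means = np.bincount(groups, weights=v) / np.bincount(groups)
    return v - means[groups]

y_dm, x_dm = within(sri, instr), within(pct_wf, instr)
beta_fe = (x_dm @ y_dm) / (x_dm @ x_dm)   # one-regressor within estimator
```

With the fixed effects swept out, `beta_fe` recovers the composition effect (0.15 in this simulation) from within-instructor contrasts alone; the cost, as noted above, is far less identifying variation, so standard errors grow and statistical significance becomes a tall order.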
Table IV.9: Ratings differentials versus (1) sample mean rating for own groups and (2) own-group benchmark rating for cross groups, total sample versus subsample of multi-section courses taught by the same instructor

|                           | Panel A: All classes (from Table IV.6) | Panel B: Multi-section courses taught by the same instructor |
| Unit of observation       | 4,297 classes              | 440 multi-section courses (895 classes) |
| Standard error adjustment | 443 clusters by instructor | 164 clusters by instructor              |
| Estimation method         | WLS (by evaluation count)  | XTREG (unbalanced panel data)           |
| Sample mean SRI           | 5.924                      | 5.863                                   |

Panel A: All classes. Estimates (p-values); instructor subgroups (rows, suffix i) by student subgroups (columns, suffix s); W = White, U = Underrepresented minority, O = Other; M = Male, F = Female:

| Instructor | WMs            | WFs            | UMs            | UFs            | OMs           | OFs            |
| WMi        | 0.285 (0.089)  | -0.176 (0.386) | 0.585 (0.780)  | 0.137 (0.959)  | 0.022 (0.499) | 0.165 (0.084)  |
| WFi        | -0.656 (0.029) | -0.036 (0.771) | 0.564 (0.836)  | -0.083 (0.694) | 0.201 (0.751) | 0.392 (0.028)  |
| UMi        | -0.712 (0.113) | -0.421 (0.167) | 0.458 (0.152)  | -0.066 (0.698) | 0.573 (0.863) | 0.594 (0.025)  |
| UFi        | -0.342 (0.375) | -0.039 (0.895) | -0.008 (0.451) | 0.113 (0.766)  | 0.426 (0.930) | -0.683 (0.843) |
| OMi        | -0.146 (0.681) | -0.140 (0.766) | -1.273 (0.032) | -1.261 (0.045) | 0.367 (0.349) | -0.248 (0.395) |
| OFi        | -0.478 (0.287) | 0.003 (0.991)  | -0.115 (0.498) | -0.136 (0.665) | 0.227 (0.790) | -0.831 (0.101) |

Panel B: Multi-section courses taught by the same instructor:

| Instructor | WMs            | WFs            | UMs            | UFs            | OMs            | OFs            |
| WMi        | 0.257 (0.191)  | -0.202 (0.525) | 1.309 (0.006)  | 0.068 (0.874)  | -0.026 (0.972) | 0.850 (0.276)  |
| WFi        | -0.676 (0.056) | 0.030 (0.897)  | 0.050 (0.932)  | 0.266 (0.612)  | 0.057 (0.942)  | 0.839 (0.278)  |
| UMi        | 0.311 (0.331)  | -0.657 (0.101) | 0.164 (0.531)  | 0.278 (0.615)  | 0.477 (0.664)  | 0.248 (0.789)  |
| UFi        | 0.255 (0.459)  | -0.638 (0.101) | 0.424 (0.469)  | -0.175 (0.546) | -0.588 (0.436) | -0.105 (0.895) |
| OMi        | 1.043 (0.109)  | -0.110 (0.891) | -1.318 (0.443) | 0.167 (0.902)  | -0.146 (0.807) | -1.333 (0.204) |
| OFi        | -0.456 (0.338) | -0.491 (0.213) | 0.705 (0.189)  | 0.895 (0.158)  | 0.529 (0.414)  | -0.555 (0.415) |

Values in parentheses correspond to Pr(H0). H0 for Panel A and for the diagonal of Panel B: deviation from the own-group average = 0. H0 for Panel B off-diagonal estimates: deviation from the column benchmark = 0.

The panel-data estimates themselves also differ considerably from their corresponding full-sample estimates. Most of the changes in the estimated cross-group differentials are sizeable, averaging 0.49 ratings points in absolute value (0.59 standard deviations). Moreover, the pattern of the estimates reverses: 18 of the 30 estimated cross-group differentials are negative for the full sample and positive for the panel data. In short, the panel-data estimates here fail to provide corroborating support for the full-sample results or evidence that class demographic composition plays a role in the ratings of individual instructors across the classes they teach.

V. Concluding observations

We have utilized in this paper several well-established econometric techniques to explore potential race-gender differentials in Student Ratings of Instructors for a large sample of student evaluations from a diverse liberal arts college. Our results follow a ping-pong pattern. For our full-sample estimates, initially small but statistically significant differentials become statistically insignificant with appropriate clustering of the data; then a coherent pattern of sizeable, and in some cases statistically significant, estimates of own-group and cross-group differentials that emerges when student class composition is incorporated in turn becomes considerably less consistent, sizeable, and statistically significant once non-demographic control variables are included. An Oaxaca decomposition suggests that, at the aggregate level, non-demographic characteristics of instructors and classes do not account for the observed ratings differentials by instructor gender and race, a finding that is consistent with the possibility that the differentials arise from race and gender considerations per se, although the differentials themselves are small.
Panel-data estimation applied to a subsample of multi-section courses taught by the same instructor yields results that do not provide evidence of race-gender ratings differentials. It may be that the same institutional commitment to diversity that has produced the relatively wide range of instructor-student demographic pairings in our dataset has also contributed to sample selection bias by attracting students who themselves place a relatively high premium on diversity. In any event, robust and statistically significant findings related to race or gender differentials must pass through a challenging gauntlet that includes clustering of observations to adjust for heteroskedasticity and within-cluster correlation, demographic heterogeneity on both sides of the equation, and potential control variables that risk omitted variable bias if excluded. Dataset challenges alone are daunting. Our raw data consist of 74,072 student evaluations spanning seven academic years, but the effective sample size shrivels once the data are suitably clustered into 4,297 classes taught by 443 instructors, who, despite our relatively high faculty diversity, subdivide into six demographic categories in which nonwhite instructors range in number from 26 to 42 overall and only 9 to 14 when adjunct faculty are excluded. The persistence of discrimination remains an important social concern, and teasing out the extent to which race and gender differentials reflect discrimination remains a challenging methodological concern. Prospectively, student ratings of instructors constitute a rich source of information for exploring these issues, but the results here illustrate the high hurdles encountered right at the starting point: determining the magnitude and statistical significance of the differentials themselves.
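The role of clustering in that gauntlet can be illustrated with a small simulation (all data hypothetical): when the many evaluations within a class share a class-level shock, treating them as independent observations understates the standard error of a class-level regressor, and a cluster-robust (sandwich) variance corrects this.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 200 classes, 20 evaluations each, a class-level regressor x,
# and errors that mix a shared class shock with student-level noise
n_cls, per = 200, 20
cls = np.repeat(np.arange(n_cls), per)
x = np.repeat(rng.normal(size=n_cls), per)
e = rng.normal(0, 0.5, n_cls)[cls] + rng.normal(0, 0.5, cls.size)
y = 0.2 * x + e

X = np.column_stack([np.ones(x.size), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ b
bread = np.linalg.inv(X.T @ X)

# Naive variance: pretends all evaluations are independent
var_iid = (r @ r / (x.size - 2)) * bread

# Cluster-robust (CR0) variance: sums score contributions within each class
meat = np.zeros((2, 2))
for g in range(n_cls):
    s = X[cls == g].T @ r[cls == g]
    meat += np.outer(s, s)
var_cl = bread @ meat @ bread

se_iid, se_cl = np.sqrt(var_iid[1, 1]), np.sqrt(var_cl[1, 1])
# se_cl substantially exceeds se_iid because of the shared class shock
```

The same logic motivates clustering by class and by instructor in the analysis above: significance that survives only under the naive variance is an artifact of treating correlated evaluations as independent.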
Much larger datasets, particularly datasets spanning multiple institutions, or data that include the race and gender of individual student evaluators, might well shed additional light on whether the more meaningful finding here regarding own-group and cross-group ratings differentials is the relatively large size of the estimates in some specifications or their statistical insignificance in most cases.

Some pragmatic implications emerge from our analysis. The pattern of student ratings of instructors in our sample is more aptly characterized as a salad bowl than a melting pot. Several sizeable estimated differentials for specific student-instructor demographic pairings coexist with overall ratings differences between instructor demographic groups that are generally not large enough to play a material role in tenure and promotion decisions.3 The larger disaggregated differentials may be important, however, in particular teaching situations, such as required courses with limited instructor options for students to choose from. And for institutions undergoing an increase in diversity, balanced progress on both sides of the lectern is advantageous: it gives diverse students the instructor options they may well value while reducing the potential impact of student homogeneity on the ratings of diverse instructors. It is advisable for tenure and promotion committees to be mindful of these considerations, but it appears unnecessary, at least from the findings here, to make systematic adjustments to average student ratings based on instructor demographics at relatively diverse institutions.

3 Other Male instructors constitute the major possible exception, although our dataset includes only nine Asian Male faculty on tenured/tenure-track appointments.

References

1. Algozzine, B., et al. (2004). Student evaluation of college teaching: A practice in search of principles. College Teaching, 52 (4), 134-141.

2. Anderson, Kathryn H., and Siegfried, J.
(1997). Gender differences in rating the teaching of economics. Eastern Economic Journal, 23 (3), 347-357.

3. Anderson, K. J., & Smith, G. (2005). Students' preconceptions of professors: Benefits and barriers according to ethnicity and gender. Hispanic Journal of Behavioral Sciences, 2, 184-201.

4. Arreola, R. A. (2000). Developing a comprehensive faculty evaluation system: A handbook for college faculty and administrators on designing and operating a comprehensive faculty evaluation system (2nd ed.). Bolton, MA: Anker.

5. Cashin, W. E. (1995). Student ratings of teaching: The research revisited. IDEA Paper No. 32. Manhattan, KS: Kansas State University Center for Faculty Evaluation & Development.

6. Centra, J. A., & Gaubatz, N. B. (2000). Is there gender bias in student evaluations of teaching? Journal of Higher Education, 70 (1), 17-33.

7. Feldman, K. A. (1993). College students' views of male and female college teachers: Part II: Evidence from students' evaluations of their classroom teachers. Research in Higher Education, 34 (2), 151-211.

8. Gravestock, P., & Gregor-Greenleaf, E. (2008). Student course evaluations: Research, models, and trends. Toronto: Higher Education Quality Council of Ontario. Retrieved from http://www.heqco.ca/SiteCollectionDocuments/Student Course Evaluations.pdf.

9. Hamermesh, D. S., & Parker, A. M. (2005). Beauty in the classroom: Instructors' pulchritude and putative pedagogical productivity. Economics of Education Review, 24 (4), 369-376.

10. Hativa, N. (2013). Student Ratings of Instruction: Recognizing Effective Teaching. United States: Oron Publications.

11. Oaxaca, Ronald (1973). Male-female wage differences in urban labor markets. International Economic Review, 14 (3), 693-709.

12. Saunders, Kent T., and Saunders, Phillip (1999). The influence of instructor gender on learning and instructor ratings. Atlantic Economic Journal, 27 (4), 460-473.

13. Smith, B. P. (2007).
Student ratings of teaching effectiveness: An analysis of end-of-course faculty evaluations. College Student Journal, 41 (4), 788-800.

14. Smith, B. P., & Hawkins, B. (2011). Examining student evaluations of Black college faculty: Does race matter? The Journal of Negro Education, 80 (2), 149-162.

15. Smith, G., & Anderson, K. J. (2005). Students' ratings of professors: The teaching style contingency for Latino/a professors. Journal of Latinos and Education, 4, 115-136.

16. Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or a witch hunt in student ratings of instruction? In M. Theall, P. C. Abrami, & L. A. Mets (Eds.), The student ratings debate: Are they valid? How can we best use them? New Directions for Institutional Research (Vol. 109, pp. 45-56). San Francisco: Jossey-Bass.