SUMMARY NOTES ON "ON THE USE OF NUMERICALLY SCORED STUDENT EVALUATIONS OF FACULTY" BY WILLIAM RUNDELL, Report of the Department of Mathematics, Texas A&M University, 1997, 15 Pages

(http://www.math.tamu.edu/~william.rundell/teaching_evaluations/index.html)

Summarized by J. T. P. Yao (7/17/99)

"… Up until two years ago the Department of Mathematics at Texas A&M University relied on a student evaluation of instructor form that was heavily weighted towards verbal responses. … For a trial period a decision was made to utilize questionnaires that were electronically scanned and the output easily condensed to a few numbers that were (hopefully) indicative of the teaching performance of the instructor, albeit from the student’s point of view. From an administrative viewpoint, there is an obvious lure to measuring ‘teaching effectiveness’ by reduction to … a single number. Most often this number is the average of all the individual components. … Reading 10,000 student questionnaires each semester for 115 faculty and putting this into any kind of context is virtually impossible. Even for the relatively few cases where in-depth information must be acquired, it is often difficult for evaluators to use solely written comments and yet put them in a comparative perspective. … Since few departments have managed to similarly quantify other possible measures of evaluation, the single number stands out and eventually becomes the measure of teaching effectiveness."

"The purpose of this article is to examine some of the issues surrounding this method of analyzing the data; to come to some understanding of the information obtained and to determine whether the distillation down to a few numbers is a valid tool for the effective evaluation of teaching. In short, we want to see if what seems to be an increasingly common system of evaluation contains sufficient information of reliable nature as to be useful as a means of determining raises, promotion and tenure."

"The main part of the evaluation form consisted of ten questions with choices ranging from 1 (low) to 5 (high). There was also a space for comments. … the first five questions on the form were the ones suggested jointly by the Student and Faculty Senates. The other five questions were added by a department committee and were meant to complement the other five. … Typically, these evaluations were filled out in the last week of class, but prior to the final examination. The questionnaires were processed by a central measurement and testing service. Raw output consisted of n words, each of length ten and consisting of an alphabet of the characters 1, 2, 3, 4 and 5. Here n denotes the total number of responses. … Also available was the answer to an eleventh question, the student’s expected grade …, the total number enrolled in the section, N, … and the response rate n/N varied markedly. The mean response rate over all sections was 67.5% with a standard deviation of 16.3%."
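As a rough illustration of how raw output of this shape might be condensed, the sketch below parses a list of ten-character response words and computes the per-question means, the single overall average, and the response rate n/N. The function name `summarize_section` and the sample words are hypothetical; the report does not describe the processing code it used.

```python
from statistics import mean

def summarize_section(words, enrolled):
    """Condense raw response words (ten characters, each in '1'..'5') for one section."""
    for w in words:
        if len(w) != 10 or any(c not in "12345" for c in w):
            raise ValueError(f"malformed response word: {w!r}")
    n = len(words)  # total number of responses
    # Mean score for each of the ten questions
    question_means = [mean(int(w[q]) for w in words) for q in range(10)]
    # The "single number": the average of the individual components
    overall = mean(question_means)
    return {
        "n": n,
        "response_rate": n / enrolled,  # n/N
        "question_means": question_means,
        "overall": overall,
    }

# Hypothetical section: 3 responses from 4 enrolled students
summary = summarize_section(["5555555555", "3333333333", "4444444444"], enrolled=4)
```

Everything in a section is reduced to `overall` here, which is exactly the kind of distillation whose validity the report sets out to examine.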

"At Texas A&M we teach very little precalculus mathematics or material normally considered to be part of the high school curriculum. These courses account for about 6% of our enrollment. … Average total enrollment in these four courses is 5,200 students in the fall semester and 4,500 in the spring. … The average class size is 100. For the purposes of this study we will designate the above classes as constituting Group A."

"There is a three-semester calculus sequence for Mathematics and Science majors and a parallel, virtually identical in content, one for engineering students. Average enrollment in these sequences is 3,100 students in the fall and 2,600 in the spring. Class size varies from about 30 in the mathematics/science sequence to a little under a hundred students in the engineering sequence although the latter has recitation sections of average class size 30. We will designate these courses as Group B."

"Almost all of the students in Group B take a fourth semester class in differential equations and most of them take at least one further course in mathematics at the junior level. … The differential equations course has an average enrollment of 700 per semester and the other junior level courses have another 400. Class size is between 40 and 50 students. We will refer to these courses as being in Group C."

"We also teach a variety of courses at the junior and senior level primarily designated for mathematics majors; these will be denoted by Group D."

"… The data for this paper was accumulated over the four semesters the form was in use."

"Mean Evaluation Norm for Professorial Rank Faculty

Group       Course Mean
A           3.35
B           3.81
C           3.91
D           4.27
Graduate    4.47

… There are some flaws in interpreting the vast differences in mean scores in this table as being totally due to differences in the students. Both the undergraduate and graduate program committees successfully lobby for certain instructors to teach the courses for our majors so that the professors involved do not constitute a random sample."
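One simple way to see why the course group matters is to compare a rating against its own group's norm instead of against a department-wide figure. The sketch below uses the group means from the table; the function name `deviation_from_norm` and the sample rating of 3.9 are hypothetical, and subtracting the group mean is only an assumed adjustment, not a method taken from the report. As the quoted passage warns, the group norms themselves reflect non-random assignment of instructors.

```python
# Group means for professorial rank faculty, as reported in the table
GROUP_NORMS = {"A": 3.35, "B": 3.81, "C": 3.91, "D": 4.27, "Graduate": 4.47}

def deviation_from_norm(score, group):
    """Return how far a mean rating sits above (+) or below (-) its group norm."""
    return round(score - GROUP_NORMS[group], 2)

# The same hypothetical rating of 3.9 reads very differently by group
for group in GROUP_NORMS:
    print(group, deviation_from_norm(3.9, group))
```

A rating that is comfortably above the norm in Group A falls below it in Group D, so a single department-wide threshold would treat the two instructors inconsistently.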

"An intelligent use of data would surely take this into account. Would it also need to allow for the following?

  1. "On semester" classes tend to have better students and higher grades than those in the "off semester." …
  2. The time of the day the class was taken. … Does that suggest that a different type of student, in terms of how he or she evaluates, is forced eventually to take a less popular time period? The average grade point ratio given out in 8am sections is consistently lower than other time periods, especially when compared to midmorning sections.
  3. The percentage of students responding to the survey. … there is evidence that the demographics of those responding to the survey is not representative of the sample. …
  4. The percentage of Q-drops varies significantly between sections and this is true at all levels of course.

… Did instructor A, who received a mean rating of 4.3 but had 20% of his students Q-drop, achieve a better ‘customer satisfaction’ norm than instructor B, who had no Q-drops but a mean rating of 3.9? Professor A’s mean rating puts him into the well above category, while B’s places her at slightly below the department average. … Would you expect a correlation between the Q-drop rates and mean evaluation? …"

"Folklore in the Department, and indeed amongst mathematics faculty nationwide, has long held that there is a direct correlation between student evaluations and grades, despite an extensive claim in the Education literature to the contrary. … Care must be taken in selecting the courses and choosing the scales. We avoided courses that had strong coordination between sections, such as common exam format since these, at least in theory, should have no instructor-dependent variation in grades … On the other hand we wished to choose courses with a large number of sections in order to get statistically meaningful results. …"

[For detailed results and mathematical analyses of results, please read the original paper.]

"The correlations are too high to accept the hypothesis that grades and evaluations are unrelated."
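For readers who want to run a similar check on their own sections, the sketch below computes a Pearson correlation coefficient between per-section mean grade and mean evaluation. The choice of Pearson's r, the function name `pearson_r`, and the sample figures are all assumptions made for illustration; they are not the report's data, and the original paper should be consulted for the statistics it actually used.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Hypothetical per-section figures: mean grade (4-point scale) and mean evaluation (1 to 5)
grades = [2.1, 2.5, 2.8, 3.0, 3.4]
evals = [3.2, 3.6, 3.5, 4.0, 4.3]
r = pearson_r(grades, evals)
print(f"r = {r:.2f}")
```

A value of r near zero would be consistent with grades and evaluations being unrelated; the report's conclusion is that its observed correlations were too high for that.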

"It should be noted here that the number of D and F grades claimed by the students as their expected outcomes are considerably lower than actually given out for the class as a whole. The other grade values, while more in line with actuality, are over-inflated. …"

"Conclusions

We entered into the process of standardized evaluations with an open mind and were hoping, as many have in the past, for a silver bullet that would allow us to deal with the problem of evaluating teaching in an objective manner. If this could also be combined with a reduction in the workload of such a task then this was an added bonus.

There is much information that can be gained from the numerically-based responses and there is clearly a signal hidden in a background of more single-valued information. How to filter this background is much less clear. How to modify the responses in light of other information about the course is even less clear. …

However, the analysis we have performed on the data suggests that the distillation of evaluations to a single number without taking into account the many other factors can be seriously misleading. The correlation between positive student evaluations and grades awarded is sufficiently strong to indicate that a procedure based on numerical scores such as we have described is surely going to lead to grade inflation in the long run.

While the idea of tracking students’ progress through a sequence of courses is an attractive means of evaluating faculty performance, only a relatively small number of our enrollments in a given semester is in a chain of courses sufficiently structured for data to be collected. For various reasons some faculty do not teach these courses at all. The negative correlation that our study seems to indicate between the two measures of ‘carryon success’ and what we have discovered in this article is best described as ‘short term customer satisfaction,’ is very disturbing. If this is indeed the situation, then the use of student evaluations as a primary measure of teaching effectiveness, simply because it is easily normable, is a very questionable practice."

 

Follow-up of Summary Notes: "On the Use of Numerically Scored Student Evaluations of Faculty" by William Rundell, Report of the Department of Mathematics, Texas A&M University, 1997, 15 Pages

The contents of the current "teaching evaluation" form of the Department of Mathematics are as follows:

"To the student: Please take a few minutes to make some constructive comments to the following questions below. These evaluations will be read by your instructor and the departmental administration AFTER grades have been submitted. These evaluations are considered, along with other criteria, such as classroom visitation and curriculum materials, when making decisions on promotion and merit raises.

  1. Likely grade in this class: ________. Major:_____________
  2. Please comment about the instructor: What did (s)he do that you would want to remain the same? What did (s)he do that you would change?
  3. Please comment about the course: What would you keep the same? What would you change?"

In addition, Professor Arthur Hobbs (Mathematics, TAMU) kindly offered the following corrections:

"There are some problems … partly arising from the date of the article summarized. Prof. Yao's summary was in 1999 of a 1997 paper. Consequently, when he quoted from Rundell's paper ‘up until 2 years ago the Department of Mathematics at Texas A&M University relied on a student evaluation of instructor form that was heavily weighted towards verbal responses.’ most people who read the summary probably interpreted that as ‘up until 1997.’ In fact, until 1995, the Texas A&M University Mathematics Department used a self-developed form which was heavily weighted toward verbal responses but which did include a numerical component.

We switched to a form similar to the University-provided form in 1995 for a two-year experiment, which ended in 1997 with Rundell's report. In 1997, as a result of our findings, we switched to a form that excluded numerical data altogether.

Thus the further quote ‘Reading 10,000 student questionnaires each semester for 115 faculty and putting this into any kind of context is virtually impossible. Even for the relatively few cases where in-depth information must be acquired, it is often difficult for evaluators to use solely written comments and yet put them in a comparative perspective.’ from Rundell's report is accurate but misleading because it raises this sentence to an unjustified prominence. Although Rundell was conceding in his report that administrators would prefer not to read so many evaluations, in fact that is exactly what is now being done in this department.

Moreover, we are very happy with the result. Here are some comments Rundell made in a follow-up memorandum to the faculty in January of 1998:

‘I have just completed reading the student evaluations of faculty for the fall semester. ... It is a daunting task taking many, many hours, but well worth the effort. ... We have been using the "essay answer" evaluation form for enough time now for me to get an adequate feel for the process. Here is a survey of my impressions; I would like to hear yours.

1. One cannot help but notice the overall positive tone!! I would estimate that at least 80% of the professorial faculty have 80% of the evaluations strongly in favour of their teaching. One sees this in the answers to the "What to improve/change" and "recommend to another student" questions. This is not to say that the remaining 20% of the faculty had poor evaluations, but that the students saw more of a mixture of qualities and failings. In many cases they were able to articulate suggestions for improvement in a very constructive and useful way. Indeed, I recall only one faculty member whom the students predominately felt was doing a very poor job in teaching.

...

2. The information content is very high. Even when simply seeking an overall impression there is substantially more available than any amount of statistical tweaking on my part was able to extract from the normative answer form we experimented with. For example, with some teachers it clearly comes out they have a definite style -- say they are very structured and organized, but this trait might also be viewed as inflexibility. As you might expect, some students just love this and gave the instructor rave reviews. Others comment they thought the instructor was good, but did not care for the style and for this reason probably would not recommend him/her. One can easily see that this teacher would not be highly rated overall on the normative questionnaire. However, there are no negative comments of any substance.’

Although the problems with the summary sent out are, I think, serious, I also believe that the greater distribution of Rundell's article resulting from the distribution of James Yao's summary is very likely to be valuable to the entire community of scholars."
