SUMMARY NOTES ON "ON THE USE OF NUMERICALLY SCORED STUDENT
EVALUATIONS OF FACULTY" BY WILLIAM RUNDELL, Report of the Department
of Mathematics, Texas A&M University, 1997, 15 Pages
(http://www.math.tamu.edu/~william.rundell/teaching_evaluations/index.html)
Summarized by J. T. P. Yao (7/17/99)
"Up until two years ago the Department of Mathematics
at Texas A&M University relied on a student evaluation of instructor
form that was heavily weighted towards verbal responses.
For a trial period a decision was made to utilize questionnaires
that were electronically scanned and the output easily condensed
to a few numbers that were (hopefully) indicative of the teaching
performance of the instructor, albeit from the students' point
of view. From an administrative viewpoint, there is an obvious lure
to measuring teaching effectiveness by reduction to
a single number. Most often this number is the average of
all the individual components.
Reading 10,000 student questionnaires
each semester for 115 faculty and putting this into any kind of
context is virtually impossible. Even for the relatively few cases
where in-depth information must be acquired, it is often difficult
for evaluators to use solely written comments and yet put them in
a comparative perspective.
Since few departments have managed
to similarly quantify other possible measures of evaluation, the
single number stands out and eventually becomes the measure of teaching
effectiveness."
"The purpose of this article is to examine some of the issues
surrounding this method of analyzing the data; to come to some understanding
of the information obtained and to determine whether the distillation
down to a few numbers is a valid tool for the effective evaluation
of teaching. In short, we want to see if what seems to be an increasingly
common system of evaluation contains sufficient information of reliable
nature as to be useful as a means of determining raises, promotion
and tenure."
"The main part of the evaluation form consisted of ten questions
with choices ranging from 1 (low) to 5 (high). There was also a
space for comments.
The first five questions on the form
were the ones suggested jointly by the Student and Faculty Senates.
The other five questions were added by a department committee and
were meant to complement the other five.
Typically, these
evaluations were filled out in the last week of class, but prior
to the final examination. The questionnaires were processed by a
central measurement and testing service. Raw output consisted of
n words, each of length ten and consisting of an alphabet of the
characters 1, 2, 3, 4 and 5. Here n denotes the total number of
responses.
Also available were the answer to an eleventh question,
the student's expected grade, and the total number enrolled
in the section, N.
The response rate n/N varied markedly.
The mean response rate over all sections was 67.5% with a standard
deviation of 16.3%."
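[Editorial sketch, not part of Rundell's report: the raw output described above (n words of length ten over the characters 1 to 5) can be reduced to per-question means, a single overall number, and a response rate n/N roughly as follows. The function names and the sample words are hypothetical, and averaging the ten per-question means is an assumption based on the remark that the single number is "most often" the average of the individual components.]

```python
from statistics import mean


def parse_responses(words):
    """Convert raw response words such as '4535245345' into lists of ten scores."""
    rows = []
    for word in words:
        # Each word must have exactly ten characters drawn from 1..5.
        if len(word) != 10 or any(ch not in "12345" for ch in word):
            raise ValueError(f"malformed response word: {word!r}")
        rows.append([int(ch) for ch in word])
    return rows


def summarize_section(words, enrolled):
    """Summarize one section: n, response rate n/N, per-question means, single number."""
    rows = parse_responses(words)
    n = len(rows)
    per_question = [mean(row[q] for row in rows) for q in range(10)]
    return {
        "n": n,
        "response_rate": n / enrolled,          # n/N in the notation of the report
        "per_question": per_question,
        "single_number": mean(per_question),    # assumed: average of the ten components
    }


# Hypothetical section: three responses from four enrolled students.
sample_words = ["5555555555", "3333333333", "4444444444"]
summary = summarize_section(sample_words, enrolled=4)
print(summary["response_rate"], summary["single_number"])
```

[Note that the single number discards both the spread of answers within each question and the identity of the students who did not respond, which is the loss of information the report goes on to examine.]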
"At Texas A&M we teach very little precalculus mathematics
or material normally considered to be part of the high school curriculum.
These courses account for about 6% of our enrollment.
Average
total enrollment in these four courses is 5,200 students in the
fall semester and 4,500 in the spring.
The average class
size is 100. For the purposes of this study we will designate the
above classes as constituting Group A."
"There is a three-semester calculus sequence for Mathematics
and Science majors and a parallel, virtually identical in content,
one for engineering students. Average enrollment in these sequences
is 3,100 students in the fall and 2,600 in the spring. Class size
varies from about 30 in the mathematics/science sequence to a little
under a hundred students in the engineering sequence although the
latter has recitation sections of average class size 30. We will
designate these courses as Group B."
"Almost all of the students in Group B take a fourth semester
class in differential equations and most of them take at least one
further course in mathematics at the junior level.
The differential
equations course has an average enrollment of 700 per semester and
the other junior level courses have another 400. Class size is between
40 and 50 students. We will refer to these courses as being in Group
C."
"We also teach a variety of courses at the junior and senior
level primarily designated for mathematics majors; these will be
denoted by Group D."
"The data for this paper was accumulated over the four
semesters the form was in use."
"Mean Evaluation Norm for Professorial Rank Faculty
Group       Course Mean
A           3.35
B           3.81
C           3.91
D           4.27
Graduate    4.47
There are some flaws in interpreting the vast differences
in mean scores in this table as being totally due to differences
in the students. Both the undergraduate and graduate program committees
successfully lobby for certain instructors to teach the courses
for our majors so that the professors involved do not constitute
a random sample."
"An intelligent use of data would surely take this into account.
Would it also need to allow for the following?
- "On semester" classes tend to have better students
and higher grades than those in the "off semester."
- The time of the day the class was taken.
Does that
suggest that a different type of student, in terms of how he or she evaluates,
is forced eventually to take a less popular time period? The average
grade point ratio given out in 8am sections is consistently lower
than other time periods, especially when compared to midmorning
sections.
- The percentage of students responding to the survey.
There is evidence that the demographics of those responding to
the survey are not representative of the sample.
- The percentage of Q-drops varies significantly between sections
and this is true at all levels of course.
Did instructor A, who received a mean rating of 4.3 but had
20% of his students Q-drop, achieve a better customer satisfaction
norm than instructor B, who had no Q-drops but a mean rating of 3.9?
Professor A's mean rating puts him into the "well above" category,
while B's places her at slightly below the department average.
Would you expect a correlation between the Q-drop rates and
mean evaluation?
"
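[Editorial sketch, not part of Rundell's report: one way to make the Q-drop question concrete is to ask what instructor A's mean would be if the students who dropped were counted at some imputed rating. This assumes that the 20% is a fraction of the original enrollment, that the remaining students account for the observed mean of 4.3, and that a single imputed rating can stand in for every dropped student. All three are assumptions for illustration only, and the function names are hypothetical.]

```python
def mean_counting_drops(observed_mean, drop_fraction, imputed_rating):
    """Blend the observed mean with an imputed rating for students who Q-dropped."""
    if not 0 <= drop_fraction < 1:
        raise ValueError("drop_fraction must be in [0, 1)")
    return (1 - drop_fraction) * observed_mean + drop_fraction * imputed_rating


def break_even_rating(mean_a, drop_fraction, mean_b):
    """Imputed rating for A's dropped students at which A's adjusted mean equals B's mean."""
    if not 0 < drop_fraction < 1:
        raise ValueError("drop_fraction must be in (0, 1)")
    return (mean_b - (1 - drop_fraction) * mean_a) / drop_fraction


# Figures from the example in the text: A has 4.3 with 20% Q-drops, B has 3.9 with none.
adjusted_a = mean_counting_drops(4.3, 0.20, imputed_rating=1)  # hypothetical worst case
threshold = break_even_rating(4.3, 0.20, 3.9)
print(round(adjusted_a, 2), round(threshold, 2))
```

[Under these assumptions, A's adjusted mean falls below B's whenever the dropped students would have rated A lower than the break-even value, so the ranking of the two instructors depends on information the form never collects.]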
"Folklore in the Department, and indeed amongst mathematics
faculty nationwide, has long held that there is a direct correlation
between student evaluations and grades, despite an extensive claim
in the Education literature to the contrary.
Care must be
taken in selecting the courses and choosing the scales. We avoided
courses that had strong coordination between sections, such as common
exam format since these, at least in theory, should have no instructor-dependent
variation in grades.
On the other hand we wished to choose
courses with a large number of sections in order to get statistically
meaningful results.
"
[For detailed results and mathematical analyses of results, please
read the original paper.]
"The correlations are too high to accept the hypothesis that
grades and evaluations are unrelated."
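[Editorial sketch, not part of Rundell's report: the kind of relationship being tested can be illustrated with a plain Pearson correlation between each section's mean grade and its mean evaluation score. The report's actual statistical procedure is not reproduced in these notes, so the choice of Pearson's r is an assumption, and the section figures below are hypothetical placeholders rather than data from the study.]

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-section values (one entry per section), for illustration only.
section_mean_grade = [2.4, 2.8, 3.0, 3.3, 2.6]
section_mean_evaluation = [3.2, 3.6, 3.9, 4.3, 3.1]

r = pearson_r(section_mean_grade, section_mean_evaluation)
print(f"r = {r:.2f}")
```

[A value of r near zero would be consistent with grades and evaluations being unrelated; the quoted conclusion is that the observed correlations were too high for that.]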
"It should be noted here that the number of D and F grades
claimed by the students as their expected outcomes are considerably
lower than actually given out for the class as a whole. The other
grade values, while more in line with actuality, are over-inflated.
"
"Conclusions
We entered into the process of standardized evaluations with an
open mind and were hoping, as many have in the past, for a silver
bullet that would allow us to deal with the problem of evaluating
teaching in an objective manner. If this could also be combined
with a reduction in the workload of such a task then this was an
added bonus.
There is much information that can be gained from the numerically-based
responses and there is clearly a signal hidden in a background of
more single-valued information. How to filter this background is
much less clear. How to modify the responses in light of other information
about the course is even less clear.
However, the analysis we have performed on the data suggests that
the distillation of evaluations to a single number without taking
into account the many other factors can be seriously misleading.
The correlation between positive student evaluations and grades
awarded is sufficiently strong to indicate that a procedure based
on numerical scores such as we have described is surely going to
lead to grade inflation in the long run.
While the idea of tracking students' progress through a sequence
of courses is an attractive means of evaluating faculty performance,
only a relatively small number of our enrollments in a given semester
is in a chain of courses sufficiently structured for data to be
collected. For various reasons some faculty do not teach these courses
at all. The negative correlation that our study seems to indicate
between the two measures of "carryon success" and what
we have discovered in this article is best described as "short
term customer satisfaction," is very disturbing. If this is
indeed the situation, then the use of student evaluations as a primary
measure of teaching effectiveness, simply because it is easily normable,
is a very questionable practice."
Follow-up of Summary Notes: "On the Use of Numerically
Scored Student Evaluations of Faculty" by William Rundell,
Report of the Department of Mathematics, Texas A&M University,
1997, 15 Pages
The contents of the current "teaching evaluation" form
of the Department of Mathematics are given as follows:
"To the student: Please take a few minutes to make
some constructive comments to the following questions below. These
evaluations will be read by your instructor and the departmental
administration AFTER grades have been submitted. These evaluations
are considered, along with other criteria, such as classroom visitation
and curriculum materials, when making decisions on promotion and
merit raises.
- Likely grade in this class: ________. Major:_____________
- Please comment about the instructor: What did (s)he do that
you would want to remain the same? What did (s)he do that you
would change?
- Please comment about the course: What would you keep the same?
What would you change?"
In addition, Professor Arthur Hobbs (Mathematics, TAMU) kindly
offered the following corrections:
"There are some problems
partly arising from the date
of the article summarized. Prof. Yao's summary was in 1999 of a
1997 paper. Consequently, when he quoted from Rundell's paper "up
until 2 years ago the Department of Mathematics at Texas A&M
University relied on a student evaluation of instructor form that
was heavily weighted towards verbal responses," most people
who read the summary probably interpreted that as "up until
1997." In fact, until 1995, the Texas A&M University Mathematics
Department used a self-developed form which was heavily weighted
toward verbal responses but which did include a numerical component.
We switched to a form similar to the University-provided form in
1995 for a two-year experiment, which ended in 1997 with Rundell's
report. In 1997, as a result of our findings, we switched to a form
that excluded numerical data altogether.
Thus the further quote "Reading 10,000 student questionnaires
each semester for 115 faculty and putting this into any kind of
context is virtually impossible. Even for the relatively few cases
where in-depth information must be acquired, it is often difficult
for evaluators to use solely written comments and yet put them in
a comparative perspective." from Rundell's report is accurate
but misleading because it raises this sentence to an unjustified
prominence. Although Rundell was conceding in his report that administrators
would prefer not to read so many evaluations, in fact that is exactly
what is now being done in this department.
Moreover, we are very happy with the result. Here are some comments
Rundell made in a follow-up memorandum to the faculty in January
of 1998:
"I have just completed reading the student evaluations of
faculty for the fall semester. ... It is a daunting task taking
many, many hours, but well worth the effort. ... We have been using
the "essay answer" evaluation form for enough time now
for me to get an adequate feel for the process. Here is a survey
of my impressions; I would like to hear yours.
"1. One cannot help but notice the overall positive tone!!
I would estimate that at least 80% of the professorial faculty have
80% of the evaluations strongly in favour of their teaching. One
sees this in the answers to the "What to improve/change"
and "recommend to another student" questions. This is
not to say that the remaining 20% of the faculty had poor evaluations,
but that the students saw more a mixture of qualities and failings.
In many cases they were able to articulate suggestions for improvement
in a very constructive and useful way. Indeed, I recall only one
faculty member whom the students predominately felt was doing a
very poor job in teaching.
...
"2. The information content is very high. Even when simply
seeking an overall impression there is substantially more available
than any amount of statistical tweaking on my part was able to extract
from the normative answer form we experimented with. For example,
with some teachers it clearly comes out they have a definite style
-- say they are very structured and organized, but this trait might
also be viewed as inflexibility. As you might expect, some students
just love this and gave the instructor rave reviews. Others comment
they thought the instructor was good, but did not care for the style
and for this reason probably would not recommend him/her. One can
easily see that this teacher would not be highly rated overall on
the normative questionnaire. However, there are no negative comments
of any substance."
Although the problems with the summary sent out are, I think, serious,
I also believe that the greater distribution of Rundell's article
resulting from the distribution of James Yao's summary is very likely
to be valuable to the entire community of scholars."