Do student test scores provide solid basis to evaluate teachers?

Edward Haertel, emeritus professor of education, presents his findings on value-added models at an Oct. 21 lecture at Stanford.

October 18, 2013

David Plotnikoff

Many states and school districts should rethink their heavy reliance on student test score gains to evaluate teachers, recommends Edward Haertel, emeritus professor in the Stanford Graduate School of Education.

Haertel, one of the nation's leading experts on educational assessment and accountability systems, believes the so-called "value-added models" (VAMs) must be used with considerable caution.

Haertel will discuss the findings from his analysis, "Reliability and Validity of Inferences About Teachers Based on Student Test Scores," at an Oct. 21 brown bag session in the CERAS 101 lecture hall from 12 p.m. to 1:30 p.m. A webcast of the session will also be available.

Haertel's report comes at a time when the use of this method is rising as states add teacher accountability systems in pursuit of federal "Race to the Top" funding.

He says teacher effectiveness scores derived from student test results are so unreliable that they should never be used as the primary factor in making high-stakes individual personnel decisions such as merit pay or teacher ranking.

Haertel presented his findings earlier this year in the Educational Testing Service's annual William Angoff Memorial Lecture. A video of the talk and an edited transcript were recently posted online.

In a recent interview, Haertel noted that the VAMs are so complex and arcane and so removed from classroom practice that they provide no useful feedback for teachers.

"It's as if we've thrown up our hands and said effective teaching is this mystical, magical thing you can't explain," he said. "All we can do is look for it statistically and credit people when we find it. And fire people when we don't find it."

When used to rank teacher performance, some statistical models have yielded wildly unpredictable results from year to year, leading teachers to wonder if their rankings are random or meaningless.

"An evaluation is only useful if it's perceived as being credible," Haertel said. "If you have a system that's not trusted it's going to demoralize people. And it's going to lead potentially strong teachers to not consider teaching as a career."

He noted that ranking teachers based on a single year's student test scores is "sorting mostly on noise," and that even three consecutive years of metrics yielded a reliability coefficient of only 0.56, "much lower than we would demand for making other kinds of consequential decisions."

Loss of faith in the assessment process is not the only unintended consequence of VAMs.

Haertel says toxic byproducts of these systems may include an overemphasis on test prep at the expense of more valuable kinds of learning; intensified teacher competition and a decrease in cooperation; and any number of ways to game the system.

"The mechanisms for assigning students to teachers are really complex and not well understood," he said. "An especially pernicious kind of gaming has to do with teachers seeking assignment of children who are likely to score well. The high stakes testing may cause the teachers to resent the students who can't score well. And that's so profoundly to the detriment of the students most in need. We really need to worry about that."

Any functional VAM, Haertel writes in the report, relies on simplifying assumptions, including that the composition of a class will not affect the scores of an individual student. Yet peer effects -- the influence of classmates -- are a significant factor, one that statistical models cannot fully parse.

A classroom full of gifted, hard-charging, competitive students will create an atmosphere primed for high achievement. Conversely, a room with unmotivated, disruptive or disengaged students will produce just the opposite.

The result, Haertel writes, is that "VAMs will not simply reward or penalize teachers according to how well or how poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in."

Outside factors

When it comes to student achievement, teachers are important. And school resources and instructional support are important. But their influence pales in comparison to that of factors outside the school.

Haertel cites research estimating that out-of-school factors account for 60 percent of the variance in student scores, while teachers account for around 9 percent.

Haertel said that while the proponents of VAM systems have made efforts to refine the formulas to account for student demographics and out-of-school variables such as family income and parents' education, "we just don't have any way of knowing how good a job we're doing with those adjustments. You don't know how much bias remains."

Attempts to adjust for systematic bias are never perfect, and the more variation exists across both the schools where teachers are working and the children they're working with, the poorer these statistical adjustments are likely to be, Haertel said.

He believes the best approach is to ensure that comparisons are done within homogeneous groups. Comparing teachers within a single school or district is more accurate than applying the same formula to data from every student in a state, for example. This is one way to ameliorate the distorting effects of social stratification.

However, making comparisons only within single schools or districts to account for social stratification comes with its own tradeoff: it fails to account for the overall, average differences in teacher effectiveness from one school or district to another.

The VAMs are, in principle, one component of a larger system for teacher evaluation. In practice, however, the other pieces -- such as classroom observation and principal evaluations -- may receive little weight.

"If there's no variation on the principal's evaluation of the teachers, then effectively all of the weight, all the information for rank ordering and differentiating among teachers is coming from that value-added component," said Haertel. "There are places where if a principal gives a teacher an evaluation that is out of line with the VAM score the principal then has to write a justification for having done that. The VAMs trump other measures."
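Haertel's point about weighting can be illustrated with a small sketch. The teacher names, scores, and 50/50 weights below are hypothetical, chosen only for illustration, and are not drawn from his report. When every teacher receives the same principal rating, the composite ranking simply reproduces the VAM ranking, whatever nominal weight the principal's evaluation carries.

```python
def composite_scores(principal, vam, w_principal=0.5, w_vam=0.5):
    """Weighted sum of the two evaluation components for each teacher."""
    return {t: w_principal * principal[t] + w_vam * vam[t] for t in principal}


def ranking(scores):
    """Teacher names ordered from highest score to lowest."""
    return sorted(scores, key=lambda t: scores[t], reverse=True)


# Hypothetical data: the principal gives every teacher the same rating,
# so that component carries no information for telling teachers apart.
principal = {"A": 3.0, "B": 3.0, "C": 3.0, "D": 3.0}
vam = {"A": 0.2, "B": -0.5, "C": 1.1, "D": -0.1}

composite_ranking = ranking(composite_scores(principal, vam))
vam_ranking = ranking(vam)

# The two orderings are identical: the VAM component alone decides the ranking.
print(composite_ranking)
print(composite_ranking == vam_ranking)
```

Changing `w_principal` to any other value leaves the ordering unchanged, because a constant added to every teacher's score cannot reorder them.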

Given the damning case he presents, Haertel's conclusions are measured and prescriptive: teacher VAM scores should not be a deciding factor in personnel decisions. There should be no fixed weight assigned to the scores, he says, and principals and teachers should have the option to ignore the scores entirely if they have good cause to suspect they are invalid.

The bottom line for Haertel is the sobering realization that no amount of statistical processing is going to achieve alchemy.

"This is a data problem, not an analysis problem," he said. "The information just isn't there in the available test data to do a really good job of distinguishing between strong and weak teachers. The sources of bias and sources of distortion are just too large."

David Plotnikoff writes frequently for the Graduate School of Education.