Do High-Stakes Tests Improve Learning?
Test-based incentives, which reward or sanction schools, teachers, and students based on students’ test scores, have dominated U.S. education policy for decades. But a recent study suggests that they should be used with caution and carefully evaluated.
The United States has long performed at a middling level on international assessments of students’ math, reading, and science knowledge, trailing many other high-income countries. In their efforts to improve K-12 education, U.S. policymakers have increasingly turned to offering incentives—either to schools, to teachers, or to students themselves—to increase students’ standardized test scores.
For example, the No Child Left Behind (NCLB) law, which has governed public education for more than 10 years, sanctions schools whose students do not perform well on standardized tests. More recently, states and school districts have experimented with awarding bonuses to teachers if their students’ test scores climb. Twenty-five states target the incentives to students themselves by requiring them to pass an exit exam before receiving their diploma.
All of these policies share a fundamental principle: They reward or sanction students, teachers, or schools based on how well students score on standardized tests. Policymakers hope that holding various players in the education system accountable for how much students learn will motivate those players to improve student performance. But do test-based incentives actually drive improvements in student learning?
In an effort to answer that question, a recent study by the National Research Council took a comprehensive look at the available research on how incentives affect student learning. The study committee, composed of experts in education, economics, and psychology, examined a range of studies on the effects of many types of incentive programs. What it found was not encouraging: The incentive systems that have been carefully studied have had only small effects, and in many cases no effect, on student learning.
Measuring student learning
At best, any test can measure students’ knowledge of only a subset of the content in a particular subject area; it is also generally more difficult to design test items at higher levels of cognitive complexity. These limitations take on greater significance when incentives are tied to the test results. Research has shown that incentives can encourage teachers to “teach to the test” by narrowing their focus to the material most likely to appear on the test. As a result, their students’ scores may be artificially inflated because the score reflects their knowledge of only part of the material the students should know about the subject.
For example, if teachers move from covering the full range of material in eighth-grade mathematics to focusing only on the portion included on the test, their students’ test scores may rise even as their learning in the untested part of the subject stays the same or even declines.
In measuring how incentives affect student learning of a subject, it is important to look not at students’ scores on the high-stakes test that is tied to the incentives, but at their scores on low-stakes tests that are designed to provide a general picture of the quality of learning and do not have direct consequences for schools, teachers, or students. Because there is no incentive that would motivate teachers to narrow their instruction to the materials tested on low-stakes tests, the scores on those tests, such as the National Assessment of Educational Progress (NAEP), are less likely to be inflated and can give a more reliable picture of student learning in a subject area. In conducting its review of the research, the committee focused mainly on studies that based their assessment on low-stakes tests.
The committee also limited its evaluation to studies that allowed researchers to draw causal conclusions about the effects of test-based incentives. This means that studies had to have a comparison group of students, teachers, or schools that were not subject to incentives or rewards, and that individuals or groups could not self-select into the comparison group. In addition, the committee looked only at studies of programs that had existed long enough to supply meaningful results, which means that some programs, particularly many involving performance pay for teachers, were too new to evaluate.
Effects small, variable
The committee examined research on 15 programs with a range of configurations to assess the effects when incentives are given to schools, teachers, and students. Findings on some of these incentive programs are summarized below, and the effect sizes of all of them are shown in Figure 1.
Incentives for schools. Many state programs, as well as NCLB, reward or sanction schools based on the improvements made by their students. Under NCLB, for example, schools that do not show adequate yearly progress in improving student test scores face escalating consequences. Schools must first file improvement plans, make curriculum changes, and offer students school choice or tutoring; if progress is not shown, they are required to restructure in various ways. Some programs tie incentives to test score gains among students at all scoring levels, whereas others tie the incentive to the number of students who move from nonproficient to proficient levels in a subject area.
To understand how these types of incentives affect student learning, the committee looked at a synthesis of 14 studies of state-level incentive programs for schools before NCLB, as well as two studies on the impact of NCLB itself. Across subjects and grade levels, the research indicates an effect size of about 0.08 on student learning—equivalent to raising a student’s performance from the 50th to the 53rd percentile—when evaluated using the NAEP, a low-stakes test. The positive effect was strongest for fourth-grade mathematics.
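The percentile equivalence quoted above follows from a simple conversion, sketched here in Python. The sketch assumes that test scores are roughly normally distributed, an assumption made for illustration rather than one taken from the report. The function name percentile_after_gain is hypothetical; it shifts a student's starting position by the effect size, measured in standard deviations, and reads off the new percentile.

```python
from statistics import NormalDist


def percentile_after_gain(effect_size, start_percentile=50.0):
    """Return the percentile reached after a gain of `effect_size` standard deviations.

    Assumes scores follow a normal distribution (an illustrative assumption).
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Convert the starting percentile to a z-score, add the gain, convert back.
    start_z = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(start_z + effect_size)


if __name__ == "__main__":
    # 0.08 is the school-incentive effect size cited in the text;
    # 1.0 is included only as a reference point for a full standard deviation.
    for effect_size in (0.08, 1.0):
        new_percentile = percentile_after_gain(effect_size)
        print(f"effect size {effect_size:.2f}: 50th -> {new_percentile:.0f}th percentile")
```

Under that assumption, an effect size of 0.08 moves a median student to roughly the 53rd percentile, while a gain of one full standard deviation would move the same student to roughly the 84th.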
Incentives for teachers. Many programs in this category are simply too new to be meaningfully evaluated, but those that have been assessed reveal effects even smaller than those for school-based incentives. One program that has existed long enough to be evaluated is a Nashville-based program that offered teachers bonuses of $5,000 to $15,000 for improvements in their students’ test scores. The proportion of participating teachers who received a bonus increased from one-third in the first year to one-half in the third year. However, over three years and four grades, researchers found an average effect size of 0.04 standard deviations on the high-stakes test, which was not statistically significant.
Another initiative, the Teacher Advancement Program, is a nationwide, foundation-developed program that offers teachers bonuses of up to $12,000 based on students’ test score gains; it also offers professional development to teachers. As of 2007, the program had been implemented in more than 180 U.S. schools. One evaluation of the program found no statistically significant effect on student test scores as measured by the high-stakes tests themselves. Another evaluation that looked at student math scores found that TAP schools increased test score gains on a low-stakes test by one to two points in grades 2-5, a statistically significant gain. In grades 6-10 the changes were either statistically insignificant or showed decreases of one to three points. Across grades, the average effect was 0.01 standard deviations.
One program did find significant positive results. A Texas program offered high-school teachers bonuses of $500 to $1,000 for each of their students who scored a 3 or higher (out of 5) on an Advanced Placement (AP) exam; students also received smaller cash bonuses for a score of 3 or higher. The program included teacher training and a curriculum for earlier grades to prepare students to take AP classes. In schools that implemented the program, the number of students who scored at least 1,100 on the SAT or 24 on the ACT (in this context, the low-stakes tests) increased by two percentage points in the first year of the program and by one point each in the second and third years. By year three, enrollment in AP programs increased by 34%, and the number of students attending college increased by 5.3%.
Incentives for students. In an incentive experiment carried out over two years in New York City, fourth and seventh graders were offered cash rewards (up to $25 for the fourth graders and $50 for the seventh graders) based on scores on math and reading tests. Evaluators found that across eight combinations of subject and grade level, the average effect size on student learning was 0.01 when measured by New York state tests in reading and mathematics, which were low-stakes tests in this context because they carried no cash rewards.
The most widely used test-based incentives targeted at students are high-school exit exams, which are now required of almost two-thirds of public high-school students. Although the exams and subjects covered vary from state to state, students typically must pass tests in multiple subjects before they are awarded a high school diploma. The committee found that exit exams decrease the rate of high-school graduation by about two percentage points but do not increase student learning when measured by NAEP scores.
Incentive programs in other countries. The committee also examined six studies of incentive programs in India, Israel, and Kenya and found effects on achievement ranging from 0.01 to 0.19 standard deviations. However, most of the studies measured student achievement using the high-stakes tests attached to the incentives. Moreover, the India and Kenya programs were in developing countries where the educational context, which included high rates of teacher absenteeism and high student dropout rates in middle school, differed markedly from developed nations, making the studies’ lessons for the United States unclear.
Looking across all of the combinations of incentives, the committee found that when evaluated using low-stakes tests, incentives’ overall effects on achievement tend to be small and are effectively zero for a number of programs. Even when evaluated using the high-stakes tests attached to the incentives, a number of programs show only small effects.
The largest effects resulted from incentives applied to schools, such as those used in NCLB. Even here, however, the effect size of 0.08 is the equivalent of moving a student performing at the 50th percentile to the 53rd percentile. Raising student achievement in the United States to reach the level of the highest-performing nations would require a gain equivalent to moving a student at the 50th percentile to the 84th percentile. Unfortunately, no intervention has been demonstrated to produce an increase that dramatic. The improvement generated by school-based incentives is no less than that shown by other successful educational interventions.
However, although some types of incentives perform as well as other interventions, given the immense amount of policy emphasis that incentives have received during the past three decades, the amount of improvement they have produced so far is strikingly small. The study committee concluded that despite using incentives in various forms for 30 years, policymakers and educational administrators still do not know how to use them to consistently generate positive effects on student achievement and drive improvements in education.
Should policymakers give up on test-based incentives? Although the study’s findings do not necessarily mean that it is impossible to use incentives successfully, the small benefits they have produced so far suggest that they should be used with caution and carefully evaluated for effectiveness when they are used.
The study committee recommends a path of careful experimentation with new uses of incentives, combined with a more balanced approach to educational interventions. Evidence does not support staking so much of our hope for educational improvement on this single method; rather, it suggests that we should be moving some of our eggs out of the incentives basket and into other complementary efforts to improve education.
To those ends, the committee’s report urges that policymakers and educators take the following steps:
Experiment with using test-based incentives in more sophisticated ways, as one part of a richer accountability and improvement system. For example, some have proposed using school-based incentives with broader performance measures, such as graduation rates or measures of students’ project-based work. Others have proposed using test results as a “trigger” mechanism for fuller evaluations of schools. Under such a system, teachers or schools with low test scores would not automatically be sanctioned. Instead, the test results would identify schools that may need a review of their organizational and instructional practices and more intensive support for teachers.
Design tests in ways that discourage narrow teaching. The design of any incentive-based system should start with a description of the most valued educational goals, and the tests and indicators used should reflect those goals. However, the precise content and format should not remain the same over the years; a test that asks very similar questions in the same formats from year to year will become predictable and encourage teaching to the test. Even if the questions were initially an excellent gauge of student performance, over time the test scores are likely to become distorted as a result. To reduce the inclination to teach to the test, the tests should be designed to sample subject matter broadly and include continually changing content and item formats. Individual test items should be reused only rarely and unpredictably.
Carefully evaluate the effectiveness of any new incentive program pursued. Incentives’ effectiveness may depend on the particular features of the program: whether it is schools, teachers, or students who are offered incentives, for example, and which tests or performance measures are used. These features should be carefully documented so that their effects can be considered when the program is assessed.
Consider using test scores in an informational role rather than attaching explicit rewards or sanctions. The policy discussion during the past decade has rested on the assumption that using tests in this way (to give educators and the public information about performance without explicit consequences) is not enough to produce change. But psychological research suggests that in some situations informational uses may be more effective at motivating students and educators.
Balance further experimentation with incentives with complementary efforts to improve other parts of the educational system. In continuing to explore options with test-based incentives, policymakers should keep in mind the costs of doing so. During the past two decades, substantial attention and resources have been devoted to using incentives in an attempt to strengthen education, an experiment that was worthwhile because it seemed to offer a promising route to improvement. Further investment still seems to be worthwhile because there are more sophisticated proposals for using test-based incentives that offer hope for improvement and deserve to be tried. But the available evidence on incentives’ effects so far does not justify a single-minded focus on them as a primary tool of education policy without a complementary focus on other aspects of the education system.
As policymakers continue to explore incentive approaches that have not been tried, they should avoid draining resources and attention away from other aspects of the education system, such as efforts to improve standards, curricula, instructional methods, and teachers’ skills. Without these complementary efforts, no incentives are likely to work.
Michael Hout is the Natalie Cohen Chair of Sociology and Demography at the Berkeley Population Center, University of California, Berkeley. He chaired the study committee that produced the report from which this article is drawn, Incentives and Test-Based Accountability in Education, available from National Academies Press. Stuart Elliott directed the study and serves as director of the National Research Council’s Board on Testing and Assessment. Sara Frueh is a writer for the National Research Council.