The Hazards of High-Stakes Testing
Hyped by many as the key to improving the quality of education, testing can do more harm than good if the limitations of tests are not understood.
With the nation about to embark on an ambitious program of high-stakes testing of every public school student, we should review our experience with similar testing efforts over the past few decades so that we can benefit from the lessons learned and apply them to the coming generation of tests. The first time that there was a large-scale commitment to accountability for results in return for government financial assistance was in the 1960s, with the beginning of the Title I program of federal aid to schools with low-income students. The fear then was that minority students, who had long been neglected in the schools, would also be shortchanged in this program. The tests were meant to ensure that the poor and minority students were receiving measurable benefits from the program. Since that time, large-scale survey tests have continued to be used, providing us with a good source of data for determining program effects and trends in educational achievement.
Critics of testing often argue that test scores can provide an inaccurate measure of student progress and that the growing importance of the tests has led teachers to distort the curriculum by "teaching to the test." In trying to evaluate these claims, we need to look at the types of data that are available and their reliability: in other words, what we know and how we know it. For example, when people claim that there is curriculum distortion, they are often relying on surveys of teachers' perceptions. These data are useful but are not the best form of evidence if policymakers believe that teachers are resisting efforts to hold them accountable. More compelling evidence about the effects of testing on teaching can be obtained by looking directly for independent confirmation of student achievement under conditions of high-stakes accountability. Early studies revealed very quickly that the use of low-level tests produced low-level outcomes. When students were evaluated only on simple skills, teachers did not devote time to helping them develop higher-order thinking skills. This was confirmed in the well-known A Nation at Risk report in the early 1980s and about a decade later in a report from the congressional Office of Technology Assessment.
In 1991, I worked with several colleagues on a validity study to investigate more specifically whether increases in test scores reflected real improvements in student achievement. In a large urban school system in a state with high-stakes accountability, random subsamples of students were given independent tests to see whether they could perform as well as they had on the familiar standardized test. The alternative, independent tests included a parallel form of the commercial standardized test used for high-stakes purposes, a different standardized test that had been used by the district in the past, and a new test that had been constructed objective-by-objective to match the content of the high-stakes test but using different formats for the questions. In addition to content matching, the new test was statistically equated to the high-stakes standardized test, using students in Colorado where both tests were equally unfamiliar. When student scores on independent tests were compared to results on the high-stakes accountability test, there was an 8-month drop in mathematics on the alternative standardized test and a 7-month drop on the specially constructed test. In reading, there was a 3-month drop on both the alternative standardized test and the specially constructed test. Our conclusion was that "performance on a conventional high-stakes test does not generalize well to other tests for which students have not been specifically prepared."
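The study above mentions that the new test was "statistically equated" to the high-stakes test using a sample for whom both tests were unfamiliar. One common way to do this is equipercentile equating: a score on one test is mapped to the score on the other test that has the same percentile rank in the norming sample. The sketch below is illustrative only; the data are simulated and the function is a simplified stand-in for the smoothed equating procedures used in practice.

```python
import numpy as np

def equipercentile_equate(scores_a, scores_b, score_on_a):
    """Map a score on test A to the test-B score with the same
    percentile rank, using a norming sample that took both tests."""
    # Percentile rank of the score within the test-A distribution
    pr = np.mean(np.asarray(scores_a) <= score_on_a)
    # Test-B score at that same percentile
    return float(np.quantile(scores_b, pr))

# Hypothetical norming sample (e.g., students for whom both tests are
# equally unfamiliar, as with the Colorado sample described above)
rng = np.random.default_rng(0)
a = rng.normal(50, 10, 1000)   # scores on the new, specially constructed test
b = rng.normal(60, 12, 1000)   # scores on the high-stakes standardized test

# A middling score on test A is translated onto test B's scale
print(equipercentile_equate(a, b, float(np.median(a))))
```

Once the two score scales are linked this way, a drop on the independent test relative to the equated high-stakes score can be read as evidence that the high-stakes gains do not generalize.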
At the same time that researchers addressed the validity of test score gains, other studies examined the effect of high-stakes accountability pressure on curriculum and instructional practices. These studies, which involved large-scale teacher surveys and in-depth field studies, show that efforts to improve test scores have changed what is taught and how it is taught. In elementary schools, for example, teachers eliminate or greatly reduce time spent on social studies and science in order to spend more time on tested subjects.
More significantly, however, because it affects how well students will eventually understand the material, teaching in tested subjects (reading, math, and language arts) is also redesigned to closely resemble test formats. For example, early in the basic-skills accountability movement, Linda Darling-Hammond and Arthur Wise found that teachers stopped giving essay tests as part of regular instruction so that classroom quizzes would more closely parallel the format of standardized tests given at the end of the year. In a yearlong ethnographic study, Mary Lee Smith found that teachers gave up reading real books, writing, and long-term projects, and focused instead on word recognition, recognition of spelling errors, language usage, punctuation, and arithmetic operations. Linda McNeil found that the best teachers practiced "double-entry bookkeeping," teaching students both what they needed for the test and the real knowledge aimed at conceptual understanding. In other cases, test preparation dominated instruction from September until March. Only after the high-stakes test was administered did teachers engage the real curriculum, such as Shakespeare in eighth-grade English. These forms of curriculum distortion engendered by efforts to improve test scores are strongly associated with socioeconomic level. The poorer the school and school district, the more time devoted to instruction that resembles the test.
I believe that policymakers would benefit from seeing concrete examples of what students can and cannot do when regular teaching closely imitates the test. One high-stakes test for third graders included a math item showing three ice cream cones. The directions said to "circle one-third of the ice cream cones." Correspondingly, the district practice materials included an item in which students were to circle one-third of three umbrellas. But what we have learned from research is that many students who have practiced the item only in this form cannot necessarily circle two-thirds of three ice cream cones, and most certainly cannot circle two-thirds of nine Popsicle sticks.
Other systematic studies show dramatically what students don't know when they learn only the test. In a randomized experiment conducted by Marilyn Koczor, students were trained exclusively to translate either Roman to Arabic numerals or Arabic to Roman. Then random halves of each group were tested on their knowledge using either the same order as their original training or the reverse order. Students who were tested in the reverse order from how they had practiced were worse off by 35 to 50 percentile points, suggesting that the high test performance of those tested in the same order as they practiced does not necessarily reflect deep or flexible conceptual understanding.
We also have to be careful in listening to discussions of alignment between the curriculum and the test. It is not enough that each item in the test correspond to some standard in the curriculum. To be useful, the test items must cover a wide array of standards throughout the curriculum. Many teachers will teach to the test. That's a problem if the test is narrowly structured. If the test covers the full domain of the curriculum, then there is no great harm in teaching to the test's content. But there still can be a problem if students are trained to answer questions only in multiple-choice format. They need to be able to write and reason using the material.
The setting of performance standards, which is usually done out of sight of the public, can have a powerful effect on how the results are perceived. Texas made two interesting choices in setting its standards. It wisely made the effort to coordinate the standards across grades. For example, in setting the 10th-grade math standard, it also considered where to set the standard for earlier grades that would be necessary to keep a student on schedule to reach the 10th-grade standard. Although policymakers set the standard by saying they wanted students to know 70 percent of the basic-skills test items, this turned out to be the 25th percentile of Texas students. Selecting a low performance standard was wise politically, because it made it possible to show quick results by moving large numbers of students above this standard.
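The Texas example above involves translating a raw cut score ("know 70 percent of the items") into a percentile rank in the student population. As a rough illustration of how such a translation works, the simulation below places a 70-percent cut on an invented score distribution; all numbers here are hypothetical and are not the actual Texas data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical distribution of scores on a 60-item basic-skills test,
# simulated as roughly normal and truncated to the 0-60 score range
scores = np.clip(rng.normal(48, 8, 10_000), 0, 60)

# A passing standard of "70 percent of the items" is a raw cut score of 42
cut = 0.70 * 60

# The percentile rank of that cut score is the share of students below it
pct_below = np.mean(scores < cut) * 100
print(round(pct_below, 1))
```

The point of the exercise is that the same verbal standard ("70 percent correct") can land at very different percentiles depending on how hard the items are, which is why the political meaning of a standard cannot be judged from the percent-correct figure alone.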
My state of Colorado made the educationally admirable but politically risky decision to set extremely high standards (as high as the 90th percentile of national performance in some areas) that only a pole-vaulter could reach. The problem is that it's hard to even imagine what schools could do that would make it possible to raise large numbers of students to this high level of performance. Unless the public reads the footnotes, it will be hard for it to interpret the test results accurately.
These political vicissitudes explain why psychometricians are so insistent on preserving the integrity of the National Assessment of Educational Progress (NAEP), which is given to a sample of students across the country and which teachers have no incentive to teach to, because the results have no direct high-stakes consequences for them or their students. The test's only purpose is to provide us with an accurate comparative picture of what students are learning throughout the country.
If states so choose, they can design tests that will produce results that convey an inflated sense of student and school progress. There may also be real gains, but they will be hard to identify in the inflated data. NAEP is one assessment mechanism that can be used to gauge real progress. NAEP results for Texas indicate that the state is making real educational progress, albeit not at the rate reflected in the state's own test. Texas is introducing a new test and more rigorous standards. Let's hope that it provides a more realistic picture.
There are signs that Congress understands the possibility that test data can be corrupted or can have a corrupting influence on education. Even more important, it has been willing to fund scientific research studies to investigate the seriousness of these problems. In 1990, Congress created the NAEP Trial State Assessment and concurrently authorized an independent evaluation to determine whether state assessments should become a regular part of the national assessment program. More recently, Congress commissioned studies by the National Research Council to examine the technical adequacy of proposed voluntary national tests and the consequences of using tests for high-stakes purposes such as tracking, promotion, and graduation. Even President Bush's new testing plan shows an understanding of the principle that we need independent verification of reported test score gains on state accountability tests.
The nation's leaders have long faced the problem of balancing pressures to ratchet up the amount of testing against uncertainty about how to ensure the validity of tests. Ten years ago, many policymakers embraced the move toward more authentic assessments as a corrective to the distortion and dumbing-down of curriculum, but the effort was largely abandoned because of cost and difficulties with reliability. We should remember that more comprehensive and challenging performance assessments can be made equal in reliability to narrower, closed-form machine-scorable tests, but doing so takes more assessment tasks and more expensive training of scorers. The reliability of multiple-choice tests is achieved by narrowing the curricular domain, and many states are willing to trade the quality of assessment for lower cost so that they can afford to test every pupil every year and in more subjects. Therefore, we will have to continue to evaluate the validity of these tests and ask what is missed when we focus only on the test. Policymakers and educators each have important roles to play in this effort.
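The claim that performance assessments can match multiple-choice reliability, but only with more tasks, follows from classical test theory: the Spearman-Brown prophecy formula projects how reliability grows as parallel tasks are added. The single-task reliability of 0.40 below is an invented illustrative figure, not a number from the studies cited in this article.

```python
def spearman_brown(rel_one_task, k):
    """Projected reliability of a test lengthened to k parallel tasks,
    given the reliability of a single task (Spearman-Brown formula)."""
    return k * rel_one_task / (1 + (k - 1) * rel_one_task)

# Hypothetical example: a single extended performance task with
# reliability 0.40 needs many such tasks to rival a long
# machine-scorable test.
for k in (1, 4, 9):
    print(k, round(spearman_brown(0.40, k), 2))
```

With these illustrative numbers, four tasks bring reliability to about 0.73 and nine to about 0.86, which is why comparable reliability is achievable but costly: each added performance task means more testing time and more scorer training.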
Preserve the integrity of the database, especially the validity of NAEP as the gold standard. If we know that the distorting effects of high-stakes testing on instructional content are directly related to the narrowness of test content and format, then we should reaffirm the need for broad representation of the intended content standards, including the use of performance assessments and more open-ended formats. Although multiple-choice tests can rank and grade schools about as well as performance assessments can, because the two types of measures are highly correlated, this does not mean that improvements in the two types of measures should be thought of as interchangeable. (Height and weight are highly correlated, but we would not want to keep measuring height to monitor weight gain and loss.) The content validity of state assessments should be evaluated in terms of the breadth of representation of the intended content standards, not just "alignment." A narrow subset of the content can be aligned, so this is not a sufficient criterion by itself.
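The height-and-weight analogy above can be made concrete with a small simulation: two measures driven by a shared trait correlate highly in cross-section, yet changes on one measure tell us little about changes on the other. All names and numbers here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

ability = rng.normal(0, 1, n)             # shared underlying achievement
mc = ability + rng.normal(0, 0.5, n)      # "multiple-choice" measure
perf = ability + rng.normal(0, 0.5, n)    # "performance assessment" measure

# The two measures rank students similarly: a high level correlation
level_corr = np.corrcoef(mc, perf)[0, 1]

# But gains on each measure, driven by different influences
# (e.g., format-specific coaching versus real learning), need not agree
mc_gain = rng.normal(0.5, 0.3, n)         # coached gain on the MC test only
perf_gain = rng.normal(0.0, 0.3, n)       # independent change on the other
gain_corr = np.corrcoef(mc_gain, perf_gain)[0, 1]

print(round(level_corr, 2), round(gain_corr, 2))
```

In this simulation the level correlation comes out near 0.8 while the gain correlation hovers near zero, which is exactly the sense in which highly correlated measures are not interchangeable for monitoring improvement.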
The comprehensiveness of NAEP content is critical to its role as an independent monitor of achievement trends. To protect its independence, it should be sequestered from high-stakes uses. However, some have argued that NAEP is already high-stakes in some states, such as Texas, and will certainly become more high-stakes if used formally as a monitor for federal funding purposes. In this case, the integrity of NAEP should be protected substantively by broadening the representation of tasks within the assessment itself (such as multiple-day extended writing tasks) or by checking on validity through special studies.
Evaluate and verify the validity of gains. Special studies are needed to evaluate the validity of assessment results and to continue to check for any gaps between test results and real learning. I have in mind here both scientific validity studies aimed at improving the generalizability of assessments and bureaucratic audits to ensure that rewards for high-performing schools are not administered solely on the basis of test scores without checking on the quality of programs, numbers of students excluded, independent evidence of student achievement, and so forth. Test-based accountability systems must also be fair in their inferences about who is responsible for assessment results. Although there should not be lower expectations for some groups of students than for others, accountability formulas must acknowledge different starting points; otherwise, they will identify as excellent those schools whose students merely started ahead.
Scientifically evaluate the consequences of accountability and incentive systems. Research on the motivation of individual students shows that teaching students to work for good grades has harmful effects on learning and on subsequent effort once external rewards are removed. Yet accountability systems are being installed as if there were an adequate research-based understanding of how such systems will work to motivate teachers. These claims should be subjected to scientific evaluation of both intended effects and side effects, just as the Food and Drug Administration would evaluate a new drug or treatment protocol. Real gains in learning, not just test score gains, should be one measure of outcome. In addition, the evaluation of side effects would include student attitudes about learning, dropout rates, referrals to special education, attitudes among college students about teaching as a career, numbers of professionals entering and leaving the field, and so forth.
Many have argued that the quality of education is so bad in some settings, especially in inner-city schools, that rote drill and practice on test formats would be an improvement. Whether this is so is an empirical question, one that should be taken seriously and examined. We should investigate whether high-stakes accountability leads to greater learning for low-achieving students and students attending low-scoring schools (again as verified by independent assessments). We should also find out whether these targeted groups of students are able to use their knowledge in nontest settings, whether they like school, and whether they stay in school longer. We should also try to assess how many students are helped by this "teaching the test is better than nothing" curriculum versus how many are hurt because a richer and more challenging curriculum is lost, along with the love of learning.
Locate legitimate but limited test preparation activities within the larger context of standards-based curriculum. Use a variety of formats and activities to ensure that knowledge generalizes beyond testlike exercises. Ideally, there should be no special teaching to the test, only teaching to the content standards represented by the test. More realistically, very limited practice with test format is defensible, especially for younger students, so they won't be surprised by the types of questions asked or what they are being asked to do. Unfortunately, very few teachers feel safe enough from test score publicity and consequences to continue to teach curriculum as before. Therefore, I suggest conscientious discussions by school faculties to sort out differences between legitimate and illegitimate test preparation. What kinds of activities are defensible because they are teaching both to the standards and to the test, and what kinds of activities are directed only at the test and its scoring rules? Formally analyzing these distinctions as a group will, I believe, help teachers improve performance without selling their souls. For example, it may be defensible to practice writing to a prompt, provided that students have other extended opportunities for real writing; and I might want to engage students in a conversation about occasions outside of school and testing when one has to write for a deadline. However, I would resolve with my colleagues not to take shortcuts that devalue learning. For example, I would not resort to typical test-prep strategies, such as "add paragraph breaks anywhere" (because scorers are reading too quickly to make sure the paragraph breaks make sense).
Educate parents and school board members by providing alternative evidence of student achievement. Another worthwhile and affirming endeavor would be to gather alternative evidence of student achievement. This could be an informal activity and would not require developing a whole new local assessment program. Instead, it would be effective to use samples of student work, especially student stories, essays, videotapes, and extended projects as examples of what students can do and what is left out of the tests. Like the formal validity studies of NAEP and state assessments, such comparisons would serve to remind us of what a single test can and cannot tell us.
Of these several recommendations, the most critical is to evaluate the consequences of high-stakes testing and accountability-based incentive systems. Accountability systems are being installed with frantic enthusiasm, yet there is no proof that they will improve education. In fact, to the extent that evidence does exist from previous rounds of high-stakes testing and extensive research on human motivation, there is every reason to believe that these systems will do more to harm the climate for teaching and learning than to help it. A more cautious approach is needed to help collect better information about the quality of education provided in ways that do not have pernicious side effects.
J. J. Cannell, Nationally Normed Elementary Achievement Testing in America's Public Schools: How All 50 States Are Above the National Average (Daniels, W. Va.: Friends for Education, ed. 2, 1987).
L. Darling-Hammond and A. E. Wise, "Beyond Standardization: State Standards and School Improvement," Elementary School Journal 85 (1985): 315-336.
R. J. Flexer, "Comparisons of Student Mathematics Performance on Standardized and Alternative Measures in High-Stakes Contexts," paper presented at the annual meeting of the American Educational Research Association, Chicago, Ill., April 1991.
J. P. Heubert and R. M. Hauser, High Stakes: Testing for Tracking, Promotion, and Graduation (Washington, D.C.: National Academy Press, 1999).
M. L. Koczor, Effects of Varying Degrees of Instructional Alignment in Posttreatment Tests on Mastery Learning Tasks of Fourth Grade Children (unpublished doctoral dissertation, University of San Francisco, San Francisco, Calif., 1984).
D. Koretz, R. L. Linn, S. B. Dunbar, and L. A. Shepard, "The Effects of High-Stakes Testing on Achievement: Preliminary Findings About Generalization Across Tests," paper presented at the annual meeting of the American Educational Research Association, Chicago, Ill., April 1991.
R. L. Linn, M. E. Graue, and N. M. Sanders, Comparing State and District Test Results to National Norms: Interpretations of Scoring "Above the National Average" (Los Angeles, Calif.: CSE Technical Report 308, Center for Research on Evaluation, Standards, and Student Testing, 1990).
L. M. McNeil, Contradictions of Control: School Structure and School Knowledge (London and Boston: Routledge and Kegan Paul, 1986).
L. M. McNeil, Contradictions of Reform: The Educational Costs of Standardization (New York: Routledge, 2000).
National Academy of Education, Assessing Student Achievement in the States: The First Report of the National Academy of Education Panel on the Evaluation of the NAEP Trial State Assessment: 1990 Trial State Assessment (Stanford, Calif.: 1992).
Office of Technology Assessment, U.S. Congress, Testing in American Schools: Asking the Right Questions (Washington, D.C.: OTA-SET-519, U.S. Government Printing Office, 1992).
L. A. Shepard and K. C. Dougherty, "Effects of High-Stakes Testing on Instruction," paper presented at the annual meeting of the American Educational Research Association, Chicago, Ill., April 1991.
M. L. Smith, The Role of External Testing in Elementary Schools (Los Angeles, Calif.: Center for Research on Evaluation, Standards, and Student Testing, 1989).
Lorrie A. Shepard (email@example.com) is dean of the school of education at the University of Colorado at Boulder and a member of the National Research Council's Board on Testing and Assessment.