What Counts as Compelling Evidence in Education?

6/30/2026

When it comes to evaluating education program evidence, one study is rarely enough. Here's how to think about the full body of research before making program decisions.

Recently I have noticed a shift in how educators and district leaders are asking about the research that undergirds educational programs. The question used to be, “Does this work?” Increasingly, the question has become, “Do you have a randomized controlled trial?”

I understand why that question is being asked. Educators and education leaders want to make decisions grounded in strong evidence. They want to know whether the programs they choose are likely to help students learn. That is exactly the right instinct.

As someone who has spent my career in large-scale assessment, I think about evidence a little differently. In assessment, we do not decide whether test results can be trusted based on one study or one statistic. We look across many kinds of evidence from many sources to determine whether the results support the purpose for which the test is being used. Can educators trust what the assessment says about what students know and can do? Are the results useful for the decisions being made? Do the claims hold up across students, settings, and time?

This same perspective is useful when evaluating evidence for educational programs. It’s not a simple yes-or-no question. A better question is: How compelling is the full body of evidence, and does it support the decisions and actions the teacher, school, or district needs to take?

Strong Evidence Is a Portfolio, Not a Single Study

The strongest evidence in education is cumulative. It is built across multiple sources, methods, settings, and years of use. One study may answer an important question, but it rarely answers every question leaders need to ask before adopting, continuing, or scaling a program.

A strong evidence portfolio helps leaders understand not only whether a program can produce positive results, but also whether those results are likely to matter in their own context. Making an informed decision about a new program requires weighing many factors, including potential impact, implementation, student populations, instructional fit, teacher use, and local conditions.

This matters because educational programs do not operate in isolation. They enter classrooms that already have curricula, instructional models, intervention structures, assessment systems, professional learning priorities, and student needs. A program may have evidence of impact and still be a poor fit if it does not cohere with the broader classroom or district ecosystem.

The evidence question, then, should not stop at “Does it work?” It should extend to “Does it work for the purpose we need, with the students we serve, under the conditions we can support, and in ways that strengthen the instructional system we are trying to build?”

Different kinds of evidence contribute to that judgment:

Experimental and quasi-experimental studies can help estimate program impact.
Large-scale studies can show whether findings hold across varied students and settings.
Implementation studies can reveal whether the program is being used as intended and what supports educators need to do so.
Local data can show whether the program is producing meaningful results in a particular district’s own classrooms.
Case studies that pair empirical results from a single school or grade with interviews, surveys, or focus groups can offer insights that statistical analysis of large datasets alone cannot.

Results that tell the same story over multiple replications with different students, grades, and locations are more compelling than a single snapshot.

What Education Leaders Should Ask about Evidence

When evaluating the research behind a curriculum, assessment, intervention, or educational technology program, leaders should ask questions that connect evidence to the specific intended purpose of the program.

Question	Why It Matters
What purpose is this program meant to serve?	Evidence should be evaluated in relation to the intended use. A program used for core instruction, supplemental practice, intervention, or progress monitoring may require different kinds of evidence.
What student outcomes does the evidence address?	Strong evidence should focus on outcomes that matter for learning, not only outcomes that are easiest to measure.
Who was included in the research?	Leaders need to know whether the evidence reflects students similar to those they serve, including students with different starting points, backgrounds, and learning needs.
Under what conditions was the program implemented?	A study conducted with extraordinary training, coaching, or monitoring may not tell leaders enough about typical district implementation.
How was the program used by teachers and students?	Usage, fidelity, and engagement matter. If a program was only partially implemented, the evidence needs to be interpreted accordingly.
Does the program cohere with the instructional system?	Evidence is more meaningful when the program strengthens, rather than competes with, curriculum, instruction, assessment, and professional learning.
Do findings converge across methods, settings, and time?	Consistent evidence from multiple sources provides a stronger basis for action than any single study.
Can we examine our own data over time?	Local evidence helps leaders understand whether the program is working in their own classrooms, with their own students, under their own conditions.

Asking questions like these helps leaders evaluate whether the body of evidence is strong enough, relevant enough, and coherent enough to support their needs.

Why One Study Cannot Carry the Whole Evidence Burden

This brings us back to randomized controlled trials (RCTs). RCTs can be valuable. When well designed and well implemented, they can provide strong evidence that a program caused a difference in outcomes under the conditions of the study. That is important.

But it is also incomplete. An RCT can tell us something meaningful about impact in a particular set of conditions. It does not automatically tell us whether the program will work equally well across different classrooms, with different students, different teachers, different implementation supports, and different local constraints.

That is why I find the current fixation on RCTs understandable but too narrow. The desire for stronger evidence is exactly right. The narrowing of that desire to one study design is the problem.

RCTs are challenging to design and execute well amid the complexities of a classroom. The threats to the validity of RCTs have been well documented over decades, and many education researchers will attest that designing an RCT that yields generalizable results is extremely difficult. In response to these challenges, the field developed and has come to rely on quasi-experimental designs (QEDs). QEDs are a rigorous set of methodologies designed to evaluate impact in the complexities of the real world. Evidence from well-designed QEDs should not be thought of as less compelling than evidence from RCTs. In fact, I would argue the opposite. Frequently, the requirements and constraints of RCTs create conditions so artificial that the findings do not generalize to real-life settings.

A Higher Standard for the Evidence Conversation

We should hold educational programs to a high standard of evidence that they do what they claim to do. That means asking whether there is credible evidence of impact, whether findings hold across settings and student populations, whether implementation conditions are realistic, and whether the program supports strong teaching and learning.

It also means resisting the impulse to reduce evidence quality to a single yes/no question. The strongest evidence base brings together multiple sources of evidence in a coherent way that helps educators make a responsible decision for a particular purpose and context.

That is the evidence conversation education leaders deserve. More importantly, it is the evidence conversation students deserve.

About the Author

Kristen Huff, M.Ed., Ed.D., is Head of Measurement at Curriculum Associates, where she works with a team of assessment designers, psychometricians, data scientists, and researchers developing online assessments integrated with personalized learning and teacher-led instruction. Previously, she served as a senior fellow with the New York State Education Department and held leadership roles at several major assessment organizations. Kristen has more than 25 years of experience in K–12 large-scale assessment design and validation and has presented and published widely in educational measurement. She served as a technical advisor for the 2026 NAEP Frameworks in Reading and Mathematics and is first author of a chapter on assessment design in Educational Measurement, 5th Edition (Oxford University Press).
