You re-read the chapter. You highlighted the key sentences. You feel like you know it.

You don't.

Roediger & Karpicke showed this in 2006 at Washington University. Students who re-read material were more confident in their learning, yet performed worse on the test a week later. Students who quizzed themselves felt less confident but remembered substantially more.

This is the metacognitive illusion at the heart of learning design. The strategy that feels effective is the strategy that fails. The strategy that feels difficult is the one that works.

And now we have the numbers to show how much it works.

Three Meta-Analyses, One Number

Between 2014 and 2021, three independent research groups ran comprehensive meta-analyses on the testing effect — the phenomenon where retrieving information from memory (via quizzes, tests, recall exercises) produces better learning than passively re-studying the same material.

They all found the same thing.

Rowland (2014) — Published in Psychological Bulletin. 61 studies, 159 effect sizes. Testing vs. restudying: g = 0.50 [95% CI: 0.42, 0.58].

Adesope, Trevisan & Sundararajan (2017) — Published in Review of Educational Research. 188 experiments, 272 independent effect sizes. The broadest analysis to date. Testing vs. restudying: g = 0.51.

Yang, Luo, Vadillo, Yu & Shanks (2021) — Published in Psychological Bulletin. Specifically focused on classroom quizzing. Overall effect on academic achievement: g = 0.50.

Three different teams. Three different methodologies. Three different publication years. The same number: half a standard deviation improvement from testing over re-reading.

This convergence is rare in psychology. It's the kind of evidence that survives replication crises.

What the Numbers Actually Mean

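A Hedges' g of 0.50 is a medium effect by Cohen's conventional benchmarks. It does not mean you'll remember 50% more. It means that the average learner who quizzes scores half a standard deviation higher on a delayed test than the average learner who re-reads. If scores are roughly normally distributed, that puts the average quizzer at about the 69th percentile of the re-reading group.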
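To make a standardized effect size concrete, here is a minimal Python sketch that converts g into two plain-language quantities. It assumes both groups' scores are roughly normal with equal spread, which real test data only approximate, and the function names are illustrative rather than taken from any of the cited papers.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percentile_of_average_tester(g: float) -> float:
    """Cohen's U3: share of the restudy group scoring below the
    average learner in the testing group."""
    return normal_cdf(g)


def probability_of_superiority(g: float) -> float:
    """Chance that a randomly chosen tested learner outscores a
    randomly chosen restudy learner."""
    return normal_cdf(g / math.sqrt(2.0))


if __name__ == "__main__":
    # Effect sizes quoted in this article.
    for g in (0.39, 0.50, 0.73, 0.93):
        u3 = percentile_of_average_tester(g)
        ps = probability_of_superiority(g)
        print(f"g = {g:.2f}: U3 = {u3:.1%}, P(superiority) = {ps:.1%}")
```

Under these assumptions, g = 0.50 corresponds to roughly a 64% chance that a randomly chosen quizzer outscores a randomly chosen re-reader. That is a meaningful edge, but far from a guarantee for any individual learner.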

But that's the baseline. The effect gets larger under specific conditions.

With feedback: g = 0.73. Rowland (2014) found that when people received feedback after quiz questions (seeing the correct answer), the effect nearly doubled, from g = 0.39 without feedback to g = 0.73 with it. This is one of the most actionable moderators in the literature: always show the correct answer.
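In a digital quiz, "always show the correct answer" can be enforced by the data model itself. The sketch below is one hypothetical way to do that: the grading function returns the correct answer whether or not the learner got the item right. The class and function names are assumptions made for illustration, and the exact-match grading is deliberately simplistic.

```python
from dataclasses import dataclass


@dataclass
class QuizItem:
    prompt: str
    answer: str


@dataclass
class Feedback:
    correct: bool
    correct_answer: str  # always populated, even for correct responses


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def grade(item: QuizItem, response: str) -> Feedback:
    """Grade a free-text response and always attach the correct answer."""
    is_correct = normalize(response) == normalize(item.answer)
    return Feedback(correct=is_correct, correct_answer=item.answer)


if __name__ == "__main__":
    item = QuizItem(
        prompt="What is the name for improved retention from retrieval?",
        answer="the testing effect",
    )
    result = grade(item, "spacing effect")
    print(f"Correct: {result.correct}. Answer: {result.correct_answer}")
```

Because the Feedback object cannot be built without a correct answer, a front end written against it has no feedback-free path to forget about.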

In classrooms: g = 0.67. Adesope et al. (2017) found that classroom settings produced larger effects than laboratory settings. This is unusual — lab effects typically shrink in the real world. But quizzing in actual educational contexts seems to benefit from additional motivational and social factors.

Combined with spacing: g = 0.74. Latimier, Peyre & Ramus (2021) meta-analyzed spaced retrieval practice, meaning quizzing distributed over time rather than bunched together. Spaced retrieval practice outperformed massed retrieval practice by a large margin (g = 0.74), so spacing adds a substantial benefit on top of testing itself.
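Distributing quizzes over time is easy to automate. The following sketch generates quiz dates from a list of gaps between sessions. The default gaps of 1, 3, 7 and 14 days are a hypothetical schedule chosen for illustration. The meta-analysis cited above is not the source of these specific intervals.

```python
from __future__ import annotations

from datetime import date, timedelta


def spaced_quiz_dates(
    first_study: date,
    gaps_in_days: tuple[int, ...] = (1, 3, 7, 14),
) -> list[date]:
    """Return one quiz date per gap, each counted from the previous session."""
    dates = []
    current = first_study
    for gap in gaps_in_days:
        if gap <= 0:
            raise ValueError("each gap must be a positive number of days")
        current = current + timedelta(days=gap)
        dates.append(current)
    return dates


if __name__ == "__main__":
    for quiz_day in spaced_quiz_dates(date(2024, 1, 1)):
        print(quiz_day.isoformat())
```

A massed schedule would put all four quizzes on the same day. The only design decision here is that the sessions are spread out, and the right gap lengths for a given course remain an open tuning question.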

Versus doing nothing: g = 0.93. Adesope et al. found that compared to no activity at all, practice testing produced nearly a full standard deviation improvement. If your course has no quizzes, no recall practice, nothing — the gap is enormous.

The Landmark Studies

Roediger & Karpicke (2006) — Published in Psychological Science (not Science magazine, as sometimes miscited). Students read prose passages, then either restudied or took free-recall tests without feedback. At 5 minutes, the restudy group performed better. At 1 week, the testing group dominated. Forgetting rates tell the story: the repeated study group forgot 56% of originally recalled material. The repeated test group forgot only 13%.
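A forgetting rate of this kind is a proportional loss: the drop between the immediate and the delayed test, divided by what was recalled immediately. The sketch below shows the arithmetic. The recall scores in the usage example are hypothetical round numbers, not the study's data.

```python
def forgetting_rate(immediate_recall: float, delayed_recall: float) -> float:
    """Share of initially recalled material that was lost by the delayed test.

    Both arguments are proportions of the material recalled (0.0 to 1.0).
    """
    if immediate_recall <= 0:
        raise ValueError("immediate_recall must be positive")
    return (immediate_recall - delayed_recall) / immediate_recall


if __name__ == "__main__":
    # Hypothetical scores: 80% recalled right away, 40% a week later.
    rate = forgetting_rate(immediate_recall=0.80, delayed_recall=0.40)
    print(f"Forgetting rate: {rate:.0%}")
```

This framing explains how a group can look better at 5 minutes and worse at 1 week: a higher starting point does not help if a larger share of it is lost.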

Karpicke & Blunt (2011) — This one made it to Science. Retrieval practice outperformed elaborative concept mapping by d = 1.50 — one and a half standard deviations. Students predicted concept mapping would work just as well. They were wrong. The advantage held even when the final test was itself a concept map. Testing didn't just help with "test-like" tasks — it produced better learning across formats.

(Note: the d = 1.50 from the original study, with N = 80, is likely an overestimate. The meta-analytic average of g = 0.50 is more representative. But the directional finding — retrieval practice beats elaborative study — has held up in replications.)
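One way to see why a single small study gives an imprecise estimate is to compute an approximate confidence interval for d. The sketch below uses the standard large-sample approximation for the variance of Cohen's d. The per-group sizes in the example are assumptions for illustration. Only the total N appears above, not how participants were split across the compared conditions.

```python
import math


def cohens_d_confidence_interval(
    d: float, n1: int, n2: int, z: float = 1.96
) -> tuple:
    """Approximate 95% CI for Cohen's d from two independent groups.

    Uses the common approximation
        var(d) = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)).
    """
    variance = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    standard_error = math.sqrt(variance)
    return d - z * standard_error, d + z * standard_error


if __name__ == "__main__":
    # Hypothetical group sizes; the second pair is a larger study.
    for n in (20, 200):
        low, high = cohens_d_confidence_interval(1.50, n, n)
        print(f"n = {n} per group: d = 1.50, CI = [{low:.2f}, {high:.2f}]")
```

With small groups the interval spans a wide range of plausible effects, which is one reason to lean on a meta-analytic average rather than on any single headline number.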

Why It Works (and What We Don't Know)

The honest answer: we know that retrieval practice works robustly, but the mechanistic picture is incomplete. Three hypotheses compete and overlap:

The Retrieval Effort Hypothesis (Pyc & Rawson, 2009): The harder the retrieval, the more durable the memory. Free recall (high effort) produces larger effects than recognition (low effort). This aligns with Bjork's "desirable difficulties" framework — conditions that feel harder during practice produce better long-term outcomes.

The Elaborative Retrieval Hypothesis (Carpenter, 2009): Searching memory activates related concepts, creating richer, more interconnected representations. During re-reading, no search occurs — the answer is right there. After testing, information is connected to more retrieval routes.

The Mediator Effectiveness Hypothesis (Pyc & Rawson, 2010): Failed retrieval prompts the learner to find better memory links. Subsequent study then strengthens these improved connections.

These hypotheses aren't mutually exclusive. Recent work (npj Science of Learning, 2024) suggests they operate together — effortful retrieval triggers elaborative processing, and both contribute to the effect.

The Complexity Debate

Van Gog & Sweller (2015) argued that the testing effect "decreases or even disappears as the complexity of learning materials increases." They framed this through cognitive load theory — highly complex material with many interacting elements overwhelms the retrieval benefit.

Karpicke & Aue (2015) rebutted this sharply. They pointed out that Van Gog & Sweller never provided a quantitative metric for "element interactivity," never experimentally manipulated complexity, and omitted relevant studies showing testing effects with complex materials.

The moderate position (Rawson, 2015): the testing effect holds for complex text materials. Very complex procedural skills are less studied. The weight of evidence does not support abandoning practice testing for complex content.

Transfer: The Honest Limitation

Here's where honesty matters. Pan & Rickard's 2018 meta-analysis (122 experiments, N = 10,382) specifically examined whether testing helps transfer — applying knowledge to new contexts, not just remembering the same material.

The answer: d = 0.40 [0.31, 0.50]. Real, but meaningfully smaller than the retention effect.

Transfer worked best for application and inference questions, medical diagnosis problems, and tests involving related concepts. Without favorable conditions, the transfer effect approached zero.

Testing is not a magic bullet for deep transfer. It's excellent for retention and good for transfer when designed thoughtfully — meaning the quiz questions need to resemble how the knowledge will actually be applied.

The Metacognitive Trap

This is the finding that matters most for anyone designing learning experiences.

Roediger & Karpicke (2006): Students who restudied were more confident in their knowledge. Students who tested themselves were less confident. The confident group performed worse. The less-confident group performed better.

Karpicke & Blunt (2011): Students predicted concept mapping would work as well as retrieval practice. Wrong again.

This is a systematic illusion. Re-reading produces fluency — you recognize the material, it feels familiar, and familiarity gets mistaken for mastery. Quizzing produces effort and uncertainty — you struggle to recall, it feels uncomfortable, and that discomfort gets mistaken for failure.

The strategy that feels like learning isn't learning. The strategy that feels like struggling is.

Bjork's "desirable difficulties" framework names this pattern: conditions that produce rapid performance during training (massed practice, immediate feedback, re-reading) often fail to produce long-term retention. Conditions that support durable learning (spacing, testing, interleaving) feel harder and produce slower apparent progress.

A difficulty is only "desirable" if the learner has enough background knowledge to engage with it. If quizzes are too hard and retrieval success is very low, the effect diminishes (Rowland, 2014). The sweet spot is challenging but achievable — roughly 75%+ retrieval success on initial quizzes.
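The sweet spot can be monitored with a simple check on initial quiz results. This sketch computes the share of items a learner retrieved successfully and flags quizzes that fall below a target. The 0.75 default mirrors the rough figure above, and the function names and labels are hypothetical.

```python
from __future__ import annotations


def retrieval_success_rate(results: list[bool]) -> float:
    """Fraction of quiz items answered correctly (True means success)."""
    if not results:
        raise ValueError("results must contain at least one item")
    return sum(results) / len(results)


def quiz_difficulty(results: list[bool], target: float = 0.75) -> str:
    """Label a quiz 'on_target' if success meets the target, else 'too_hard'."""
    rate = retrieval_success_rate(results)
    return "on_target" if rate >= target else "too_hard"


if __name__ == "__main__":
    first_attempt = [True, True, False, True, True, False, True, True]
    rate = retrieval_success_rate(first_attempt)
    print(f"Success rate: {rate:.0%} -> {quiz_difficulty(first_attempt)}")
```

A quiz flagged as too hard is a cue to add a worked example or a hint, or to move it later in the sequence, before asking learners to retrieve again.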

What This Means for Online Learning

The testing effect works in digital contexts, but effect sizes tend to be somewhat smaller. One in-lecture quiz study found d = 0.37 in a realistic online setting, compared with the meta-analytic average of g = 0.50.

Key moderators for digital quiz design:

  • Distractions reduce the benefit
  • Feedback amplifies the benefit
  • Question format matters (retrieval-based questions > recognition-based)
  • Self-paced quiz frequency predicts outcomes (Bognar et al., 2021: students who self-quizzed two to three times as often achieved higher grades)
  • Embedding quizzes within content is effective

The research is clear: courses with quizzes produce better learning outcomes than courses without them. The challenge is overcoming learner resistance — people prefer the fluent feeling of re-reading over the effortful feeling of self-testing.

The data says that resistance is precisely backward.

The Convergence

Three meta-analyses. Same number. g = 0.50.

With feedback: g = 0.73. Combined with spacing: g = 0.74. In classrooms: g = 0.67.

Over 100 years of research, from Gates (1917) to Yang et al. (2021). Pre-registered replications that held up through the replication crisis.

The testing effect is not a trend. It's not a pedagogical fad. It's one of the most robust findings in cognitive science, backed by the kind of converging evidence that most psychological claims can only dream of.

If you're designing any kind of learning experience — a course, a training program, an onboarding sequence — and you're not including retrieval practice, you are leaving roughly half a standard deviation of learning on the table.

That's not a marginal improvement. In Roediger & Karpicke's study, it was the difference between learners forgetting 13% and forgetting 56% of what they had initially recalled.

The quiz isn't the assessment. The quiz is the learning.

---

References: Roediger & Karpicke (2006) Psychological Science; Karpicke & Blunt (2011) Science; Rowland (2014) Psychological Bulletin; Adesope et al. (2017) Review of Educational Research; Pan & Rickard (2018) Psychological Bulletin; Yang et al. (2021) Psychological Bulletin; Latimier et al. (2021) Educational Psychology Review; Bjork & Bjork (2011); Van Gog & Sweller (2015), Karpicke & Aue (2015), and Rawson (2015) Educational Psychology Review; Pyc & Rawson (2009, 2010); Carpenter (2009); Bognar et al. (2021); Gates (1917).