
The Help That Hurts: Inside the Fundamental Unsolved Problem in Learning Science

Notes on Exploring the Assistance Dilemma in Experiments with Cognitive Tutors


The computer knows you’re stuck.

You’ve been staring at the algebra problem for forty seconds. You made one attempt. It was wrong. The system logged your error, updated its probability estimate of your knowledge state, and is now making a decision that generations of teachers have made by instinct, by exhaustion, by classroom management necessity, and occasionally by genuine pedagogical theory.

How much should it tell you?

This is not a minor design question. It is, according to two Carnegie Mellon researchers who have spent their careers building and breaking intelligent tutoring systems, “the fundamental open problem in learning and instructional science.” Kenneth Koedinger and Vincent Aleven published their accounting of it in 2007. The problem they named — the assistance dilemma — has not been solved since. It may not be solvable in the way we want it to be solved.

Here is what the dilemma actually is: giving students information helps them avoid errors and confusion. Withholding information forces them to generate and retrieve knowledge themselves, which appears to produce more durable learning. Both sides have theory behind them. Both sides have experimental support. Neither side has a principled account of when it is correct.

Every teacher navigates this dilemma dozens of times per class period. Every parent helping a frustrated child with homework navigates it. Every tutoring system, intelligent or otherwise, is built on some implicit resolution of it. And the honest answer, after decades of controlled experiments in real classrooms, is that we do not have the resolution.


The Two Sigma Problem, and Why It Doesn’t Answer the Question

In 1984, Benjamin Bloom published a finding that became a kind of founding myth for educational technology. Expert one-on-one human tutoring, he reported, produced learning improvements two standard deviations above conventional classroom instruction. Two sigma. In a field where 0.4 sigma is considered a meaningful effect, two sigma is transformational.

The implication seemed obvious: build systems that approximate expert tutors. The Cognitive Tutor project at Carnegie Mellon was one of the most serious attempts to do exactly that. Built on John Anderson’s ACT-R theory of cognition, Cognitive Tutors model student knowledge as a set of production rules — discrete if-then reasoning patterns — and track each rule individually. The system presents problems, evaluates each step against the production rule model, provides immediate feedback when steps are wrong, offers hints on demand, and moves students to new topics only when mastery criteria are met for each rule. It is, in structural terms, a plausible approximation of what a patient, expert tutor does.
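To make that loop concrete, here is a minimal sketch in Python. Every name in it is invented for illustration, and the mastery updates are toy arithmetic; the deployed tutors model hundreds of production rules and estimate mastery with Bayesian knowledge tracing, sketched at the end of this piece.

```python
from dataclasses import dataclass

MASTERY_THRESHOLD = 0.95  # illustrative cutoff, not the deployed value

@dataclass
class ProductionRule:
    """One if-then reasoning pattern, tracked individually."""
    name: str
    p_mastered: float = 0.2  # running estimate that the student knows this rule

@dataclass
class Step:
    """One step of a problem, tagged with the rule it exercises."""
    rule: ProductionRule
    correct_answer: str

def evaluate_step(step: Step, student_answer: str) -> bool:
    """Check one step against the rule model and give immediate feedback."""
    if student_answer == step.correct_answer:
        step.rule.p_mastered = min(1.0, step.rule.p_mastered + 0.15)  # toy update
        return True
    step.rule.p_mastered = max(0.0, step.rule.p_mastered - 0.05)      # toy update
    print(f"Not quite. This step uses the rule: {step.rule.name}")
    return False

def ready_for_new_topic(rules: list[ProductionRule]) -> bool:
    """Advance only when every rule meets the mastery criterion."""
    return all(r.p_mastered >= MASTERY_THRESHOLD for r in rules)
```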

It works. This is worth stating clearly, because what follows can sound like criticism of a failure, and it is not. By the time Koedinger and Aleven were writing, Cognitive Tutors were deployed in over 2,000 schools across the United States. Algebra students taught with the Cognitive Tutor curriculum scored 15 to 25 percent higher on standardized test items. On items requiring genuine problem-solving and representational reasoning — not just procedural recall — the advantage was 50 to 100 percent. Multiple third-party replications confirmed the gains. The LISP tutor produced 30 to 43 percent higher learning with 30 to 64 percent greater efficiency compared to standard programming environments.

These are not ambiguous numbers.

But here is what they do not tell you: why it works. The Cognitive Tutor curriculum differs from comparison conditions in content, sequencing, classroom practice, and software features simultaneously. To find out which specific interactive features are responsible — to test the design philosophy, not just the product — you need a different kind of experiment. Which is precisely what the rest of the paper provides.


What the Experiments Actually Show

The controlled experiments within Cognitive Tutor environments tell a story so consistent it almost looks like a conclusion.

Albert Corbett and John Anderson varied the timing of feedback in a LISP tutor: immediate correctness information versus flagged errors the student could investigate versus on-demand feedback versus no feedback at all. Learning outcomes were essentially equal across the three feedback conditions. Time to completion was not. Students with immediate feedback completed the problem set three times faster than students with no feedback.

Three times faster. Not three times more learned — three times faster at learning the same amount.

Jean McKendree’s experiments with feedback content pushed further. Students who received explanatory feedback — information about which goal they were working toward and which condition their error violated — made fewer errors on post-tests than students who received only yes/no correctness information. Richer information, better learning.

Hint content followed the same pattern. Principle-based explanatory hints produced more efficient learning than hints that simply gave away the answer. Mastery-based adaptive problem selection — giving students more practice on rules they haven’t mastered, less on rules they have — outperformed fixed problem sets in learning gains.
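In code, mastery-based selection can be as simple as preferring the problem whose unmastered rules have the largest combined mastery gap. A sketch under invented names and a scoring function of my own; the deployed selection logic is more sophisticated:

```python
def pick_next_problem(problems, p_mastered, threshold=0.95):
    """Prefer the problem whose unmastered rules have the largest mastery gap.

    problems:   dict of problem id -> set of rule names it exercises
    p_mastered: dict of rule name -> current mastery estimate
    """
    def gap(rules):
        return sum(threshold - p_mastered[r] for r in rules
                   if p_mastered[r] < threshold)
    return max(problems, key=lambda pid: gap(problems[pid]))

problems = {"p1": {"distribute", "combine_terms"}, "p2": {"combine_terms"}}
p_mastered = {"distribute": 0.40, "combine_terms": 0.97}
assert pick_next_problem(problems, p_mastered) == "p1"  # targets the weak rule
```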

Every comparison points the same direction: more information, sooner, wins. Give the feedback. Make it explanatory. Adapt the sequence. The data appears to say: resolve the assistance dilemma in favor of giving assistance.

Koedinger and Aleven flag the qualification in language that deserves to be read carefully.

“This conclusion should not be interpreted as sweeping support for information giving in general, because it is important to recall that these strategies were evaluated in a context in which students were engaged in active problem solving, which is an important kind of information/assistance withholding.”

Read that again. The information-giving advantage only holds within an interaction structure that already withholds the most important information of all: the solution. The experiments are answering a narrow question — given that students are doing the work of solving problems themselves, how much help should the system provide when they struggle? The answer is: quite a lot. But whether students should be solving problems at all, rather than studying examples or watching demonstrations, is a question these experiments do not address.

The dilemma hasn’t been resolved. It has been relocated.


The Rational Cheater

Suppose you are placed in front of a Cognitive Tutor. You have gaps in your prior knowledge that make the first problem genuinely incomprehensible to you. You cannot generate a first step. You are not avoiding effort; you simply do not have enough knowledge to productively attempt the problem.

The system offers hints. The hints are designed to be scaffolded — general principle first, then more specific guidance, and finally a bottom-out hint that simply tells you the answer. The intended path is sequential: read the principle, try again, read the specific guidance, try again, use the bottom-out hint only when all else fails.
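The intended ladder is simple to represent. A sketch, with hint text invented for an equation-solving step:

```python
# One hint ladder for a single step; the hint text is invented for illustration.
HINT_LADDER = [
    "Principle: to isolate x, do the same thing to both sides of the equation.",
    "Here, subtract 3 from both sides of x + 3 = 7.",
    "The answer is x = 4.",  # the bottom-out hint
]

class HintSequence:
    """Serve hints in order; the bottom-out hint is meant to come last."""
    def __init__(self, ladder):
        self.ladder = ladder
        self.level = 0

    def next_hint(self) -> str:
        hint = self.ladder[min(self.level, len(self.ladder) - 1)]
        self.level += 1
        return hint
```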

You skip straight to the bottom-out hint. You do this for every step of the problem. You have, in the language of the field, “gamed the system.”

Ryan Baker and colleagues documented this behavior pattern systematically. Approximately 10 percent of students in studied samples engaged in gaming — rapid guessing to trigger correctness feedback, immediate use of bottom-out hints, circumvention of the scaffolded hint sequence. Gaming correlated strongly with poor learning outcomes.
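Baker's actual detectors were machine-learned models over fine-grained interaction logs; the heuristic below is only meant to convey the flavor of the features involved. The field names and thresholds are invented, not Baker's:

```python
def looks_like_gaming(actions, fast_seconds=3.0):
    """Crude flag for gaming-style behavior on one problem step.

    actions: list of (kind, seconds_since_last_action) pairs, where kind is
    "attempt", "hint", or "bottom_out_hint". Thresholds are illustrative.
    """
    rapid_guesses = sum(1 for kind, dt in actions
                        if kind == "attempt" and dt < fast_seconds)
    raced_to_bottom = any(kind == "bottom_out_hint" and dt < fast_seconds
                          for kind, dt in actions)
    return rapid_guesses >= 3 or raced_to_bottom
```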

The standard interpretation is that gaming reflects disengagement, poor metacognition, or deliberate avoidance of effortful thinking. This interpretation is probably correct for some students. But Koedinger and Aleven offer a different reading, and it is more uncomfortable.

Some gamers, they suggest, may be behaving rationally. They lack the foundational knowledge to productively generate steps, so they use the hint system to reconstruct worked examples that they can then study. Crowley and Medvedeva found that medical students who engaged in gaming-like behavior early in a curriculum showed greater independent success on later problems — a pattern consistent with using early-problem hints as worked examples before developing independent competence.

If this interpretation is correct — and it is consistent with the evidence, if not proven by it — then gaming is not a metacognitive failure. It is adaptive behavior that emerges when the system’s information-withholding design is premature for a student’s current knowledge state.

The student is solving the assistance dilemma themselves. Because the system didn’t.

This is the most quietly devastating finding in the paper. The architecture designed to give students exactly the right amount of help at exactly the right moment is, for a meaningful subset of students, failing badly enough that they reverse-engineer it into a different kind of learning tool. The carefully scaffolded hint sequence becomes a worked-example generator. The interactive problem-solving environment becomes a passive demonstration system. The design intent is subverted not by malice but by necessity.

The open question is whether the system can be redesigned so that students don't have to do this. One experiment suggests it can.


The Intelligent Novice

Santosh Mathan and Koedinger ran an experiment that deserves more attention than it receives in the paper. Standard tutoring systems model desired performance against an expert target — correct answers, clean solution paths, no errors. Mathan built a different kind of model.

The “intelligent novice” tutor evaluated student performance against a model of novice reasoning that included not just errors, but error detection and correction. A student who made an Excel formula error and then caught it was not penalized in the same way as a student who made the same error and moved on. The system’s feedback was calibrated to the arc of the error, not just its presence.
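The shift is easy to state in code: the unit of evaluation becomes the trajectory through the error, not the single response. A sketch, with categories inferred from the description above rather than taken from Mathan's system:

```python
def grade_step_arc(made_error: bool, caught_it: bool, fixed_it: bool) -> str:
    """Classify one step by the arc of the error, not just its presence."""
    if not made_error:
        return "correct"                   # expert model and novice model agree here
    if caught_it and fixed_it:
        return "error_detected_and_fixed"  # credited, not penalized like a plain miss
    if caught_it:
        return "error_detected"            # the monitoring skill worked; partial credit
    return "error_unnoticed"               # the only arc treated as an outright failure
```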

Students in the intelligent novice condition showed better learning on all measures — immediate and robust, declarative and procedural.

The mechanism is unclear, and Koedinger and Aleven are honest about this. Learning curve analysis suggested the benefit appeared immediately rather than accumulating over time, which rules out a gradual metacognitive improvement explanation. But the alternative mechanism — that seeing downstream consequences of errors helps students rationalize and repair underlying misconceptions — is proposed, not proven.

What is clear is the design implication. Redefining the standard of desired performance to include the novice's actual developmental path, rather than demanding expert-level initial accuracy, produces better learning. The system stops penalizing students for being where they are and meets them there.

This is the intelligent resolution of the gaming problem that Roll et al.’s metacognitive instruction did not achieve. Roll’s help-seeking tutor reduced poor help-seeking behaviors during instruction. Geometry learning did not improve. The behavioral change did not transfer to domain competence. Teaching students to use the system correctly, when the system is mismatched to their knowledge state, is insufficient. Matching the system to the students works.


What Seventeen Years of Experiments Have Proven — and What They Haven’t

The assistance dilemma, Koedinger and Aleven argue, requires quantitative resolution: threshold parameters specifying when to give versus withhold assistance, analogous to Philip Pavlik's estimate that optimally spaced practice for fact association occurs in the 5 to 25 percent error rate range. This is the right framing. The field needs not just directional findings but calibrated criteria.

What the experiments establish, with reasonable confidence:

Within tutored problem-solving, immediate explanatory feedback outperforms delayed or absent feedback. Mastery-based adaptive sequencing outperforms fixed problem sets. Adding worked examples to an already-interactive tutor typically adds nothing — the tutor already provides, dynamically, what examples provide statically. Metacognitive instruction that changes help-seeking behavior does not reliably improve domain learning.

What the experiments leave unresolved:

The boundary conditions for when worked examples help (Mathan’s 2-of-4 result is inconclusive; Schwonke’s faded-example benefit and McLaren’s null result cannot be reconciled without a direct comparison). The mechanism of the intelligent novice tutor’s success. Whether the assistance dilemma can be resolved with universal parameters or requires condition-specific specification that varies by domain, student prior knowledge, and task structure.

The experiments reviewed by Koedinger and Aleven address the assistance dilemma at two scales: within a problem step (when to give correctness feedback) and within a problem (when to offer hints). They say almost nothing about the dilemma at larger scales: across problems (how to sequence examples and practice), across a curriculum (how to order topics), across instructional modes (when tutored practice is better than collaborative activity, demonstration, or independent reading).

The deeper problem is this. The experiments are run within Cognitive Tutors, on Cognitive Tutor curricula, with populations of students who are already enrolled in courses using these systems. The findings are internally valid. Their external validity — their transferability to other domains, other system architectures, other student populations — is assumed more than demonstrated.

Koedinger and Aleven know this. Their conclusion is not a claim to have solved the dilemma. It is a claim to have mapped it more precisely — to have identified where the answer is not, and to have built tools that bring the field closer to finding it.


The Question That Remains

You are still staring at the algebra problem. Forty seconds have become ninety. You made a second attempt. Also wrong.

The system is running Bayesian knowledge tracing. It has updated its estimate of the probability that you have mastered the relevant production rule. The estimate has fallen. The system is deciding whether to offer a hint.
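Bayesian knowledge tracing is one of the few pieces of this story with a standard published form (Corbett and Anderson's 1995 model). The update the system just performed looks roughly like this; the parameter values here are illustrative defaults, not fitted estimates:

```python
def bkt_update(p_known, correct, slip=0.10, guess=0.20, transit=0.10):
    """One Bayesian knowledge tracing update for a single production rule.

    p_known: prior probability the student has mastered the rule
    slip:    P(wrong answer | rule known)
    guess:   P(right answer | rule unknown)
    transit: P(learning the rule at this opportunity)
    """
    if correct:
        evidence = p_known * (1 - slip) + (1 - p_known) * guess
        posterior = p_known * (1 - slip) / evidence
    else:
        evidence = p_known * slip + (1 - p_known) * (1 - guess)
        posterior = p_known * slip / evidence
    # Right or wrong, the student may have just learned the rule from the attempt.
    return posterior + (1 - posterior) * transit

p = bkt_update(0.5, correct=False)  # first wrong attempt: estimate falls to ~0.20
p = bkt_update(p, correct=False)    # second wrong attempt: falls further, to ~0.13
```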

Here is what the research tells us: if it offers a hint, the hint should be explanatory, not just correctional. If you ask for additional hints, the sequence should scaffold toward the answer, not jump directly to it. If you can’t generate any productive attempt, the information-withholding architecture may be premature for your current knowledge state, and the system may be about to fail you in a way that will turn you into a rational cheater.

Here is what the research does not tell us: the exact threshold at which withholding becomes harmful. The precise knowledge state below which worked examples should replace problem-solving. The conditions under which faded examples bridge better than interleaved ones. Whether the benefit you would receive from a hint is the same benefit a different student in a different domain at a different point in their learning would receive from the same hint.

The computer knows you’re stuck. What it doesn’t know — what no system yet knows, and what the best researchers in the field acknowledge they don’t know — is exactly when helping you too much starts making things worse.
