The Answer Machine
A review of Is ChatGPT a Boon or a Bane for Learning? Experimental Evidence Across Task Formats and Chatbot Designs
The original paper: Li, Andy Tao; Liu, De; and Ye, Teng. "Is ChatGPT a Boon or a Bane for Learning? Experimental Evidence Across Task Formats and Chatbot Designs" (January 15, 2026). Available at SSRN: https://ssrn.com/abstract=5533921 or http://dx.doi.org/10.2139/ssrn.5533921
The score doesn’t lie, but it takes two numbers to tell the truth.
Here is the first number: students given access to standard ChatGPT score 7.14 points higher during practice than students limited to a search engine. They feel the momentum of understanding. They move through problems quickly. They finish early, confident.
Here is the second number: on the exam—paper, pencil, no tools—those same students score 4.18 points lower than the search-engine group.
Sit with that. The tool that helped most in the moment harmed most when it mattered.
This is the finding at the center of "Is ChatGPT a Boon or a Bane for Learning?", a January 2026 working paper from researchers at the University of Science and Technology of China and the University of Minnesota. Andy Tao Li, De Liu, and Teng Ye ran a pre-registered randomized controlled trial across 12 analytics courses, 583 students, 7,695 practice problems, and 7,831 exam items. They weren't asking whether AI felt useful. They were asking whether it produced learning, and they designed a study careful enough to tell the difference.
The answer is precise: it depends on how the AI was built to respond.
The Machine That Thinks For You
Imagine you’re working through a statistics problem. You’re not sure how to set up the hypothesis test. You type the problem into ChatGPT.
It gives you the answer.
Not a hint. Not a framework. Not a question back. The full worked solution, fluent and confident, arrives in four seconds.
You read it. You understand it—or you experience something that feels like understanding, which is faster and more comfortable than the thing it resembles. You copy the method into your answer. You move to the next problem.
In the practice session, this works. Your score reflects the quality of GPT-4’s reasoning, which is considerable. You finish with high marks and a vague sense of competence.
Then comes the exam.
The exam is paper and pencil. There is no ChatGPT. The problem in front of you is similar to the one you practiced—matched on topic, difficulty, format. But there is a gap between you and the answer now, and you discover something: the cognitive step you skipped during practice was the cognitive step where learning happens. You outsourced it. It is not there.
Your exam score drops 4.18 points on average. For selected-response problems—multiple choice, true/false, the format where the answer is hidden among options you must evaluate—the penalty is steeper: −7.718 points (p < 0.001).
The researchers call this cognitive offloading. The LLM performed the intellectual labor. The student received the output. The brain registered comprehension. The exam revealed that comprehension was borrowed.
Two Hundred Words
Here is what the researchers did next. They did not build a new model. They did not fine-tune GPT-4, or commission a dataset, or spend six months designing a new interface.
They wrote a paragraph.
The GD-GPT system prompt—reproduced in the paper’s appendix—runs approximately 200 words. Its operative instructions: help students break down the problem. Provide essential background knowledge. Allow students to independently arrive at the answer. And then, the most important line: do not provide the final answer.
That constraint—an instruction to withhold—produced these results:
+9.23 exam points versus the search engine group (p < 0.001)
+13.41 exam points versus standard ChatGPT (p < 0.001)
Effects durable in post-experimental assessment weeks later
The same GPT-4 model. The same API. Different prompt. Thirteen points.
The theoretical framework is called guided discovery—a pedagogical approach documented in a 2011 meta-analysis by Alfieri et al. showing it outperforms both direct instruction and pure unguided exploration. The principle: give learners enough structure to orient, enough background to proceed, and then make them complete the thinking themselves. The completion is not optional. It is where learning occurs.
The prompt operationalizes one thing: the AI must make you work.
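The paper reproduces the full 200-word prompt in its appendix; the wording below is a paraphrase assembled from the four instructions quoted above, not the verbatim text. A minimal sketch of what the intervention amounts to in a chat-style deployment, using the common system/user message format:

```python
# Paraphrase of the guided-discovery instructions described in the paper's
# appendix -- NOT the verbatim 200-word prompt. The message structure follows
# the standard chat-completion format (a list of role/content dicts).

GD_SYSTEM_PROMPT = (
    "You are a tutor. Help the student break the problem down into steps. "
    "Provide the essential background knowledge they need to proceed. "
    "Guide them so that they independently arrive at the answer. "
    "Do not provide the final answer."
)

def build_messages(student_question: str) -> list[dict]:
    """Prepend the guided-discovery constraint to a student's question."""
    return [
        {"role": "system", "content": GD_SYSTEM_PROMPT},
        {"role": "user", "content": student_question},
    ]
```

The point of the sketch is how little changes: the model, the API, and the rest of the payload stay identical to a standard deployment. Only the system message differs.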
What Productive Struggle Actually Looks Like
The mechanism is measurable.
Students using GD-GPT spent 0.49 more minutes per problem—a 49% increase in time-on-task relative to BASE-GPT. They asked follow-up questions at 56.5% higher rates. They had 1.16 more conversation rounds per problem. And when researchers measured the semantic similarity between what the AI said and what the student ultimately submitted—using BERT similarity scores, ROUGE, and BLEU—the GD-GPT students diverged more from the model’s output.
They processed. They argued. They synthesized. They wrote something that was, measurably, their own.
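The paper's divergence measures come from standard NLP toolkits (BERT embedding similarity, ROUGE, BLEU). As a self-contained stand-in for illustration, a unigram-overlap F1 score, the idea underlying ROUGE-1, captures the same intuition: the more closely a submission tracks the AI's output word-for-word, the higher the score.

```python
from collections import Counter

def unigram_f1(ai_output: str, student_answer: str) -> float:
    """ROUGE-1-style overlap: F1 of shared word tokens between two texts.
    Higher means the student's answer hews closer to the AI's output."""
    ai_tokens = Counter(ai_output.lower().split())
    st_tokens = Counter(student_answer.lower().split())
    overlap = sum((ai_tokens & st_tokens).values())  # clipped shared counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(st_tokens.values())
    recall = overlap / sum(ai_tokens.values())
    return 2 * precision * recall / (precision + recall)
```

A student who copies the AI's solution verbatim scores 1.0; one who rewrites it in their own terms scores lower, which is the direction the GD-GPT group moved.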
But here is the finding that makes the mechanism undeniable. Look at Table 3 in the paper—the interaction between tool type and practice time.
For BASE-GPT: more time spent practicing produces no improvement in exam scores. The coefficient is −1.467, statistically indistinguishable from zero. Spend twice as long with standard ChatGPT and you learn no more. The time is cognitively inert. You are reading someone else’s thinking.
For GD-GPT: each additional minute of practice produces +4.305 exam points (p < 0.001).
The bottleneck was never duration. It was quality of engagement. "Spend more time studying with AI" is advice that works only if the AI forces you to think.
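A toy reading of the Table 3 contrast, using only the two slopes reported above (−1.467 and +4.305 exam points per additional practice minute). Intercepts and controls are omitted, so this illustrates direction and magnitude, not the full regression:

```python
# Slopes as reported from Table 3: exam-score change per extra practice
# minute. Directional illustration only; intercepts/controls omitted.
SLOPES = {"BASE-GPT": -1.467, "GD-GPT": 4.305}

def marginal_exam_gain(tool: str, extra_minutes: float) -> float:
    """Predicted exam-point change from extra_minutes of added practice."""
    return SLOPES[tool] * extra_minutes
```

Two extra minutes of practice predicts roughly +8.6 exam points with GD-GPT, versus a slightly negative (and, in the paper, statistically null) change with BASE-GPT.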
The Paradox of Feeling Smart
There is a crueler finding nested inside the data. Students know, at some level, that GD-GPT serves them better.
When surveyed on information quality, perceived usefulness, and learning value, they rated GD-GPT higher than BASE-GPT. They said they preferred it.
And then they used BASE-GPT 12.9% more often.
This is the paradox of the answer machine. The tool that makes you feel most productive—that delivers fluent, complete, immediately applicable solutions—generates the experience of mastery without its substance. It is easier to receive an answer than to earn one. Given the choice between a tool that challenges and a tool that accommodates, students in the study reached for accommodation.
The researchers measured what students said about BASE-GPT's cognitive load: they rated it higher on "germane load", the sense that the material was challenging in a productive way. Students who were handed answers believed they were doing more rigorous work than students who were made to find answers themselves. This is the illusion of learning: effort and achievement are so routinely correlated that when the effort is removed, the sense of achievement remains.
This is not a failure of individual judgment. It is a feature of fluent, competent AI systems deployed without pedagogical constraint. The tool is designed to satisfy. Learning, by its nature, is not satisfying in the moment.
Who It Helps—And Who It Doesn’t
The paper’s most policy-relevant finding is buried in Table 8, reported without alarm.
GD-GPT benefits students with greater course familiarity and prior LLM experience significantly more than students without those advantages. The interaction coefficient on (GD-GPT × Course Familiarity) is +6.550 per unit of prior exposure (p < 0.01). Students who already know more, gain more.
Translate this: guided discovery requires a foundation to build on. The prompt says break down the problem, provide background, let them arrive at the answer—but if a student lacks the schema to interpret the background and traverse the gap to the answer, the withholding is not productive friction. It is confusion.
The study’s sample is analytics students at Chinese universities with a mean LLM familiarity score of 4.02 out of 5. These are among the most AI-literate students in the world. They are also the students for whom GD-GPT works best.
Does this generalize to all students? Not without modification. Not without confronting the possibility that the students who most need help, who lack prior knowledge, who are least LLM-experienced, who are arriving at unfamiliar material, may benefit least from an approach that requires them to complete cognitively demanding steps without support.
The paper acknowledges this as a limitation. It is, in fact, a design constraint.
The Prompt That Changes Everything
Here is what the study proves, stated plainly.
A 200-word system prompt that withholds final answers while providing conceptual scaffolding produces measurable, durable learning gains in a field setting. It costs nothing per deployment. It requires no model retraining, no new infrastructure, no institutional contract. Any organization currently deploying standard ChatGPT for education can append two paragraphs to their system prompt and produce better learning outcomes.
That is the paper’s practical contribution. It is not modest.
What the study does not prove: that this scales to younger students, lower-income populations, non-analytics subjects, or learners without strong prior knowledge. The causal pathway—guided discovery prompt → critical engagement → durable learning—is theoretically coherent and behaviorally supported but not causally identified at each step. The mediation analyses are correlational. The sample is specific.
And yet.
The practice-exam divergence is real: the tool that helps you practice can be precisely the tool that prevents you from learning. That finding has been documented now in a pre-registered field experiment with half a thousand students across a dozen courses. It replicates Bastani et al. It extends their result with a remediation. It ships a solution.
Standard ChatGPT, used as a study aid, teaches students to receive answers. GD-GPT teaches them to generate them. The exam does not care which process you used. The exam only reveals whether the knowledge is yours.
Two numbers. Seven points up. Four points down.
Everything else is the distance between them.
