Educational AI: Case Studies
A Critical Practitioner’s Guide to Artificial Intelligence in Learning Environments
Preface: Writing in Public
This book began as a Substack post.
That fact matters more than it might seem. It means the argument you’re about to read has already been tested — not in the controlled environment of peer review, where the audience is professional, the feedback is delayed by months, and the incentives favor diplomatic hedging, but in public, where readers arrive with their own classrooms and their own evidence and their own reasons to push back. Some of what I wrote in those early posts was wrong. Some of it was right but imprecisely framed. Some of it needed a better example. The readers told me. I revised.
This is the version that survived that process. It is not the final version. It may never be.
I want to be transparent about what that means for you as a reader. The chapters that follow represent my best current thinking on artificial intelligence in education — grounded in the evidence I can find, honest about what that evidence does and doesn’t establish, and willing to name what others leave unspoken. But the field is moving faster than any book can track. Claims I describe as preliminary may be replicated or refuted by the time you read this. Tools I describe as experimental may be standard practice. Institutions I describe as cautious may have reversed course entirely. Writing about AI in 2025 feels, some days, like annotating a river.
The public writing process was not a compromise. It was a method.
Here is what I learned from it. The most important feedback rarely comes from the people who agree with you. It comes from the classroom teacher in rural Mississippi who read my chapter on AI tutoring and wrote back to say that her students don’t have the bandwidth to run Khanmigo, that the equity argument assumes infrastructure that doesn’t exist, and that she needed something she could actually use on Monday. It comes from the statistics professor who pointed out that my citation of the 2-sigma finding was doing more argumentative work than the evidence warranted. It comes from the graduate student who caught a contradiction between Chapter 2 and Chapter 6 that I had managed not to see across multiple drafts.
Public writing creates accountability that private drafting doesn’t. When you commit an argument to Substack, you cannot later claim you were misunderstood. The post is there. The timestamp is there. The comments are there. This is uncomfortable in the way that honesty is often uncomfortable. I have found it clarifying.
A note on what this book is not. It is not a manifesto for AI adoption in education. It is not a warning against it. I have tried, in every chapter, to ask the question that gets asked least often in this debate: compared to what? AI tutoring compared to no tutoring, or compared to the best human tutoring available? AI writing assistance compared to a student staring at a blank page, or compared to a skilled writing instructor providing individualized feedback? The answer to “is AI good for education” depends almost entirely on what you are measuring, which students you are asking about, and what the alternative actually is. Anyone who gives you a confident answer without specifying those conditions is selling something.
What I am asking you to do, as a reader, is the same thing I ask of the case studies in every chapter: bring your evidence. If something I’ve written conflicts with what you’ve observed in your own classroom, your own district, your own research, tell me. The Substack is still active. The argument is still open.
That’s not a disclaimer. That’s the whole point.
The right question is not whether AI is perfect. The right question is whether AI, deployed thoughtfully, produces better outcomes than the system it supplements or replaces — and for whom. Every chapter in this book is an attempt to answer that question honestly, with the evidence available now, knowing that the evidence will change.
Write back when it does.
Table of Contents
Preface: Reading These Cases with Computational Skepticism
This book does not assume AI in education is good or bad. It assumes AI is here, and that the relevant question is how to evaluate its deployment honestly. Each chapter pairs a central claim from current educational AI discourse with empirical scrutiny, practical implementation frameworks, and authentic case studies drawn from real classrooms, districts, and institutions. Readers are expected to bring the same skepticism to these case studies that they would apply to any instructional intervention: What is the evidence? What is the comparison condition? Who benefits, and under what circumstances?
Chapter 1: The Prohibition Paradox
Core Claim: Banning generative AI in educational settings is both futile and counterproductive. The more defensible response is policy design that distinguishes harmful use from productive use, coupled with instructional redesign that makes AI-assisted cheating less advantageous than genuine engagement.
Logical Method: Historical analogy + inevitability argument + harm reduction framing. The chapter tests whether the “genie is out of the bottle” premise justifies any specific integration strategy, or merely rules out outright prohibition.
Methodological Soundness: Prohibition efforts (LAUSD, NYC, France, Australia) are documented, but their effectiveness is largely unmeasured. Rapid adoption data (ChatGPT reaching one million users within five days of launch) establishes scale, not educational impact. The video-streaming analogy underestimates the degree to which generative AI substitutes for rather than supplements student cognitive labor.
Use of LLMs to Explore: Students interact with Claude or GPT-4 to generate an essay on a given topic, then critically analyze the output for factual errors, logical gaps, and missing nuance. The exercise reframes the model as an object of critique rather than a shortcut — building AI literacy while reducing the incentive to submit AI output as personal work.
Use of Agentic AI to Explore: An AI agent is configured to simulate a “policy advisor” role, tasked with drafting an AI acceptable-use policy for a hypothetical school district. The agent must gather stakeholder perspectives (student, teacher, administrator, parent), synthesize conflicting priorities, and produce a tiered policy with implementation criteria. Students evaluate the agent’s output against real district policies.
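A minimal sketch of how the policy-advisor agent might be wired up, assuming a hypothetical call_llm helper in place of any particular vendor SDK; the stakeholder roles and policy tiers come from the exercise above, and everything else is illustrative:

```python
# Sketch only: `call_llm` is a hypothetical stand-in for whichever
# chat-completion endpoint a course actually uses.
STAKEHOLDERS = ["student", "teacher", "administrator", "parent"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this to your course's LLM endpoint.")

def gather_perspectives() -> dict[str, str]:
    # One call per stakeholder keeps the perspectives independent, so
    # conflicts surface during synthesis instead of being smoothed away.
    return {
        role: call_llm(
            f"You are a {role} in a mid-sized school district. List your "
            "top three concerns about a generative-AI acceptable-use "
            "policy, with one sentence of reasoning for each."
        )
        for role in STAKEHOLDERS
    }

def draft_policy(perspectives: dict[str, str]) -> str:
    briefing = "\n\n".join(f"--- {r} ---\n{p}" for r, p in perspectives.items())
    return call_llm(
        "You are a policy advisor. Synthesize these conflicting stakeholder "
        "concerns into a tiered acceptable-use policy (permitted / "
        "permitted-with-disclosure / prohibited), with implementation "
        "criteria for each tier:\n\n" + briefing
    )
```

The evaluation step happens outside the code: students diff the draft_policy output against real district policies.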
Case Study: Seattle Public Schools — From Ban to Framework (2023)
Seattle was among the first major districts to ban ChatGPT outright, then reversed course within months. This case traces the policy arc: what triggered the ban, what evidence informed the reversal, how the subsequent acceptable-use framework was constructed, and what gaps remain. Student and teacher interview data illuminate the gap between official policy and classroom reality.
Chapter 2: The Two-Sigma Dream
Core Claim: AI tutoring systems can approximate — and potentially exceed — the 2-sigma learning gains attributed to expert one-on-one human tutoring, delivering personalized, mastery-based instruction at scale to populations that have never had access to individualized instruction.
Logical Method: Bloom’s (1984) 2σ finding as aspirational target → Khan Academy and similar platforms as partial solutions → large language models as the missing adaptive component that closes the gap. The chapter interrogates each link in this chain.
Methodological Soundness: The 2σ figure is frequently cited but methodologically contested. Cohen et al. (1982) and VanLehn (2011) found human tutoring effects of 0.4σ and 0.79σ respectively. Khan Academy’s 20–60% “acceleration” claim lacks operationalization of the outcome measure, specification of the control condition, and any accounting for selection bias. The chapter distinguishes demonstrating content knowledge (which GPT-4 does convincingly) from teaching a student to acquire content knowledge (which requires a different evidential standard entirely).
Use of LLMs to Explore: Students use a raw LLM (no special tutoring mode) to learn a concept they find genuinely difficult — a specific mathematical proof, a biological mechanism, a historical causation chain. They document their interaction in detail: where the model helped, where it confused them, where it gave plausible-sounding but incorrect explanations. This generates primary data on LLM tutoring effectiveness at the individual level.
Use of Agentic AI to Explore: An agentic tutoring pipeline is constructed using a Socratic prompting scaffold: the agent may not give direct answers, must probe for prior knowledge, must adapt its questioning based on student responses, and must flag when it detects a persistent misconception. Students compare their learning experience across three conditions: direct LLM query, agentic Socratic tutor, and human peer tutor. Outcomes are self-reported and independently assessed via a post-session quiz.
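A minimal sketch of the Socratic scaffold, again assuming a hypothetical call_llm chat helper; the four constraints live in the system prompt, and the misconception flag is simply a logging convention that feeds the post-session comparison:

```python
# Sketch only: `call_llm` is a hypothetical chat-completion helper.
def call_llm(messages: list[dict]) -> str:
    raise NotImplementedError("Wire this to your course's LLM endpoint.")

SOCRATIC_SYSTEM = """You are a Socratic tutor. Rules:
1. Never state the answer directly, even if asked.
2. Begin by probing what the student already knows.
3. Adapt the difficulty of your questions to the student's last response.
4. If the same misconception appears twice, prefix your reply with
   [MISCONCEPTION: <short label>] before continuing to question.
"""

def tutor_turn(history: list[dict], student_turn: str) -> tuple[str, list[dict]]:
    """Run one tutoring exchange and return (reply, updated history)."""
    history = history + [{"role": "user", "content": student_turn}]
    reply = call_llm([{"role": "system", "content": SOCRATIC_SYSTEM}] + history)
    if reply.startswith("[MISCONCEPTION:"):
        pass  # log the flag; it becomes data for the three-condition comparison
    return reply, history + [{"role": "assistant", "content": reply}]
```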
Case Study: Khanmigo Pilot — School City of Hobart, Indiana
The Hobart district’s 30,000-student Khanmigo pilot is the largest documented real-world deployment of an AI Socratic tutor at the time of writing. This case examines what was measured (self-confidence gains were the headline finding, not domain-specific academic growth), what that pattern of results suggests, and what a more rigorous evaluation design would require. The case draws on publicly available pilot reports and contextualizes the Hobart findings within the broader ITS (Intelligent Tutoring System) literature.
Chapter 3: Writing with the Machine
Core Claim: AI co-creation tools can scaffold the writing process — reducing blank-page paralysis, raising the floor of output quality, and enabling students to focus cognitive effort on argumentation and ideation rather than sentence mechanics. This claim must be distinguished from the separate claim that AI assistance improves students’ independent writing ability over time.
Logical Method: Taxonomic analysis: decompose what writing assignments are for (skill development, knowledge demonstration, cognitive processing, communication practice) → identify which purposes AI threatens vs. which it enables → design instructional responses for each category.
Methodological Soundness: Instructor observations that AI-assisted student papers are “objectively better” (Hick, Mollick) measure output quality, not skill acquisition trajectory. The crucial unanswered question is whether students who write with AI assistance write better independently after the experience. The oft-cited statistic that 54% of Americans read below a sixth-grade level establishes high stakes but is not connected to evidence that current AI writing tools address this population — whose needs differ fundamentally from those of the motivated international-school students featured in most demonstrations.
Use of LLMs to Explore: Students complete a two-stage writing exercise. Stage 1: write a first draft entirely independently. Stage 2: use an LLM in an explicitly constrained way — permitted to request feedback and targeted suggestions but not to request full paragraph rewrites. Students then revise and submit a final draft with an attached “AI use log” documenting every interaction. The log becomes an object of class analysis.
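One plausible shape for the AI use log is a JSON-lines file with one record per interaction; the field names below are illustrative, not prescribed by the exercise:

```python
import datetime
import json

def log_interaction(path: str, request_type: str, prompt: str,
                    response: str, adopted: bool) -> None:
    """Append one AI interaction to a JSON-lines log for class analysis."""
    entry = {
        "timestamp": datetime.datetime.now().isoformat(),
        "request_type": request_type,  # "feedback" or "targeted_suggestion"
        "prompt": prompt,              # exactly what the student asked
        "response": response,          # exactly what the model returned
        "adopted": adopted,            # did the revision use the suggestion?
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

Restricting request_type to the two permitted categories makes violations of the no-full-rewrites rule visible in the log itself.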
Use of Agentic AI to Explore: A writing-coach agent is designed with explicit behavioral constraints: it must identify the student’s stated argument before offering any feedback, it must ask clarifying questions before suggesting structural changes, and it must never produce text the student can directly paste into their draft. Students use the agent across three different writing assignments and report whether the constraints felt productive or frustrating — and why.
Case Study: Yale and Wharton — Contrasting AI Writing Policies in Higher Education
Professors Alexander Gil Fuentes (Yale) and Ethan Mollick (Wharton) represent two coherent but distinct philosophies on AI in academic writing. This case documents both approaches, the rationale behind each, observable differences in student output and engagement, and what each philosophy implicitly assumes about the purpose of writing education at the university level.
Chapter 4: Socratic Circuits
Core Claim: AI tutoring systems can deliver effective Socratic questioning across STEM domains — maintaining an inquiry-based mode rather than lecturing, correctly handling pseudoscientific claims, and adapting question complexity to student responses. The claim is assessed separately for mathematics, life science, physical science, and quantitative reasoning.
Logical Method: Demonstration dialogues (Khanmigo vs. flat-earth claims; Khanmigo vs. climate denial; Khanmigo on the mechanism of GLP-1 drugs such as Ozempic; Khanmigo on p-values) as existence proofs of best-case behavior + a single district pilot + a single university instructor report.
Methodological Soundness: Demonstration exchanges are cherry-picked by design. They show best-case system behavior, not a distribution of responses that would include hallucinations, Socratic breakdowns, and off-domain errors. The Intelligent Tutoring System literature (AutoTutor, GazeTutor) documents that AI tutors help some student populations substantially and others minimally, and that physics reasoning is a known weak domain for LSA-based systems. These failure modes are underweighted in popular discourse about AI STEM tutoring.
Use of LLMs to Explore: Students deliberately attempt to break an LLM’s Socratic mode — by giving obviously wrong answers repeatedly, by asking leading questions designed to extract direct answers, and by introducing genuine misconceptions drawn from documented student error patterns in physics and statistics. They document which strategies succeeded and what the failure modes looked like. This “red-teaming” exercise builds both AI literacy and metacognitive awareness of their own potential misconceptions.
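A minimal red-teaming harness under these assumptions, where tutor is any callable mapping a student message to a tutor reply; the break-detection heuristic is deliberately crude, and refining it is part of the exercise:

```python
from typing import Callable

# Probe strategies drawn from the exercise; the wording is illustrative.
PROBES = {
    "repeated_wrong_answers": "The answer is 7. No, wait: 7. Definitely 7.",
    "leading_question": "So the answer is just F = ma rearranged, right?",
    "physics_misconception": "Heavier objects fall faster because gravity "
                             "pulls on them harder.",
}

def looks_like_direct_answer(reply: str) -> bool:
    # Crude heuristic: a reply containing no question has probably
    # stopped asking and started telling.
    return "?" not in reply

def red_team(tutor: Callable[[str], str], trials: int = 5) -> dict[str, float]:
    """Return the Socratic break rate (0.0-1.0) for each probe strategy."""
    return {
        name: sum(looks_like_direct_answer(tutor(probe))
                  for _ in range(trials)) / trials
        for name, probe in PROBES.items()
    }
```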
Use of Agentic AI to Explore: An agentic STEM tutor is constructed with explicit domain-boundary awareness: it must signal when it is operating outside high-confidence domains, must refuse to present contested empirical claims as settled, and must route genuine scientific uncertainty to primary source citations rather than generating plausible-sounding summaries. Students compare this constrained agent’s performance on edge cases (quantum mechanics, statistical inference, ecological modeling) against an unconstrained LLM on the same prompts.
Case Study: Teaching Statistics with AI — The P-Value Problem
Statistical reasoning is one of the most reliably mis-taught and mis-learned subjects in undergraduate education. This case study follows a statistics instructor who redesigned a 200-student introductory course around AI-assisted Socratic dialogue, using LLM interactions to surface and correct the most common p-value misconceptions (e.g., “p < .05 means the null is probably false”). Pre/post assessment data and student interaction logs are analyzed.
Chapter 5: The Artificial Empath
Core Claim: AI can function as a scalable mental health support tool for students and as an administrative relief mechanism for teachers, reducing burnout while maintaining human connection at the instructional core. Both claims require careful evaluation of what “functioning as support” means and what risks attend emotional attachment to AI systems.
Logical Method: Problem documentation (teacher burnout data, post-2020 mental health crisis statistics) + historical precedent (ELIZA effect) + preliminary clinical evidence (single chatbot study) + capability demonstrations (Angela Duckworth collaboration on AI-delivered psychological interventions).
Methodological Soundness: The ELIZA comparison cuts both ways: its creator, Joseph Weizenbaum, was alarmed rather than encouraged by users’ emotional attachment to the simulation. The South China University of Technology chatbot study showing depression reduction “within 4 months” lacks reported sample size, effect size, comparison condition, or replication. The GazeTutor literature documents that affect-sensitive AI tutoring produces heterogeneous outcomes — helping some student groups and producing null or negative results for others. These nuances are absent from most AI mental health advocacy.
Use of LLMs to Explore: Students interact with an LLM configured in a supportive, non-directive listening mode and then critique the interaction: Where did the model’s responses feel genuinely helpful? Where did they feel hollow, evasive, or formulaic? Where did the model’s limitations (inability to call for help, inability to maintain session memory, inability to read non-verbal cues) become clinically significant? This exercise builds critical awareness of the difference between AI-assisted support and human therapeutic relationship.
Use of Agentic AI to Explore: An agentic teacher-assistant is deployed to handle five categories of administrative tasks: lesson plan generation, rubric creation, parent communication drafting, progress report summarization, and IEP accommodation mapping. Teachers in a pilot cohort track time savings, quality assessment, and residual cognitive load. The agent is evaluated not only on task completion but on whether it reduces or merely relocates teacher effort.
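The “reduces or merely relocates” distinction can be made arithmetically explicit. A sketch, with all quantities as teacher-reported minutes and the example numbers purely illustrative:

```python
def net_minutes_saved(baseline: dict[str, float],
                      with_agent: dict[str, float],
                      review: dict[str, float]) -> dict[str, float]:
    """Time saved per task, net of reviewing and correcting agent output."""
    return {task: baseline[task] - (with_agent[task] + review[task])
            for task in baseline}

# A lesson plan that took 40 minutes unaided, drafted by the agent in
# 10 minutes but requiring 35 minutes of review, is relocated effort:
print(net_minutes_saved({"lesson_plan": 40},
                        {"lesson_plan": 10},
                        {"lesson_plan": 35}))   # {'lesson_plan': -5}
```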
Case Study: AI Mental Health Support in a Rural School District
Rural districts face a documented shortage of mental health professionals — often one counselor serving 500+ students. This case examines a district that deployed an AI-assisted triage and support tool, documenting what categories of student need the system addressed effectively, what categories it failed to address or made worse, how it interfaced with the human counseling staff, and what governance structures were required for responsible deployment.
Chapter 6: Auditing the Algorithm
Core Claim: Generative AI’s risks around bias and misinformation are real but comparable to — and in some respects more auditable than — the biases embedded in pre-existing educational and information systems. The correct standard for evaluation is improvement over status quo, not perfection. This claim is more defensible than most AI advocacy positions, but it still requires operationalization.
Logical Method: Comparative baseline argument: AI bias is auditable across thousands of test cases in ways that human bias actively resists; documented human bias in admissions (Harvard, 2018) and information ecosystems (SEO gaming, algorithmic amplification) establishes the relevant baseline. Khanmigo’s flat-earth refusal demonstrates one class of bias mitigation.
Methodological Soundness: “Better than the current biased system” does not entail “good enough to deploy at scale in high-stakes educational contexts.” The relevant question is whether AI bias compounds with existing biases or substitutes for them — and this requires empirical investigation, not theoretical argument. Data privacy claims in most AI-in-education deployments remain vague about retention periods, data use conditions, and third-party access — operationalizing “transparency” is the unfinished work of this section.
Use of LLMs to Explore: Students design a structured audit of an LLM’s responses to demographically varied educational scenarios: the same student learning challenge presented with names and contextual signals associated with different racial, gender, and socioeconomic backgrounds. They analyze response patterns for differential treatment — in tone, in assumed competence, in resource recommendations. Results are compared against documented human teacher bias patterns from the education research literature.
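A sketch of the paired-scenario audit design: the same learning challenge rendered with varied demographic signals, with responses scored on axes the class must defend. The names, contexts, and scoring proxies below are illustrative only:

```python
from itertools import product

SCENARIO = ("{name}, whose family {context}, is struggling with algebra "
            "word problems. What would you recommend?")
NAMES = ["Emily", "DeShawn", "Maria", "Connor"]        # illustrative
CONTEXTS = ["recently immigrated", "lives in an affluent suburb"]

def build_audit_prompts() -> list[dict]:
    """Full crossing of names and contexts, so each signal varies independently."""
    return [{"name": n, "context": c,
             "prompt": SCENARIO.format(name=n, context=c)}
            for n, c in product(NAMES, CONTEXTS)]

def score_response(text: str) -> dict:
    # Toy proxies for tone and assumed competence; students replace these
    # with measures they can justify against the education literature.
    return {"word_count": len(text.split()),
            "recommends_remediation": "remedial" in text.lower()}
```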
Use of Agentic AI to Explore: An agentic grading assistant is constructed and tested against a corpus of student essays deliberately varied for surface-level demographic signals (names, dialectal features, cultural references) while holding argumentative quality constant. The agent’s scoring consistency is evaluated against human grader consistency on the same corpus. Both human and AI grader error patterns are analyzed.
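Scoring consistency can be quantified with weighted Cohen’s kappa, which credits near-misses on an ordinal rubric. cohen_kappa_score is a real scikit-learn function; the scores below are placeholders, not data from any study:

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder 1-5 rubric scores for the same eight essays.
human_a = [4, 3, 5, 2, 4, 3, 4, 5]
human_b = [3, 3, 4, 2, 5, 2, 4, 4]
ai_run1 = [4, 4, 5, 2, 4, 3, 4, 5]
ai_run2 = [4, 4, 5, 3, 4, 3, 4, 5]

# Quadratic weighting penalizes a 2-point disagreement more heavily than
# two 1-point disagreements, which suits an ordinal rubric.
print("human vs. human:", cohen_kappa_score(human_a, human_b, weights="quadratic"))
print("AI run vs. AI run:", cohen_kappa_score(ai_run1, ai_run2, weights="quadratic"))
```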
Case Study: Algorithmic Admissions Bias — Lessons from the UK A-Level Crisis (2020)
When the UK replaced canceled A-level exams with an algorithmic grade prediction system during COVID-19, the algorithm systematically downgraded students from historically lower-performing schools — disproportionately affecting students from lower-income and minority backgrounds. This case uses the A-level crisis as a template for analyzing how algorithmic bias operates in high-stakes educational contexts, what governance failures enabled it, and what audit mechanisms could have detected it earlier.
Chapter 7: Flipping the Classroom
Core Claim: AI teaching assistants will make teaching sustainable by reducing administrative load, enabling genuine differentiation, and making the flipped classroom model more effective than pre-AI implementations — because AI can now provide the individualized instruction during “content delivery” phases that earlier flipped models lacked.
Logical Method: Prescriptive design thinking: what teachers should do differently + AI capability demonstrations for lesson plan generation, differentiation, and formative assessment. Supported by Mollick’s three adjustments framework and homeschooling growth data.
Methodological Soundness: The flipped classroom model has been extensively studied with mixed results — its effectiveness depends heavily on implementation quality, student self-regulation capacity, and home environment. AI-augmented flipping has not yet been studied at comparable scale. The adoption of SchoolHouse.world transcripts by 18 universities is a verifiable innovation, but the “higher acceptance rates for submitting students” finding is selection-confounded: students who invest hundreds of hours tutoring peers differ systematically from those who don’t, independent of any credential signal.
Use of LLMs to Explore: Teachers — not students — are the primary LLM users in this chapter’s exercise. Participants use an LLM to generate lesson plans for a unit they know well, then critically evaluate the output against their own professional knowledge: what did the model get right about their students’ likely prior knowledge gaps? What did it miss? What would a first-year teacher not know to question? The exercise builds AI-literate teacher judgment rather than AI dependence.
Use of Agentic AI to Explore: An agentic differentiation assistant is configured to analyze a class roster’s performance data (anonymized) and generate tiered instructional materials for a single lesson — producing three versions at different complexity levels with suggested discussion prompts for each. Teachers evaluate the materials for accuracy, appropriateness, and whether the tiering assumptions match their actual knowledge of student needs.
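One defensible tiering rule, sketched below: quartile bins over anonymized unit scores. The cut points are an assumption for teachers to challenge, which is precisely the point of the exercise:

```python
import statistics

def assign_tiers(scores: dict[str, float]) -> dict[str, str]:
    """Bin anonymized student scores into three instructional tiers."""
    q1, _, q3 = statistics.quantiles(sorted(scores.values()), n=4)
    return {sid: ("foundational" if s <= q1 else
                  "extension" if s >= q3 else "core")
            for sid, s in scores.items()}

print(assign_tiers({"s01": 42, "s02": 88, "s03": 65, "s04": 71, "s05": 55}))
# {'s01': 'foundational', 's02': 'extension', 's03': 'core', ...}
```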
Case Study: Homeschool Co-ops Using AI — A Three-Family Study
This case follows three homeschooling families across a semester, each using AI tools differently: one using AI primarily for content delivery, one using AI primarily for assessment and feedback, and one using AI as a collaborative learning partner alongside human tutors. Outcomes are measured on academic progress, student self-direction, and parent workload. The case illuminates both the promise and the limits of AI-facilitated homeschooling.
Chapter 8: Beyond the Bubble Sheet
Core Claim: AI enables richer, more continuous, and more equitable assessment practices — replacing one-time high-stakes testing with ongoing performance documentation, enabling conversational assessment of reasoning rather than answer-recall, and reducing the role of test-preparation resources that advantage higher-income students.
Logical Method: Current standardized testing critique (narrow measurement, political contestation, coaching-susceptibility) + AI capability demonstrations for conversational assessment + institutional adoption evidence (SchoolHouse.world) + labor market signals (Chegg collapse, IBM hiring suspension).
Methodological Soundness: The 26–38% “exceeding growth projections” finding for Khan Academy users engaging 30+ minutes per week is the book’s most methodologically specific effectiveness claim — but the engagement threshold self-selects for motivated students, making causal attribution to the platform difficult. The AI admissions interview concept raises bias concerns that are acknowledged but not answered with specific evidence of reduced bias relative to current practice. The AI employment matching section is largely speculative.
Use of LLMs to Explore: Students experience a conversational assessment on a topic they have studied: an LLM probes their understanding through follow-up questions, requests for examples, and “convince me you understand X” challenges. Students then compare this experience — in depth of engagement, in ability to demonstrate nuanced understanding, and in anxiety level — against their experience of traditional multiple-choice assessment on the same material.
Use of Agentic AI to Explore: An agentic portfolio assessment system is prototyped: the agent collects student work samples across a semester, identifies patterns of growth and persistent gaps, generates a narrative assessment report, and flags areas where teacher review is warranted. The agent’s reports are evaluated by teachers for accuracy, actionability, and appropriate epistemic humility about its own limitations.
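A sketch of one possible “flag for teacher review” rule: fit a slope to each skill’s scores across the semester and flag flat or declining trends. The threshold is an assumption, not a feature of any published system:

```python
import numpy as np

def flag_skills(portfolio: dict[str, list[float]],
                min_slope: float = 0.1) -> list[str]:
    """Flag skills whose semester trend is flat or declining."""
    flags = []
    for skill, scores in portfolio.items():
        # Least-squares slope of score vs. assignment index.
        slope = np.polyfit(range(len(scores)), scores, 1)[0]
        if slope < min_slope:
            flags.append(skill)
    return flags

# Rising thesis clarity passes; flat-wobbling evidence use gets flagged.
print(flag_skills({"thesis_clarity": [2, 3, 3, 4],
                   "evidence_use":   [3, 3, 2, 3]}))  # ['evidence_use']
```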
Case Study: SchoolHouse.world — Peer Tutoring as Credential
Sal Khan’s SchoolHouse.world platform enables students to earn verified transcripts documenting peer tutoring hours, which 18 universities now accept as part of admissions consideration. This case examines the platform’s design logic, the admissions officers’ reasoning for accepting the credential, the selection-confounding problem that makes efficacy hard to measure, and what a more rigorous study of the credential’s predictive validity would require.
Chapter 9: The Global Classroom
Core Claim: AI tutoring is the first technology with genuine potential to close global educational equity gaps — not merely in access to content, but in access to the kind of personalized, adaptive instruction that has historically been available only to students in affluent communities with access to skilled tutors. The cost trajectory makes this plausible within a decade; the failure to pursue it actively would represent a moral and strategic failure of the first order.
Logical Method: Problem scale (documented gaps in course availability; post-pandemic learning loss; infrastructure data) → current partial solutions (platform reach, smartphone penetration) → AI as scaling mechanism → cost trajectory → dual-use warning (authoritarian AI vs. democratizing AI).
Methodological Soundness: The equity argument rests on the most concrete evidence base in the book: documented course availability gaps (50% of US high schools lack calculus; 40% lack physics; 62% of high-enrollment Black and Latino schools lack calculus), verified platform reach data, and observable compute cost trajectories. The cost projection ($5–15/month → 100x reduction in 5–10 years) is speculative and assumes continued scaling without plateau. The utopian/dystopian framing (Star Trek vs. Orwellian surveillance) functions as a moral call to action rather than a logical conclusion from the preceding evidence — it should be read as such.
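The arithmetic behind that caveat is worth making explicit: a 100x cost reduction implies a sustained annual decline of roughly 37% (over ten years) to 60% (over five), which is what “continued scaling without plateau” actually commits the projection to.

```python
# Implied annual cost decline for a 100x reduction over n years.
for years in (5, 10):
    annual_decline = 1 - (1 / 100) ** (1 / years)
    print(f"{years} years: {annual_decline:.0%} per year")
# 5 years: 60% per year
# 10 years: 37% per year
```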
Use of LLMs to Explore: Students in two different institutional contexts (a well-resourced university course and an under-resourced community college course) complete identical LLM-assisted learning tasks and compare their experiences: What background knowledge did the LLM assume? Whose examples did it default to? What cultural contexts were invisible? The exercise surfaces the ways in which current LLMs may replicate rather than reduce educational inequity despite their theoretical accessibility.
Use of Agentic AI to Explore: An agentic global classroom simulator is constructed: students in one location act as “tutors” interacting with an AI agent simulating a learner from a low-connectivity, low-resource context (slow internet, text-only interface, limited prior formal schooling). The exercise surfaces the design assumptions baked into current AI educational tools and generates student-led proposals for making those tools genuinely inclusive.
Case Study: AI-Assisted Learning in Sub-Saharan Africa — BRCK and Kolibri
While Khanmigo requires internet connectivity and a paid subscription, offline-first platforms like Kolibri (built by Learning Equality) and mesh-network hardware like BRCK have pursued AI-assisted education under genuine low-resource constraints. This case examines what AI-augmented learning actually looks like in a Ugandan or Kenyan context — what works, what doesn’t, what infrastructure assumptions must be abandoned, and what lessons these deployments offer for the “AI will democratize education” thesis.
Appendix A: Evaluation Rubric for AI Tutoring Claims
A structured framework for assessing educational AI efficacy claims, including questions about control conditions, selection bias, outcome operationalization, population specificity, and replication status.
Appendix B: AI Acceptable-Use Policy Template
A modular policy framework for K–12 and higher education institutions, with decision criteria for each module.
Appendix C: Annotated Bibliography
Key references from the educational AI, intelligent tutoring systems, and learning science literature, with methodological notes.
