
Twelve Research Paths Hidden in Plain Sight

Possible projects for Humanitarians AI Fellows


THIS ARTICLE IS FOR HUMANITARIANS AI VOLUNTEERS CONSIDERING LEARNING ENGINEERING RESEARCH

You’re reading a textbook evaluation that catalogs intelligent education systems. But embedded in each chapter is an unanswered question—the kind that keeps researchers awake, that separates marketing claims from measurable truth.

I’m going to show you twelve research projects hiding in this document. Not hypotheticals. Real investigations you could start tomorrow with access to the right data and the willingness to challenge assumptions everyone else accepts.

Project 1: The Passivity Paradox

The Claim: Traditional textbooks encourage passive learning. Intelligent systems make learning active.

What We Actually Don’t Know: Does the medium cause passivity, or does pedagogy?

Here’s the experiment no one has run: Take two groups of students. Give the first group a traditional textbook with explicitly designed active reading protocols—annotation guides, Socratic question prompts, synthesis exercises. Give the second group an intelligent textbook system with adaptive quizzes and instant feedback but no metacognitive scaffolding.

Measure not just completion rates but cognitive engagement: Do students in the traditional group who annotate systematically show deeper conceptual understanding than students in the intelligent group who click through generated quizzes? Do they transfer knowledge better six months later?

The hypothesis you’re testing: Format is less important than instructional design. If true, it means we’ve been solving the wrong problem. If false, it justifies the billions being invested in intelligent systems.

You’d need: 200+ students, pre/post assessments measuring conceptual understanding (not just recall), delayed transfer tests, and control for instructor quality. Timeline: one academic year. Difficulty: medium—requires institutional cooperation but no proprietary technology access.

Project 2: The Embedding Quality Blind Spot

The Claim: Retrieval-Augmented Generation grounds AI responses in authoritative content by embedding textbook passages as vectors and matching student queries via cosine similarity.

What We Actually Don’t Know: How often do embedding models encode conceptually related material as semantically distant?

The document provides the cosine similarity formula:

similarity(A,B)=A⋅B∥A∥∥B∥\text{similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}similarity(A,B)=∥A∥∥B∥A⋅B​

But this measures vector angle, not pedagogical relevance. If a student asks “Why do leaves change color?” and the embedding model represents “anthocyanin pigments” and “energy transfer during senescence” as low-similarity concepts, retrieval fails—even though both are essential to a complete answer.
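In code, that formula is a dot product and two norms. Here's a minimal sketch in plain Python with toy vectors (the vectors are illustrative stand-ins, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: (a . b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0.
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # → 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))            # → 0.0
```

Real systems run this over embedding vectors with hundreds or thousands of dimensions, but the math is identical, and so is the blind spot: the score reflects geometric angle, nothing more.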

Here’s what you’d investigate: Build a test corpus of 1,000 student questions from actual tutoring sessions in biology, chemistry, and physics. For each question, have expert instructors identify all textbook passages that would be pedagogically useful—not just factually relevant, but useful for building understanding given common student misconceptions.

Then run multiple embedding models (bge-m3, sentence-transformers, OpenAI’s ada-002) and measure: How often does the top-retrieved passage match expert judgment? More importantly, how often do high-similarity retrievals miss critical conceptual connections that a human tutor would make?
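Once the expert annotations exist, the headline metric is simple to compute. A minimal sketch, where the question and passage IDs are hypothetical placeholders for your annotated corpus:

```python
def top1_hit_rate(retrievals, expert_labels):
    """Fraction of questions whose top-retrieved passage is in the
    expert-annotated set of pedagogically useful passages."""
    hits = sum(
        1 for q, ranked in retrievals.items()
        if ranked and ranked[0] in expert_labels.get(q, set())
    )
    return hits / len(retrievals)

# Hypothetical data: per-question ranked passage IDs from one embedding
# model, and the passages experts judged pedagogically useful.
retrievals = {
    "q1": ["p3", "p7"],
    "q2": ["p9", "p2"],  # model ranked the useful passage second
    "q3": ["p1", "p4"],
}
expert_labels = {
    "q1": {"p3", "p5"},
    "q2": {"p2"},
    "q3": {"p1"},
}
print(round(top1_hit_rate(retrievals, expert_labels), 2))  # → 0.67
```

Run the same evaluation once per embedding model and the comparison table falls out directly.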

The hypothesis: Embedding models optimize for semantic overlap, not conceptual scaffolding. If retrieval accuracy drops below 70% on questions involving misconception repair, RAG systems are unreliable where students need them most.

You’d need: Access to tutoring session transcripts, expert instructor time for annotation, compute for running multiple embedding models, dataset of ~1,000 student questions. Timeline: 6 months. Difficulty: medium-high—requires domain expertise and computational resources.

Project 3: The Threshold That Changes Everything

The Claim: RAG systems retrieve textbook passages when student queries exceed a similarity threshold.

What We Actually Don’t Know: What is that threshold, and who decided it?

At 0.7 similarity, the system might retrieve too little—leaving students confused. At 0.5, it might retrieve too much—overwhelming them with tangentially related material. The document never mentions this threshold. That’s telling.

You’d design a threshold sensitivity analysis: Take 500 real student queries. For each query, compute similarity scores for all textbook passages in a standard introductory biology text. Then vary the threshold from 0.4 to 0.9 in 0.05 increments.

At each threshold, measure:

  • Precision: How many retrieved passages are actually useful?

  • Recall: How many useful passages were retrieved?

  • Cognitive load: Do students with 10 retrieved passages learn better or worse than students with 3?
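The precision/recall half of the sweep is a small script once you have similarity scores and expert labels. A sketch with made-up scores for a single query (a real analysis would aggregate over all 500):

```python
def precision_recall_at_threshold(scored, useful, threshold):
    """Precision and recall for passages retrieved above a similarity threshold.

    scored: dict mapping passage ID -> similarity score for one query
    useful: set of passage IDs experts judged useful for that query
    """
    retrieved = {p for p, s in scored.items() if s >= threshold}
    if not retrieved:
        return 0.0, 0.0
    tp = len(retrieved & useful)
    precision = tp / len(retrieved)
    recall = tp / len(useful) if useful else 0.0
    return precision, recall

# Hypothetical similarity scores for one query against five passages,
# with experts marking p1 and p3 as the useful ones.
scored = {"p1": 0.82, "p2": 0.55, "p3": 0.61, "p4": 0.48, "p5": 0.91}
useful = {"p1", "p3"}

# Sweep thresholds from 0.4 to 0.9 in 0.05 increments, as in the design above.
for t in [round(0.4 + 0.05 * i, 2) for i in range(11)]:
    p, r = precision_recall_at_threshold(scored, useful, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Even this toy example shows the trade-off: raise the threshold and precision climbs while recall collapses, which is exactly the curve you'd be plotting across real queries.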

The hypothesis you’re testing: There’s a trade-off between comprehensiveness and usability. You’ll likely find that the optimal threshold varies by student expertise—novices need fewer, more targeted passages; experts benefit from broader retrieval. If that’s true, fixed-threshold systems fail both groups.

You’d need: Biology textbook corpus, 500 authentic student questions, expert annotation of passage relevance, A/B testing platform. Timeline: 4 months. Difficulty: medium—technical but straightforward.

Project 4: The Ready-to-Use Fiction

The Claim: Platforms like Nolej and CourseMagic automatically generate “ready-to-use” learning materials—15+ activity types from uploaded content.

What We Actually Don’t Know: What percentage of auto-generated content contains errors or requires substantial editing?

You’d run a quality audit: Upload 20 textbook chapters across four disciplines (biology, history, economics, computer science) to three auto-generation platforms. Generate quizzes, flashcards, interactive videos, and summaries.

Then have two independent expert reviewers score each generated artifact on:

  • Factual accuracy (0-100%)

  • Pedagogical soundness (does this activity teach what it claims to teach?)

  • Readiness (can this be used as-is, or does it need editing?)

For activities requiring editing, measure: How long does an expert instructor need to fix it? If the average edit time is 30 minutes per quiz and a human-written quiz takes 45 minutes, the time savings are marginal—not the order-of-magnitude improvement marketing materials promise.
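That marginal-savings arithmetic is worth making explicit. A quick sketch using the hypothetical figures above:

```python
def relative_savings(edit_minutes, scratch_minutes):
    """Fraction of authoring time saved by editing generated content
    instead of writing it from scratch."""
    return (scratch_minutes - edit_minutes) / scratch_minutes

# Hypothetical figures from above: 30 min to fix a generated quiz
# vs. 45 min to write one by hand.
print(f"{relative_savings(30, 45):.0%} time saved")  # → 33% time saved
```

A 33% saving is real, but it's one-third, not the 10x that "ready-to-use" implies.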

The hypothesis: “Ready-to-use” conflates speed with quality. If more than 30% of generated content requires substantial revision, the efficiency claims collapse.

You’d need: Access to three platforms, 20 textbook chapters, 8 expert reviewers (2 per discipline across the four disciplines), time-tracking data. Timeline: 3 months. Difficulty: low-medium—requires reviewer time but no specialized technology.

Project 5: The 12% Dyslexia Mystery

The Claim: ProfJim’s avatar-based instruction increased retention by 12% for dyslexic learners.

What We Actually Don’t Know: Retention of what? Over what time period? Compared to what baseline?

This is the kind of statistic that sounds impressive until you ask basic methodological questions. You’d replicate and extend:

Recruit 100 dyslexic students. Randomly assign them to four conditions:

  1. Static text (traditional textbook)

  2. Static text with audio narration (audiobook)

  3. AI avatar instructor (ProfJim-style)

  4. Human video instructor (recorded lecture)

Teach the same biology content (cell structure) across all conditions. Test immediately after instruction (short-term retention), then again one week later (consolidation), then one month later (long-term retention).

Measure not just recall (name the organelles) but application (diagnose what’s wrong with a cell given symptoms).

The hypothesis you’re testing: Multi-modal presentation helps short-term retention but not long-term transfer. If the avatar condition shows gains that disappear after a week, the 12% improvement is cosmetic—students enjoy watching Aristotle explain mitosis, but they don’t internalize it.

You’d need: 100 dyslexic participants, four instructional conditions, pre/post/delayed assessments, IRB approval. Timeline: 6 months. Difficulty: medium—requires ethical approval and participant recruitment.

Project 6: The Comfort Illusion

The Claim: 100% of students in Google’s “Learn Your Way” pilot reported feeling “more comfortable” taking assessments after using AI-adjusted materials.

What We Actually Don’t Know: Does comfort correlate with learning?

This is the affective vs. cognitive problem. Students might feel comfortable because the material is easier, not because they understand it better. Comfort could signal reduced challenge rather than improved learning.

You’d design a comfort-performance disconnect study: Use Google’s system (or replicate its functionality) to adjust OpenStax biology content to three reading levels: 8th grade, 10th grade, 12th grade.

Give 300 students pre-tests to measure baseline knowledge, then randomly assign them to:

  • Challenge condition: Material one level above their tested ability

  • Comfort condition: Material matched to their tested ability

  • Control condition: Original 12th-grade material

Measure both self-reported comfort and actual learning gains (pre/post test on same difficulty level).

The prediction: Students in the comfort condition will report higher satisfaction but show lower learning gains than students in the challenge condition. If true, optimizing for comfort sacrifices learning. That’s not an argument against accessibility—it’s an argument for challenge within the zone of proximal development, which AI systems might not identify correctly.

You’d need: 300 students, adaptive content generation, validated pre/post assessments, satisfaction surveys. Timeline: 4 months. Difficulty: medium—requires content adaptation capability.

Project 7: The Scope Preservation Problem

The Claim: AI systems simplify vocabulary while maintaining conceptual scope.

What We Actually Don’t Know: Is this mechanically possible?

Try this yourself: Take a college-level passage about entropy. Simplify the vocabulary to 8th-grade level while preserving the full conceptual scope—including reversibility, statistical mechanics, and the second law of thermodynamics.

You’ll find it’s nearly impossible. Simplifying “entropy” to “disorder” loses precision. Removing “statistical mechanics” removes the mechanism. What remains is a vague approximation.

You’d conduct a scope degradation analysis: Select 50 passages from college-level physics, chemistry, and biology texts. Use AI systems to “re-level” them to 8th-grade, 10th-grade, and 12th-grade vocabularies.

Then have domain experts rate each simplified version on conceptual completeness: Does it preserve cause-and-effect relationships? Does it maintain quantitative relationships? Does it explain mechanisms or just describe phenomena?

The hypothesis: Vocabulary simplification systematically erodes conceptual scope. The more you simplify, the more you lose. If simplification reduces scope by more than 30%, “accessible” materials are teaching something different from what’s advertised.

You’d need: 50 textbook passages, AI simplification tools, 10 domain expert reviewers, rubric for conceptual completeness. Timeline: 3 months. Difficulty: low-medium—labor-intensive but straightforward.

Project 8: The Generic Feedback Trap

The Claim: AI systems like LearnWise generate draft feedback and rubric-based grades, streamlining instructor workflow via “human-in-the-loop” oversight.

What We Actually Don’t Know: How generic is AI-generated feedback, and do instructors actually revise it?

You’d run a feedback quality experiment: Collect 100 student essays from a university writing course. Have AI generate feedback using LearnWise or a comparable system. Separately, have human instructors provide feedback.

Give both versions to students (blinded to source) and measure:

  • Perceived helpfulness (student ratings)

  • Specificity (does feedback reference specific passages, or is it generic?)

  • Actionability (can students revise based on feedback alone?)

Then introduce the human-in-the-loop condition: Give instructors AI-generated feedback and ask them to revise it. Track:

  • How many instructors make substantive changes vs. minor edits vs. no changes?

  • How long do revisions take?

  • Does revised AI feedback match the quality of purely human feedback?

The hypothesis: Under time pressure, instructors rubber-stamp AI feedback without critical revision. If 70%+ of instructors make only cosmetic changes, the safeguard fails. The AI Sandwich becomes AI with a thin human veneer.

You’d need: 100 student essays, AI feedback system access, 20 instructors, student perception surveys. Timeline: 4 months. Difficulty: medium—requires instructor participation and blinded study design.

Project 9: The Integration Seamlessness Myth

The Claim: LTI and Common Cartridge standards enable seamless integration between intelligent textbook systems and LMS platforms.

What We Actually Don’t Know: How often does “seamless” actually mean “broken in subtle ways”?

You’d conduct an interoperability stress test: Take one intelligent textbook course package. Export it using Common Cartridge (IMSCC) format. Import it into five LMS platforms: Canvas, Blackboard, Moodle, Brightspace, Open edX.

Document every failure mode:

  • Do hierarchies flatten?

  • Do embedded scripts break?

  • Do custom HTML elements render correctly?

  • Do grading workflows sync properly?

  • Do student analytics transfer accurately?

Then measure the time required to manually fix broken imports. If fixing a Canvas import takes 5 hours and creating the course from scratch takes 8 hours, the “interoperability” saved you 3 hours—not the 20+ hours marketing materials suggest.

The hypothesis: Technical standards promise compatibility but deliver fragile integrations. If more than 40% of imports require manual intervention, interoperability is aspirational, not actual.

You’d need: Access to five LMS platforms, sample intelligent textbook course, technical documentation, time-tracking. Timeline: 2 months. Difficulty: low-medium—tedious but not complex.

Project 10: The Persona Accuracy Gamble

The Claim: AI systems synthesize learner personas when direct data is unavailable, achieving 85% correlation with actual human responses.

What We Actually Don’t Know: What happens to the 15% who are miscategorized?

You’d design a misclassification harm study: Use a persona-driven adaptive system. Track 500 students across a semester. For each student, record:

  • System-assigned persona (e.g., “Novice,” “Expert,” “Busy Professional”)

  • Student’s actual expertise (via pre-test)

  • Content delivered based on persona

  • Learning outcomes (post-test)

Identify students whose assigned persona mismatched their actual expertise. Measure: Did miscategorized students perform worse than correctly categorized students? How much worse?

If miscategorization reduces performance by 20%+, the 85% correlation stat is misleading—it means 15% of students are actively harmed by personalization.
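The core comparison is a grouped mean difference. A minimal sketch with invented records (the persona labels and scores are illustrative, not real data):

```python
def misclassification_gap(records):
    """Mean post-test score for correctly vs. incorrectly persona-matched
    students, plus the relative performance drop for the mismatched group.

    Each record: (assigned_persona, actual_persona, post_test_score)
    """
    matched = [s for assigned, actual, s in records if assigned == actual]
    mismatched = [s for assigned, actual, s in records if assigned != actual]
    mean_matched = sum(matched) / len(matched)
    mean_mismatched = sum(mismatched) / len(mismatched)
    drop = (mean_matched - mean_mismatched) / mean_matched
    return mean_matched, mean_mismatched, drop

# Hypothetical records: (system-assigned persona, pre-test persona, score).
records = [
    ("novice", "novice", 78.0),
    ("expert", "expert", 85.0),
    ("novice", "expert", 60.0),  # mismatch: expert given novice content
    ("expert", "novice", 55.0),  # mismatch: novice given expert content
    ("novice", "novice", 81.0),
]
m, mm, drop = misclassification_gap(records)
print(f"matched mean={m:.1f}  mismatched mean={mm:.1f}  relative drop={drop:.0%}")
```

With 500 real students you'd also want confidence intervals and controls, but this is the statistic the 85% figure quietly omits.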

The hypothesis: Persona-based adaptation helps when accurate, harms when inaccurate. If harm to the 15% outweighs benefit to the 85%, the system should default to non-personalized content.

You’d need: Adaptive learning system access, 500 students, pre/post testing, persona-assignment logs. Timeline: 6 months (one semester). Difficulty: medium—requires system access and longitudinal tracking.

Project 11: The Socratic Shortcut Problem

The Claim: Socratic AI (Khanmigo) prevents cheating by refusing direct answers and requiring students to explain their reasoning.

What We Actually Don’t Know: Can students game Socratic systems by outsourcing step-by-step thinking?

You’d conduct a Socratic misuse study: Give 200 students a problem set. Half can use Khanmigo with standard Socratic prompts. Half use traditional solution manuals.

Measure:

  • Time to complete problem sets (does Socratic guidance take longer?)

  • Accuracy (do students using AI make fewer errors?)

  • Transfer performance: One week later, give students novel problems without AI access. Can they solve them independently?

The critical test: If students using Socratic AI perform better on the problem set but worse on transfer problems, they’ve learned to follow AI guidance without internalizing problem-solving strategies. The system prevented direct answer-giving but enabled scaffolded dependency.

The hypothesis: Socratic AI delays cheating detection rather than preventing it. Students appear to be thinking (they’re responding to prompts), but they’re not developing independent reasoning capacity.

You’d need: 200 students, Khanmigo access, problem sets with transfer problems, pre/post/delayed testing. Timeline: 6 weeks. Difficulty: low-medium—requires AI access and student recruitment.

Project 12: The Multi-Agent Performance Collapse

The Claim: Future systems will use multi-agent architectures (learner agents, teacher agents, evaluator agents) for scalable personalization.

What We Actually Don’t Know: Does coordination overhead destroy usability?

You’d build a performance benchmark: Implement both single-agent and multi-agent RAG systems. Give each system 100 student queries of increasing complexity.

Measure:

  • Response time (seconds from query to answer)

  • Response quality (expert rating of pedagogical value)

  • Consistency (do agents sometimes contradict each other?)

Plot response time vs. query complexity. If multi-agent response time exceeds 10 seconds for complex queries, students will abandon the system—regardless of answer quality. Human patience has limits.
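The timing harness is the easy part; the hard part is wiring in the real systems. A sketch using Python's perf_counter, with stub handlers standing in for the actual single-agent and multi-agent pipelines (the handlers and their delays are invented for illustration):

```python
import time

def time_query(handler, query):
    """Return a handler's answer and its wall-clock latency in seconds."""
    start = time.perf_counter()
    answer = handler(query)
    return answer, time.perf_counter() - start

# Stand-ins for the two pipelines; a real benchmark would call the
# actual single-agent and multi-agent RAG systems here.
def single_agent(query):
    return f"answer to {query!r}"

def multi_agent(query):
    time.sleep(0.01)  # simulated coordination overhead per agent hop
    return f"reviewed answer to {query!r}"

queries = ["What is osmosis?", "Why do leaves change color?"]
for q in queries:
    _, t_single = time_query(single_agent, q)
    _, t_multi = time_query(multi_agent, q)
    print(f"{q!r}: single={t_single:.4f}s  multi={t_multi:.4f}s")
```

Swap the stubs for real pipelines, bucket the 100 queries by complexity, and the response-time-vs-complexity plot comes straight out of the logged latencies.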

The hypothesis: Multi-agent systems provide better answers at the cost of worse user experience. If response time doubles while quality improves by only 20%, the trade-off fails. Students need answers that are good enough delivered quickly more than perfect answers delivered slowly.

You’d need: Multi-agent framework (AutoGen or similar), RAG implementation, 100 test queries, expert quality raters, performance monitoring. Timeline: 4 months. Difficulty: high—requires distributed systems expertise.


The Pattern Across All Twelve

Notice what these research projects have in common: they test assumptions the document treats as settled. They measure outcomes the case studies don’t report. They ask questions the technology vendors don’t want answered.

These aren’t adversarial investigations. They’re necessary ones. The gap between what intelligent textbook systems promise and what they demonstrably deliver is where research lives. Every “critical juncture” claim, every “275% increase” statistic, every “ready-to-use” assertion is a hypothesis waiting for rigorous testing.

The research path forward isn’t building better AI systems. It’s measuring whether existing systems actually do what they claim—and where they fail, understanding why.

You can start any of these projects tomorrow. Most require institutional access you can negotiate, student populations you can recruit, and expertise you already have. What they require most is willingness to challenge the consensus that intelligent textbooks are inevitable progress.

Maybe they are. But we won’t know until someone runs these experiments.

The interesting part? Whichever way the data falls—whether intelligent systems prove transformative or merely incremental—the answer matters. Because we’re betting education’s future on technology we haven’t yet properly evaluated.

That’s not innovation. That’s faith.

Research replaces faith with evidence. Pick a project. Run the experiment. Tell us what you find.

Nik Bear Brown, Poet and Songwriter