Preprint · Irreducibly Human Research Series · skepticism.ai · theorist.ai · irreduciblyhuman.com

Measuring the Struggle: Process Friction Traces as Independent Evidence of Genuine Learning in the Age of Generative AI

Nik Bear Brown
Associate Teaching Professor of Computer Science and AI, Northeastern University
Founder, Humanitarians AI (501(c)(3)) and Bear Brown and Company
Preprint · Not Peer Reviewed · Open Methodology · GLP Framework
Abstract

The artifact — the essay, the examination, the project, the recorded performance — has served as the primary evidence of genuine learning for as long as formal education has existed. This evidential relationship rested on a causal assumption: only the process of genuine learning could produce the artifact. Generative AI has broken this assumption. The artifact can now be produced without the process. Any assessment infrastructure built solely on artifact analysis therefore has a finite and shrinking evidentiary lifespan as generation technology improves.

This paper argues that process friction traces — the behavioral signatures that genuine human cognitive engagement leaves in observable data — constitute an independent evidence stream that compensates for what the artifact can no longer reliably tell us alone. We develop the Genuine Learning Probability (GLP) framework, a probabilistic methodology specifying seven observable friction components: temporal engagement pattern, error trajectory coherence, cross-context transfer, uncertainty calibration, social knowledge texture, retrieval strength decay signature, and scaffolding response curve.

The paper makes four claims. First, genuine human learning at each cognitive tier leaves characteristic friction traces arising from the irreducible complexity of the neurological processes that constitute learning. Second, these traces are partially independent of artifact quality and provide information about genuine engagement that the artifact alone does not contain. Third, the composite GLP score is substantially more robust to gaming than any individual component because manufacturing all friction traces simultaneously without performing the underlying cognitive work is essentially equivalent in cost to performing that work. Fourth, the GLP framework is not a replacement for artifact-based assessment but a formally specified second evidence stream that instructors can combine with artifact evidence in whatever proportion their professional judgment warrants.

Keywords: assessment · generative AI · learning analytics · friction traces · genuine learning probability · desirable difficulties · cognitive load theory · ensemble learning · formative assessment · process-based evidence
Claim 01

Genuine human learning at each cognitive tier leaves characteristic friction traces arising from the irreducible complexity of the neurological processes that constitute learning.

Claim 02

These traces are partially independent of artifact quality and provide information about genuine engagement that the artifact alone does not contain.

Claim 03

The composite GLP score is substantially more robust to gaming than any individual component — manufacturing all friction traces without genuine engagement is essentially equivalent in cost to genuine engagement.

Claim 04

The GLP framework is a formally specified second evidence stream. Instructors combine it with artifact evidence in whatever proportion their professional judgment warrants.

1. Introduction

The artifact used to be proof of the process. It no longer is.

For most of human educational history this statement would have been meaningless. The essay demonstrated thinking because only thinking could produce the essay. The proof demonstrated mathematical understanding because only mathematical understanding could produce the proof. The clinical note demonstrated clinical reasoning because only clinical reasoning could produce the clinical note. The artifact and the process that produced it were causally coupled tightly enough that measuring one was effectively measuring both.

Generative AI has severed this coupling. A well-structured essay can now be produced in seconds by a system that has performed none of the cognitive work the essay was designed to evidence. A correct proof can be generated by a system with no mathematical understanding in any sense that matters for the student's development. The artifact exists. The process that should have produced it did not occur.

This decoupling is not a temporary condition that will resolve as detection technology improves. It is a permanent structural change driven by the continuous improvement of generation technology. The forensic window — the period during which artifact analysis can reliably distinguish AI-generated from human-generated work — is closing sequentially across domains. In writing it is largely closed already. In code it is closing. In visual art it closed years ago.

The Structural Limitation of Detection

Any detector trained on current AI outputs becomes obsolete as generation technology advances. The arms race between generation and detection has a predictable winner. This paper proposes measuring what the artifact used to be evidence of directly — the process of genuine learning itself.

2. The Decoupling Problem

2.1 The Causal Chain That Broke

Traditional assessment rests on an implicit causal model:

Original: Genuine Learning → Cognitive Process → Artifact

The artifact was never the thing we cared about. We cared about the cognitive process — the schema formation, the conceptual development, the capacity for transfer. The artifact was valuable as evidence because it was causally downstream of the process. Generative AI inserts a bypass:

Bypassed: AI Generation → Artifact (identical)

The artifact now has two causal pathways. One passes through genuine cognitive process. The other bypasses it entirely. The artifact is identical at the end of both pathways — or will be, as generation technology improves.

2.2 Why Detection Cannot Solve This

Artifact-based AI detection attempts to distinguish the two causal pathways by analyzing properties of the artifact. This approach faces a decisive structural limitation: every detector is calibrated to the outputs of current generation systems, so each advance in generation technology erodes its accuracy and the forensic window closes (Section 8 develops this argument).

2.3 The Bjorkian Insight Applied

Robert and Elizabeth Bjork's foundational distinction between performance and learning is directly relevant. Performance is the observable, often temporary fluctuation in behavior during or immediately after instruction. Learning is the more permanent change in knowledge that supports subsequent access and transfer in novel contexts.

The artifact measures performance. AI assistance is the limiting case: it maximizes performance while minimizing — potentially eliminating — the learning process. What we need to measure is learning, not performance. Learning leaves traces that performance does not. These traces are not in the artifact. They are in the data that surrounds the artifact's production.

3. The Neurobiological Foundation of Friction Traces

3.1 Why Genuine Learning Leaves Traces

Friction traces are not metaphorical. They are behavioral consequences of neurobiological events that constitute genuine learning.

Neurobiological Basis

Dopamine neurons fire in response to prediction errors — discrepancies between what the learner expected and what they encountered. This phasic dopamine release facilitates NMDA receptor trafficking and initiates long-term potentiation. BDNF expression is upregulated as much as 2.8-fold during moderate cognitive challenge. Dendritic spine formation increases by 37% under moderate cognitive load compared to low-load conditions. An AI can produce the artifact without triggering any of these events. It cannot produce the behavioral traces those events leave, because the events did not occur.

3.2 The Storage-Retrieval Distinction

Bjork's New Theory of Disuse distinguishes storage strength — how thoroughly a memory is encoded and integrated — from retrieval strength — how accessible the memory currently is. High retrieval strength immediately after learning does not indicate high storage strength. A student who processed an AI explanation has high retrieval strength in the short term, but without the effortful encoding that builds storage strength, retrieval strength decays rapidly and the spacing effect — the benchmark of genuine learning — does not appear.

3.3 The Fluency Trap as a Measurement Problem

The brain confuses perceptual fluency — the ease with which information is processed — with understanding. This fluency trap has a counterintuitive implication: in the AI era, high artifact quality achieved through borrowed fluency may be mild negative evidence of genuine learning, because genuine struggle with difficult material characteristically produces roughness — places where the student lost the thread, found it again, approached the same concept from multiple angles before landing on a formulation.

The smooth, well-structured artifact may be evidence of borrowed certainty. The rough, searching artifact may be evidence of genuine engagement.

4. The Genuine Learning Probability Framework

4.1 Foundational Definitions

Let S denote a student, C a concept or skill being learned, and ℓ a learning episode spanning observation window Ω = [t₀, t₀ + τ]. Define the cognitive engagement state E as the set of neurological and behavioral processes activated during ℓ.

Genuine Learning Probability — Core Definition
GLP(S, C) = P(E genuine | Y)
where Y = (Y₁, Y₂, Y₃, Y₄, Y₅, Y₆, Y₇) is the vector of observable learning friction traces.

GLP is a property of the engagement process, not the artifact.
GLP is probabilistic and continuous — scored in [0,1] with explicit credible interval.
GLP is tier-sensitive — calibrated to the cognitive tier the activity is designed to develop.
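The three properties above can be sketched as a small data structure. The following Python container is illustrative only (the names and checks are assumptions, not the paper's implementation): a point probability in [0,1], an explicit credible interval, and a tier label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GLPScore:
    """Illustrative container for a GLP estimate. Names and checks are
    assumptions, not the paper's implementation."""
    point: float    # P(E genuine | Y)
    ci_low: float   # lower bound of the credible interval
    ci_high: float  # upper bound of the credible interval
    tier: int       # cognitive tier (1-7) the activity develops

    def __post_init__(self):
        # Enforce the three properties: probabilistic and continuous,
        # interval-explicit, tier-sensitive.
        assert 0.0 <= self.ci_low <= self.point <= self.ci_high <= 1.0
        assert 1 <= self.tier <= 7

score = GLPScore(point=0.72, ci_low=0.58, ci_high=0.84, tier=4)
```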

4.2 The Seven Components

Y₁ Temporal Engagement Pattern (TEP)
Y₁ = corr(d, t)
d = vector of item difficulty ratings · t = vector of observed time-on-task values across items (distinct from the window length τ defined in 4.1)

Genuine engagement with material of varying complexity produces characteristic time-on-task distributions. Intrinsic cognitive load — the element interactivity of the material — predicts engagement time when processing is genuine. Borrowed certainty decouples time from difficulty: the student spends time proportional to explanation length, not conceptual challenge.

Genuine: Positive correlation — time tracks difficulty
Borrowed: Near-zero correlation — time tracks output length
Infrastructure: LMS clickstream data — routinely collected, rarely analyzed for this purpose
Y₂ Error Trajectory Coherence (ETC)
Y₂ = Σ A(eᵢ, eᵢ₊₁) / Σ 𝟙[eᵢ ≠ eᵢ₊₁]
A = conceptual adjacency matrix — A(j,k)=1 if misconceptions j and k follow sequentially in typical developmental trajectories

The reward prediction error mechanism produces coherent error evolution during genuine learning. Each error is a prediction violation that updates the mental model. Because updates are cumulative, the error trajectory follows a path reflecting the concept's structure — early errors reflect initial misconceptions, later errors reflect more sophisticated partial understandings.

Genuine: Elevated Y₂ — errors follow conceptually meaningful paths
Borrowed: Y₂ ≈ 0 — errors random with respect to conceptual adjacency
Infrastructure: Sequenced formative assessment with misconception-coded items
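A minimal sketch of Y₂, assuming misconceptions are coded as integers and the adjacency matrix is stored as a dict. Both the codes and the error sequences below are hypothetical.

```python
def etc_score(errors, adjacency):
    """Y2: number of conceptually adjacent error transitions divided by
    the number of transitions where the error actually changed."""
    changed = [(a, b) for a, b in zip(errors, errors[1:]) if a != b]
    if not changed:
        return 0.0  # no error evolution observed
    coherent = sum(adjacency.get((a, b), 0) for a, b in changed)
    return coherent / len(changed)

# Hypothetical adjacency: misconception j -> j+1 is a typical
# developmental step, so A(j, j+1) = 1 and everything else is 0.
A = {(0, 1): 1, (1, 2): 1, (2, 3): 1}

genuine = [0, 0, 1, 1, 2, 3]   # errors march along the trajectory
borrowed = [2, 0, 3, 1, 0, 2]  # errors jump randomly
```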
Y₃ Cross-Context Transfer (CCT)
Y₃ = 0.4 · ρ_near + 0.6 · ρ_far
ρ_near = near transfer performance · ρ_far = far transfer performance on isomorphic problems with different surface features · the 0.6 weight on far transfer treats it as the stronger signal of genuine schema formation

Transfer — applying knowledge in novel contexts — is the Bjorkian definition of learning. Schema formation through germane cognitive load produces representations that generalize across surface variations. Borrowed certainty produces surface representations tied to the specific context of the AI explanation. The transfer gap ρ_near − ρ_far is independently diagnostic: large positive values indicate surface representation without schema.

Genuine: Elevated Y₃ — schema enables transfer across surface variations
Borrowed: Large transfer gap — pattern matching fails at surface variation
Infrastructure: Deliberately designed transfer problem sets with near and far transfer items — requires domain expertise in item design
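The composite and the diagnostic gap are one-liners; the proportions below are invented for illustration.

```python
def cct_score(rho_near, rho_far):
    """Y3 = 0.4 * rho_near + 0.6 * rho_far: far transfer is weighted
    more heavily as the stronger signal of schema formation."""
    return 0.4 * rho_near + 0.6 * rho_far

def transfer_gap(rho_near, rho_far):
    """rho_near - rho_far: a large positive gap indicates a surface
    representation without an underlying schema."""
    return rho_near - rho_far

# Hypothetical proportions correct on near- and far-transfer item sets.
y3_genuine = cct_score(0.85, 0.78)   # small gap, high composite
y3_borrowed = cct_score(0.85, 0.30)  # pattern matching fails at far transfer
```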
Y₄ Uncertainty Calibration (UC)
Y₄ = 1 − [(1/n) Σ(cᵢ − rᵢ)²] / B_ref
cᵢ = expressed confidence before item i · rᵢ = correctness (0 or 1) · B_ref = Brier score of a reference forecaster always predicting the base rate · Y₄_dir = mean(cᵢ − rᵢ) — direction of miscalibration

Genuine learning through effortful retrieval and prediction error produces calibrated uncertainty — the student learns not just what is correct but what they know and do not know. Borrowed certainty produces systematic overconfidence — the student inherits the AI's confidence distribution without the knowledge base that would justify it.

Genuine: Calibrated uncertainty — confidence tracks actual knowledge
Borrowed: Positive Y₄_dir — inherited confidence exceeds genuine performance
Infrastructure: Confidence elicitation embedded in regular assessment through minor LMS quiz modifications
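A sketch of Y₄ under the reading that B_ref is the Brier score of a forecaster that always predicts the base rate; the confidences and correctness values below are invented.

```python
def uc_score(confidence, correct, base_rate):
    """Y4 = 1 - B / B_ref: the student's Brier score B relative to a
    reference forecaster that always predicts the class base rate."""
    n = len(confidence)
    brier = sum((c - r) ** 2 for c, r in zip(confidence, correct)) / n
    brier_ref = sum((base_rate - r) ** 2 for r in correct) / n
    return 1.0 - brier / brier_ref

def uc_direction(confidence, correct):
    """Y4_dir = mean(c_i - r_i): positive values mean overconfidence."""
    return sum(c - r for c, r in zip(confidence, correct)) / len(confidence)

# Hypothetical pre-item confidences and item correctness (base rate 0.75).
y4 = uc_score([0.9, 0.8, 0.3, 0.6], [1, 1, 0, 1], base_rate=0.75)
y4_dir = uc_direction([0.95, 0.9, 0.9, 0.95], [1, 0, 0, 1])  # overconfident
```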
Y₅ Social Knowledge Texture (SKT)
Y₅ = (φ₁ + φ₂ + φ₃ + φ₄) / 8
φ₁ = personal encounter markers · φ₂ = below-surface engagement · φ₃ = productive uncertainty · φ₄ = real-time development · each scored 0–2

Genuine encounter with material leaves a characteristic texture in social and discursive contexts — specific confusions, particular connections to prior knowledge, questions that arose from genuine engagement. This texture cannot be manufactured without having had the experience of genuinely encountering the material.

Genuine: Specific confusions, real-time position changes, productive uncertainty
Borrowed: Generic statements, rehearsed texture, no real-time development
Infrastructure: Structured discussion coding with trained raters or validated rubric — most resource-intensive component; most diagnostically powerful for social dimensions
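Y₅ itself is just a normalized rubric sum; the four facet scores below are hypothetical rater outputs.

```python
def skt_score(phi1, phi2, phi3, phi4):
    """Y5 = (phi1 + phi2 + phi3 + phi4) / 8: four facets each scored
    0-2, so dividing by 8 normalizes the sum into [0, 1]."""
    for phi in (phi1, phi2, phi3, phi4):
        assert phi in (0, 1, 2), "each facet is scored 0, 1, or 2"
    return (phi1 + phi2 + phi3 + phi4) / 8

# Facets: personal encounter markers, below-surface engagement,
# productive uncertainty, real-time development (hypothetical scores).
y5_genuine = skt_score(2, 2, 1, 2)   # rich social texture
y5_borrowed = skt_score(1, 0, 0, 0)  # generic, rehearsed contributions
```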
Y₆ Retrieval Strength Decay Signature (RSDS)
Y₆ = ρ_treatment(t₃) − ρ_control(t₃)
Assessments at t₁ (immediate), t₂ (short delay), t₃ (long delay) · treatment group receives spaced retrieval practice between t₁ and t₂

The spacing effect is the benchmark of genuine learning. Borrowed certainty has no storage strength to retrieve. Performance decays monotonically and the spacing effect is absent. At the individual level the decay curve shape is diagnostic independently of the experimental design.

Genuine: Positive Y₆ — spaced retrieval benefits persist to long delay
Borrowed: Y₆ ≈ 0 — both groups show similar decay, no storage strength
Infrastructure: Spaced retrieval quizzes on previously covered material embedded in course design
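The group-level contrast is a difference of means at the long-delay assessment; the proportion-correct scores below are hypothetical.

```python
from statistics import mean

def rsds_score(treatment_t3, control_t3):
    """Y6: mean long-delay (t3) performance of the spaced-retrieval
    group minus the control group. Positive values indicate the
    spacing effect, and hence storage strength."""
    return mean(treatment_t3) - mean(control_t3)

# Hypothetical proportion-correct scores at the long-delay assessment t3.
y6_genuine = rsds_score([0.80, 0.75, 0.85], [0.60, 0.55, 0.65])   # spacing effect
y6_borrowed = rsds_score([0.40, 0.35, 0.45], [0.42, 0.38, 0.40])  # ~0: no storage
```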
Y₇ Scaffolding Response Curve (SRC)
Y₇ = [ρ(h₁) − ρ(h₀)] / [ρ(h₂) − ρ(h₀)]
h₀ = no hint · h₁ = partial structural hint · h₂ = full structural hint · Y₇_int = ρ_no-hint-subsequent − ρ_no-hint-initial (integration measure)

The Zone of Proximal Development is a structural property of a genuinely developing mental model. A student with genuine partial understanding has a ZPD — a region of near-competence that targeted scaffolding can activate. Borrowed certainty has no ZPD because there is no developing model for scaffolding to connect to.

Genuine: Elevated Y₇ — partial hint nearly as useful as full hint; positive integration
Borrowed: Low Y₇ — only the full hint produces improvement; requires the hint again on the next problem
Infrastructure: Structured hint experiments in formative assessment or tutoring contexts — implementable through adaptive assessment platforms at scale
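The ratio form of Y₇ can be sketched directly from success rates under the three hint conditions; the rates below are invented, and the degenerate-denominator handling is an assumption.

```python
def src_score(rho_no_hint, rho_partial, rho_full):
    """Y7 = [rho(h1) - rho(h0)] / [rho(h2) - rho(h0)]: the fraction of
    the full-hint benefit that a partial structural hint delivers."""
    denom = rho_full - rho_no_hint
    if denom <= 0:
        return 0.0  # full hint gave no benefit; the ratio is undefined
    return (rho_partial - rho_no_hint) / denom

# Hypothetical success rates under no hint, partial hint, full hint.
y7_genuine = src_score(0.40, 0.75, 0.80)   # partial hint activates the ZPD
y7_borrowed = src_score(0.40, 0.45, 0.80)  # only the full hint helps
```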

5. The Ensemble Architecture

5.1 Why Ensemble Rather Than Single Model

The seven GLP components have different statistical structures, different data types, and different failure modes. More importantly, the components fail — can be gamed — in different ways. A student gaming all seven simultaneously is performing work that approaches the cost of genuine engagement, at which point the gaming has become indistinguishable from learning in the only sense that matters.

5.2 The Three-Layer Architecture

Ensemble Architecture — Layer Summary

Layer 1 — Component Models
Function: one base model per friction component, using the algorithm suited to that component's data structure
Algorithms: survival analysis (Y₁) · graph model (Y₂) · gradient boosting (Y₃) · isotonic regression (Y₄) · NLP model (Y₅) · mixed-effects longitudinal (Y₆) · causal inference (Y₇)
Output: P(E genuine | Yᵢ)

Layer 2 — Tier-Conditioned Models
Function: seven combination models, one per cognitive tier, that learn optimal weighting of Layer 1 outputs conditional on which tier the activity develops
Algorithm: tier-specific ensemble weighting, learned from labeled data
Output: P(E genuine | Y, tier)

Layer 3 — Meta-Model
Function: final combination model; handles missing components gracefully by widening the credible interval to reflect reduced information
Algorithm: meta-ensemble
Output: GLP ∈ [0,1] with credible interval
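A minimal sketch of the Layer-3 step, assuming equal weights and a simple interval-widening heuristic. The paper's actual meta-ensemble is learned from labeled data, so everything here — the averaging rule, the weights, and the width formula — is illustrative.

```python
def meta_glp(component_probs, weights):
    """Combine available component outputs P(E genuine | Y_i) into one
    GLP point score with a credible interval. Missing components (None)
    are dropped; the interval widens as fewer components are observed.
    The weights and the width heuristic are illustrative assumptions."""
    observed = [(p, w) for p, w in zip(component_probs, weights) if p is not None]
    total_w = sum(w for _, w in observed)
    point = sum(p * w for p, w in observed) / total_w
    # Interval width grows with the fraction of missing components.
    width = 0.10 + 0.25 * (1 - len(observed) / len(component_probs))
    return point, max(0.0, point - width), min(1.0, point + width)

# Seven component probabilities; Y3 and Y5 unavailable for this student.
probs = [0.8, 0.7, None, 0.9, None, 0.6, 0.75]
point, lo, hi = meta_glp(probs, weights=[1.0] * 7)
```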

5.3 The Instructor as Meta-Model

The ensemble architecture produces a formally specified GLP score. The instructor receives this score alongside the artifact quality score and combines them into an overall assessment judgment. The appropriate weighting depends on: the learning objectives; the cognitive tier being developed; the stakes; and the context. The paper provides the second evidence stream. The weighting belongs to the educator.
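The instructor-as-meta-model step can be written as a convex combination of the two evidence streams. The linear form and the weight value below are illustrative choices, not a prescription from the paper.

```python
def overall_assessment(artifact_score, glp_score, w_process):
    """Blend artifact quality and GLP evidence. w_process encodes the
    instructor's professional judgment about how much weight the
    process evidence stream deserves in this context."""
    assert 0.0 <= w_process <= 1.0
    return (1 - w_process) * artifact_score + w_process * glp_score

# A polished artifact with weak process evidence, in a formative,
# process-oriented course where the instructor weights process at 0.6:
grade = overall_assessment(artifact_score=0.95, glp_score=0.35, w_process=0.6)
```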

6. Tier Calibration

Each tier of the Irreducibly Human framework engages distinct cognitive and neurological processes. GLP measurement without tier calibration conflates fundamentally different kinds of cognitive work.

Tier-Level Friction Signature Summary

Tier 1 · Pattern — statistical regularity detection. Primary components: Y₁, Y₃. Least diagnostically useful — pattern recognition is AI's home territory.
Tier 2 · Embodied — sensorimotor schema formation. Primary components: adapted Y₇, performance variation. Standard components must be adapted to physical performance contexts.
Tier 3 · Social — genuine intersubjective cognition. Primary component: Y₅. Social texture is the most resistant to manufacturing; requires genuine contact with another perspective.
Tier 4 · Metacognitive — oversight of one's own cognitive processes. Primary components: Y₄, Y₂. Calibration that develops over the course is the characteristic trajectory.
Tier 5 · Causal — counterfactual and interventionist reasoning. Primary component: Y₃. Transfer is the primary diagnostic — causal understanding enables cross-surface generalization.
Tier 6 · Collective — emergent group intelligence. Requires group-level analysis: individual GLP measures are structurally inadequate; look for contribution patterns showing genuine interdependence.
Tier 7 · Wisdom — practical judgment under genuine stakes. Evidence: decision histories and the specificity of expressed uncertainty. Standard assessment is almost entirely inappropriate — stakes define the tier.

7. Validation

7.1 Labeled Corpus Construction

The GLP framework requires a labeled corpus of confirmed genuine and confirmed borrowed engagement cases. Confirmed genuine engagement draws from students with documented engagement trajectories — convergent evidence across multiple components. Confirmed borrowed certainty draws from documented cases, including students who submitted AI-generated work later acknowledged in academic integrity proceedings, and experimental conditions in which students were explicitly instructed to use AI without engaging with the material.

7.2 The Information Gain Test

The central empirical claim: process measurement adds independent information about genuine learning beyond what the artifact provides. This claim is directly testable:

Information Gain from Process Measurement
ΔI = I(θ; A, G) − I(θ; A)
θ = confirmed genuine learning status · A = artifact quality score · G = GLP score · Positive ΔI confirms that the process evidence stream is not redundant with the artifact evidence stream — it provides independent information about genuine learning.
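One way to run the test is with a plug-in mutual-information estimate over discretized scores. The toy data below are synthetic and deliberately constructed so the artifact carries no information about θ (the decoupling limit), which makes ΔI strictly positive; the discretization and estimator are illustrative choices.

```python
from collections import Counter
from math import log2

def mutual_info(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for discrete samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def information_gain(theta, artifact, glp):
    """Delta I = I(theta; A, G) - I(theta; A) over discretized scores.
    Positive values mean the GLP stream is not redundant with the artifact."""
    joint = list(zip(artifact, glp))
    return mutual_info(theta, joint) - mutual_info(theta, artifact)

# Synthetic toy corpus: genuine status theta, indistinguishable artifact
# quality A, and a GLP score G that mostly tracks theta.
theta = [1, 1, 1, 1, 0, 0, 0, 0]
artifact = ["hi"] * 8
glp = ["hi", "hi", "hi", "lo", "lo", "lo", "lo", "hi"]
delta_i = information_gain(theta, artifact, glp)  # positive
```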

8. The Arms Race Problem and Why This Framework Is Different

For artifact-based detection the arms race objection is decisive. Generation technology improves continuously; detection technology is always calibrated to past outputs. The window closes.

For process-based measurement the objection is substantially weaker for two reasons. First, manufacturing convincing friction traces across all seven components simultaneously without performing the underlying cognitive work is essentially equivalent in cost to performing the underlying cognitive work. Second, the framework is not trying to detect AI use. It is trying to measure genuine learning directly.

A student who succeeds in manufacturing all seven friction traces has learned the material. At that point the gaming has become indistinguishable from learning in the only sense that matters. The framework has not been defeated. It has been satisfied.

9. Discussion

9.1 What This Paper Is Not Claiming

The framework is not an AI detector: it does not attempt to identify AI use (Section 8). Nor is it a replacement for artifact-based assessment: it is a second evidence stream, and the weighting between process and artifact evidence remains the instructor's professional judgment (Claim 04, Section 5.3).

9.2 The Institutional Design Implication

If process friction traces are independent evidence of genuine learning, then institutional assessment infrastructure should be designed to make those traces observable. This means: longitudinal process documentation as primary rather than supplementary evidence; embedded formative assessment as the primary data source; developmental trajectory as credential.

The Urgency Argument

Portfolio assessment, formative evaluation, and developmental credentialing have been advocated for decades. The argument of this paper is that the AI decoupling makes them urgent rather than merely desirable — that the institutional cost of not building process-observable assessment infrastructure is now the progressive obsolescence of artifact-based credentialing.

9.3 The Ethics of Process Observation

Process observation for the purpose of supporting learning is categorically different from process observation for the purpose of surveillance. The distinction is in what the data is used for and who controls it. The GLP framework is designed to support the first use. Institutional implementation must actively guard against the second.

10. Conclusion

The artifact has been decoupled from the process that used to produce it. The decoupling is irreversible, accelerating, and domain-general. Any assessment infrastructure built solely on artifact analysis has a shrinking evidentiary lifespan.

Process friction traces — the behavioral signatures that genuine human cognitive engagement leaves in observable data — are an independent evidence stream that compensates for what the artifact can no longer reliably tell us alone. They exist because genuine learning is a biological event that produces behavioral consequences. They are partially independent of artifact quality. They provide information about genuine engagement that the artifact does not contain.

The Genuine Learning Probability framework formalizes this evidence stream as a probabilistic, tier-calibrated, ensemble-based measurement methodology. It does not replace artifact assessment. It gives artifact assessment a partner. The instructor determines the weighting.

The crisis of evidence facing educational institutions is not a technical problem requiring a better AI detector. It is an epistemological problem requiring a new evidence infrastructure — one built on the process of learning rather than its products. The artifact used to be proof of the process. It no longer is. Now we must measure the struggle itself.

Open Methodology

The GLP framework implementation, labeled corpus, and assessment design templates are published openly at irreduciblyhuman.com. The methodology is not a secret.