bearbrown.co Preprint · Irreducibly Human Series

Measuring the Struggle: Process Friction Traces as Independent Evidence of Genuine Learning in the Age of Generative AI

Preprint · skepticism.ai · theorist.ai · irreduciblyhuman.com
Abstract

The artifact — the essay, the examination, the project, the recorded performance — has served as the primary evidence of genuine learning for as long as formal education has existed. This evidential relationship rested on a causal assumption: only the process of genuine learning could produce the artifact. Generative AI has broken this assumption. The artifact can now be produced without the process.

This paper argues that process friction traces — the behavioral signatures that genuine human cognitive engagement leaves in observable data — constitute an independent evidence stream that compensates for what the artifact can no longer reliably tell us alone. We develop the Genuine Learning Probability (GLP) framework, a probabilistic methodology specifying seven observable friction components: temporal engagement pattern, error trajectory coherence, cross-context transfer, uncertainty calibration, social knowledge texture, retrieval strength decay signature, and scaffolding response curve.

The paper makes four claims:

  1. Genuine human learning at each cognitive tier leaves characteristic friction traces arising from the irreducible complexity of the neurological processes that constitute learning.
  2. These traces are partially independent of artifact quality and provide information about genuine engagement that the artifact alone does not contain.
  3. The composite GLP score is substantially more robust to gaming than any individual component, because manufacturing all friction traces simultaneously without performing the underlying cognitive work is essentially equivalent in cost to performing that work.
  4. The GLP framework is not a replacement for artifact-based assessment but a formally specified second evidence stream that instructors can combine with artifact evidence in whatever proportion their professional judgment warrants.
Keywords: assessment · generative AI · learning analytics · friction traces · genuine learning probability · desirable difficulties · cognitive load theory · ensemble learning · formative assessment · process-based evidence
1. Introduction

The artifact used to be proof of the process. It no longer is.

For most of human educational history this statement would have been meaningless. The essay demonstrated thinking because only thinking could produce the essay. The proof demonstrated mathematical understanding because only mathematical understanding could produce the proof. The artifact and the process that produced it were causally coupled tightly enough that measuring one was effectively measuring both.

Generative AI has severed this coupling. A well-structured essay can now be produced in seconds by a system that has performed none of the cognitive work the essay was designed to evidence. A correct proof can be generated by a system with no mathematical understanding in any sense that matters for the student's development. The artifact exists. The process that should have produced it did not occur.

This decoupling is not a temporary condition that will resolve as detection technology improves. It is a permanent structural change driven by the continuous improvement of generation technology. The forensic window — the period during which artifact analysis can reliably distinguish AI-generated from human-generated work — is closing sequentially across domains. In writing it is largely closed already. In code it is closing. In music production it is closing, as practitioners note that AI-generated audio can already be processed to be spectrographically indistinguishable from professional human recordings.

The institutional response to this decoupling has been predominantly technical — build better detectors, run more sophisticated stylometric analysis, require oral defenses of written work. These responses are not wrong. They are insufficient and impermanent. Any detector trained on current AI outputs becomes obsolete as generation technology advances. The arms race between generation and detection has a predictable winner.

This paper proposes a different response. Rather than building better artifact detectors, we propose measuring what the artifact used to be evidence of directly — the process of genuine learning itself. Process friction traces are the behavioral signatures that genuine human cognitive engagement leaves in observable data. They exist because genuine learning is a biological event — a cascade of neurological processes including dopamine-mediated prediction error signaling, BDNF-driven synaptic consolidation, and dendritic spine formation — and like all biological events it produces behavioral consequences that are traceable in principle and increasingly measurable in practice.

The paper's contribution is the Genuine Learning Probability (GLP) framework: a formally specified, probabilistic, tier-calibrated methodology for measuring process friction traces as an independent evidence stream alongside artifact-based assessment.

2. The Decoupling Problem

2.1 The Causal Chain That Broke

Traditional assessment rests on an implicit causal model:

Genuine Learning → Cognitive Process → Artifact

The artifact was never the thing we cared about. We cared about the cognitive process — the schema formation, the conceptual development, the capacity for transfer. The artifact was valuable as evidence because it was causally downstream of the process.

Generative AI inserts a bypass:

AI Generation → Artifact

The artifact now has two causal pathways. One passes through genuine cognitive process. The other bypasses it entirely. The artifact can no longer be used to infer which pathway was taken.

2.2 Why Detection Cannot Solve This

Artifact-based AI detection faces three structural limitations.

Temporally bounded. Every detection methodology is trained on current generation technology. Generation technology improves continuously. Detection trained on today's outputs fails on tomorrow's.

Wrong question. The educationally relevant question is not whether a human typed these words but whether a human developed this understanding. Artifact-based detection catches AI use but not the broader category of bypassed cognitive process that AI use exemplifies.

Perverse incentives. Students learn to game the detector rather than engage with the material. The simulation gets better over time as students share strategies.

2.3 The Bjorkian Insight Applied

Robert and Elizabeth Bjork's foundational distinction between performance and learning is directly relevant here. Performance is the observable, often temporary fluctuation in behavior during or immediately after instruction. Learning is the more permanent change in knowledge that supports subsequent access and transfer in novel contexts.

The artifact measures performance. A student who used AI to produce an essay has performed well — the essay is good — but may have learned nothing. AI assistance is the limiting case: it maximizes performance while minimizing — potentially eliminating — the learning process.

What we need to measure is learning, not performance. Learning leaves traces that performance does not.

3. The Neurobiological Foundation of Friction Traces

3.1 Why Genuine Learning Leaves Traces

Friction traces are not metaphorical. They are behavioral consequences of neurobiological events that constitute genuine learning.

When a learner encounters material that genuinely challenges their current mental model — material in the zone of proximal development, where intrinsic cognitive load is appropriately calibrated to their expertise — several molecular cascades are triggered:

Dopamine prediction error signaling. Dopamine neurons in the midbrain fire in response to prediction errors — discrepancies between what the learner expected and what they encountered. This phasic dopamine release initiates long-term potentiation — the strengthening of synaptic connections that is the physical substrate of memory formation. Without the friction of encountering something that violates current understanding, the teaching signal does not fire and synaptic change does not occur.

BDNF upregulation. Brain-Derived Neurotrophic Factor (BDNF) expression is upregulated as much as 2.8-fold during moderate cognitive challenge. BDNF drives the MAPK/ERK signaling pathways that support long-term memory consolidation. Under low-challenge conditions — the conditions that AI assistance creates — BDNF upregulation is minimal and consolidation is weak.

Dendritic spine formation. Structural growth of new synaptic connection sites increases by 37% under moderate cognitive load compared to low-load conditions. These spines are the physical locations where memories are stored. Their formation requires effortful engagement. Passive processing of AI-generated explanations does not trigger the load necessary for spine formation at rates sufficient for durable learning.

The behavioral consequences of these neurobiological events are the friction traces the GLP framework measures. An AI can produce the artifact without triggering any of these neurobiological events. It cannot produce the behavioral traces those events leave, because the events did not occur.

3.2 The Storage-Retrieval Distinction and Its Measurement Implications

Bjork's New Theory of Disuse distinguishes storage strength — how thoroughly a memory is encoded and integrated with existing knowledge — from retrieval strength — how accessible the memory currently is.

The critical implication: high retrieval strength immediately after learning does not indicate high storage strength. A student who processed an AI explanation has high retrieval strength in the short term. But without the effortful encoding that builds storage strength, retrieval strength decays rapidly and the spacing effect — the benchmark of genuine learning — does not appear.

3.3 The Fluency Trap as a Measurement Problem

The brain confuses perceptual fluency — the ease with which information is processed — with understanding. A student who reads a clear AI explanation experiences genuine fluency. The information flows smoothly. The feeling of understanding is real. The understanding is not.

In the AI era, a smooth, well-structured artifact may be evidence of borrowed certainty. A rough, searching artifact may be evidence of genuine engagement.

4. The Genuine Learning Probability Framework

4.1 Foundational Definitions

Let S denote a student, C a concept or skill, and ℒ a learning episode spanning observation window Ω = [t₀, t₀ + τ]. Define the cognitive engagement state E as the set of neurological and behavioral processes activated during ℒ. E is genuine if it includes effortful retrieval, prediction error signaling, and schema construction. E is borrowed if the cognitive work was performed by an external system.

Genuine Learning Probability: GLP(S, C) = P(E genuine | Y), where Y = (Y₁, Y₂, Y₃, Y₄, Y₅, Y₆, Y₇) is the vector of observed friction components.

Three properties follow immediately. GLP is a property of the engagement process, not the artifact. GLP is probabilistic and continuous — the framework produces a score in [0,1] with an explicit credible interval rather than a binary classification. GLP is tier-sensitive — the expected friction signature for genuine engagement differs across cognitive tiers.

4.2 The Seven Components

Y₁ Temporal Engagement Pattern (TEP)

Genuine engagement produces characteristic time-on-task distributions correlated with item difficulty. Borrowed certainty decouples time from difficulty — the student spends time proportional to explanation length, not conceptual challenge.

Y₁ = corr(d, τ), where d is the item difficulty vector and τ the time-on-task vector.
Genuine: positive correlation. Time tracks difficulty because cognitive effort is calibrated to material demands.
Borrowed: near-zero correlation. Time tracks output length, not conceptual difficulty.
Infrastructure: LMS clickstream data. Routinely collected, rarely analyzed for this purpose.
Y₂ Error Trajectory Coherence (ETC)

The reward prediction error mechanism produces coherent error evolution during genuine learning. Each error is a prediction violation that updates the mental model. Because updates are cumulative the error trajectory follows a path reflecting the concept's structure.

Y₂ = Σᵢ A(eᵢ, eᵢ₊₁) / Σᵢ 1[eᵢ ≠ eᵢ₊₁], where A is the conceptual adjacency matrix and (e₁, e₂, …) the student's error sequence.
Genuine: elevated Y₂. Errors follow conceptually meaningful developmental paths.
Borrowed: Y₂ ≈ 0. Errors are random with respect to conceptual adjacency; no coherent model is evolving.
Infrastructure: sequenced formative assessment with misconception-coded items.
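The Y₂ ratio can be sketched over coded error sequences. The misconception codes and the dict encoding of the adjacency matrix A below are illustrative assumptions, not the paper's coding scheme:

```python
def error_trajectory_coherence(errors, adjacency):
    """Y2 sketch: fraction of consecutive distinct-error transitions that
    are conceptually adjacent, per sum A(e_i, e_{i+1}) / sum 1[e_i != e_{i+1}].

    errors: sequence of misconception codes from sequenced formative items.
    adjacency: dict mapping (code_a, code_b) -> 1 if conceptually adjacent
    (an illustrative sparse encoding of the adjacency matrix A)."""
    transitions = [(a, b) for a, b in zip(errors, errors[1:]) if a != b]
    if not transitions:
        return 0.0  # no error evolution observed
    return sum(adjacency.get(t, 0) for t in transitions) / len(transitions)

# Hypothetical misconception codes for a multi-digit arithmetic unit.
A = {("place_value", "carrying"): 1, ("carrying", "regrouping"): 1}
coherent = ["place_value", "carrying", "carrying", "regrouping"]
incoherent = ["regrouping", "place_value", "regrouping"]
```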
Y₃ Cross-Context Transfer (CCT)

Transfer — applying knowledge in novel contexts — is the Bjorkian definition of learning. Schema formation produces representations that generalize across surface variations. Borrowed certainty produces surface representations tied to the specific context of the AI explanation.

Y₃ = 0.4 · ρ_near + 0.6 · ρ_far, with far transfer weighted more heavily as the stronger signal of schema formation.
Genuine: the near-far gap is small. Schema enables generalization across surface variations.
Borrowed: large positive (ρ_near − ρ_far). Surface representation without schema.
Infrastructure: deliberately designed transfer problem sets with near and far transfer items. Requires domain expertise in item design.
Y₄ Uncertainty Calibration (UC)

Genuine learning produces calibrated uncertainty — the student learns not just what is correct but what they know and do not know. Borrowed certainty produces systematic overconfidence — the student inherits the AI's confidence distribution without the knowledge base that would justify it.

Y₄ = 1 − [Σᵢ (cᵢ − rᵢ)²] / B_ref, where cᵢ = expressed confidence, rᵢ = correctness, and B_ref = a reference Brier score. The directional component Y₄_dir is the signed mean of (cᵢ − rᵢ); positive values indicate overconfidence.
Genuine: confidence tracks actual performance; Y₄_dir ≈ 0.
Borrowed: positive Y₄_dir. Inherited confidence exceeds genuine performance.
Infrastructure: confidence elicitation embedded in regular assessment through minor LMS quiz modifications.
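Both the Brier-based score and the directional overconfidence signal are simple to compute. This sketch assumes a mean Brier score and B_ref = 0.25 (the score of an uninformative always-0.5 forecaster); both choices, and the reading of Y₄_dir as a signed mean gap, are illustrative assumptions:

```python
def uncertainty_calibration(confidence, correct, brier_ref=0.25):
    """Y4 sketch: 1 minus the mean Brier score relative to a reference.

    confidence: expressed probabilities c_i in [0, 1]
    correct: outcomes r_i in {0, 1}
    brier_ref: reference Brier score B_ref (0.25 = always-0.5 forecaster;
    an illustrative default, not the paper's specification)."""
    n = len(confidence)
    brier = sum((c - r) ** 2 for c, r in zip(confidence, correct)) / n
    return 1.0 - brier / brier_ref

def directional_miscalibration(confidence, correct):
    """One plausible reading of Y4_dir: mean signed gap c_i - r_i.
    Positive values indicate systematic overconfidence."""
    return sum(c - r for c, r in zip(confidence, correct)) / len(confidence)
```

On illustrative data, a calibrated student (e.g. confidences [0.9, 0.8, 0.2, 0.8] against outcomes [1, 1, 0, 1]) scores well on both measures, while uniformly high confidence over mixed outcomes produces a low Y₄ and a large positive Y₄_dir.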
Y₅ Social Knowledge Texture (SKT)

Genuine encounter with material leaves a characteristic texture in social and discursive contexts — specific confusions, particular connections to prior knowledge, questions that arose from genuine engagement. This texture cannot be manufactured without having had the experience.

Y₅ = (φ₁ + φ₂ + φ₃ + φ₄) / 8, where φ₁ = personal encounter, φ₂ = below-surface engagement, φ₃ = productive uncertainty, φ₄ = real-time development.
Genuine: personal encounter markers, below-surface engagement, genuine uncertainty, position changes.
Borrowed: generic talking points, surface-level engagement, no traceable personal confusion.
Infrastructure: structured discussion coding with trained raters. The most resource-intensive component; the most diagnostically powerful for social dimensions.
Y₆ Retrieval Strength Decay Signature (RSDS)

The spacing effect is the benchmark of genuine learning. Borrowed certainty has no storage strength to retrieve. Performance decays monotonically and the spacing effect is absent.

Y₆ = ρ_treatment(t₃) − ρ_control(t₃), measured at t₁ = immediate, t₂ = short delay, t₃ = long delay.
Genuine: positive Y₆. Spaced retrieval benefits persist to the long delay.
Borrowed: Y₆ ≈ 0. Both groups show similar decay; neither has storage strength to retrieve.
Infrastructure: spaced retrieval quizzes on previously covered material, embedded in course design.
Y₇ Scaffolding Response Curve (SRC)

The Zone of Proximal Development is a structural property of a genuinely developing mental model. A student with genuine partial understanding has a ZPD — a region of near-competence that targeted scaffolding can activate. Borrowed certainty has no ZPD.

Y₇ = [ρ(h₁) − ρ(h₀)] / [ρ(h₂) − ρ(h₀)], where h₀ = no hint, h₁ = partial structural hint, h₂ = full structural hint.
Genuine: elevated Y₇. The partial hint is nearly as useful as the full hint; underlying structure exists.
Borrowed: low Y₇. Only the full hint produces improvement; there is no partial structure to connect to.
Infrastructure: structured hint experiments in formative assessment or tutoring contexts. Implementable at scale through adaptive assessment platforms.
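The Y₇ ratio is a direct computation over performance rates under the three hint conditions. A minimal sketch, with an assumed guard for the degenerate case where the full hint produces no lift (the paper does not specify this edge case):

```python
def scaffolding_response_curve(p_no_hint, p_partial, p_full):
    """Y7 sketch: (p(h1) - p(h0)) / (p(h2) - p(h0)).

    p_no_hint, p_partial, p_full: performance rates under no hint,
    partial structural hint, and full structural hint.
    High values: the partial hint recovers most of the full hint's
    benefit, indicating latent structure (a ZPD).
    Low values: only the full hint helps."""
    lift_full = p_full - p_no_hint
    if lift_full <= 0:
        return 0.0  # full hint gave no lift; ratio undefined, treat as no signal
    return (p_partial - p_no_hint) / lift_full
```

For example, rates of 0.30 / 0.65 / 0.70 across the three conditions yield a high Y₇, while 0.30 / 0.35 / 0.70 yields a low one.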
5. The Ensemble Architecture

5.1 Why an Ensemble Rather Than a Single Model

The seven GLP components have different statistical structures, different data types, and different failure modes. More importantly the components fail — can be gamed — in different ways. Y₁ is gamed by artificially distributing time. Y₃ is gamed by seeking transfer examples in the AI explanation. Y₅ is gamed by preparing discussion points. These are different strategies requiring different effort.

A student gaming all seven components simultaneously is performing work that approaches the cost of genuine engagement — at which point the gaming has become indistinguishable from learning in the only sense that matters.

5.2 The Three-Layer Architecture

Layer 1 — Component models. Seven base models, one per component, each using the algorithm suited to that component's data structure. Y₁ temporal: survival analysis. Y₂ error trajectory: graph-based sequence model. Y₃ transfer: gradient boosting. Y₄ calibration: isotonic regression. Y₅ social texture: NLP model on coded discussion data. Y₆ retention: mixed effects longitudinal model. Y₇ scaffolding: causal inference model. Each outputs P(E genuine | Yᵢ) — a probability, not a classification.

Layer 2 — Tier-conditioned combination. Seven combination models, one per cognitive tier, that learn the optimal weighting of Layer 1 outputs conditional on which tier the learning activity is designed to develop. At Tier 5 causal reasoning, Y₃ transfer and Y₇ scaffolding receive highest weights. At Tier 3 social cognition, Y₅ social texture dominates.

Layer 3 — Meta-model. A final combination model produces the GLP score with credible interval. The meta-model handles missing components gracefully — when Y₅ is unavailable, it routes to remaining components and widens the credible interval to reflect reduced information.

GLP(S, C) ∈ [0, 1] with credible interval [GLP_lower, GLP_upper]
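The layered combination can be sketched minimally. The code below stands in for the learned Layer 2/3 models with fixed illustrative tier weights and a heuristic rule that widens the interval as components go missing; every name, weight, and constant is an assumption for illustration, not the paper's specification:

```python
def glp_score(component_probs, tier_weights):
    """Layer 2/3 sketch: tier-conditioned weighted combination of the
    component probabilities P(E genuine | Y_i), with a crude credible
    interval that widens when components are missing.

    component_probs: dict like {"Y1": 0.8, ..., "Y7": 0.6}; absent keys
    model missing components. tier_weights: nonnegative weights for the
    active tier (illustrative stand-in for a learned combination model)."""
    available = {k: p for k, p in component_probs.items() if k in tier_weights}
    total_w = sum(tier_weights[k] for k in available)
    if total_w == 0:
        return 0.5, (0.0, 1.0)  # no evidence: maximal uncertainty
    score = sum(tier_weights[k] * p for k, p in available.items()) / total_w
    # Heuristic: widen the interval as the share of missing weight grows.
    missing_frac = 1.0 - total_w / sum(tier_weights.values())
    half_width = 0.05 + 0.45 * missing_frac
    return score, (max(0.0, score - half_width), min(1.0, score + half_width))

# Hypothetical Tier 5 weighting: transfer (Y3) and scaffolding (Y7) dominate.
tier5_weights = {"Y1": 1, "Y2": 1, "Y3": 2, "Y4": 1, "Y5": 1, "Y6": 1, "Y7": 2}
```

Dropping Y₅ (for instance) leaves the point score computable from the remaining components while the interval widens, mirroring the graceful-degradation behavior described for the meta-model.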

5.3 The Instructor as Meta-Model

The ensemble produces a formally specified GLP score. The instructor receives this score alongside artifact quality and combines them into an overall assessment judgment. This combination is itself an ensemble — the instructor's combining function is the meta-model at the highest level.

A medical educator assessing clinical reasoning should weight process evidence heavily because borrowed clinical certainty is dangerous. A writing instructor whose goal is a publishable essay may weight artifact quality heavily because the essay is the deliverable. A mathematics educator should weight transfer performance heavily because transfer is the definition of mathematical understanding.

The GLP framework gives every instructor a formally specified, empirically grounded second evidence stream. How much weight they give it is their professional judgment. The paper provides the stream. The weighting belongs to the educator.

6. Tier Calibration

Each tier of the Irreducibly Human framework engages distinct cognitive and neurological processes. The expected friction signature for genuine engagement differs accordingly. GLP measurement without tier calibration conflates fundamentally different kinds of cognitive work.

Tier | Name | Primary Friction Signal | Notes
1 | Pattern | Y₃ | Least diagnostically useful — pattern recognition is precisely what AI performs most effectively. Focus on movement to higher tiers.
2 | Embodied | Y₇, Y₁ | Standard components must be adapted to physical performance contexts. Friction traces are proprioceptive and motor.
3 | Social | Y₅ | The social texture of genuine Tier 3 engagement is particularly resistant to manufacturing — it requires the phenomenology of genuine surprise or change from actual contact with another person's perspective.
4 | Metacognitive | Y₄, Y₂ | Calibration that develops over the course — the student becoming better at knowing what they know — is the characteristic trajectory of genuine Tier 4 development.
5 | Causal | Y₃, Y₇ | Causal understanding is precisely what enables transfer across surface variations where pattern matching fails.
6 | Collective | Group-level | Individual GLP measures are structurally inadequate. Group-level analysis is required — contribution patterns, position changes, emergent outcomes exceeding individual contributions.
7 | Wisdom | Contextual | Standard assessment is almost entirely inappropriate. Friction traces appear in decision histories, the specificity of uncertainty expression, and the phenomenology of genuine ethical difficulty.
7. Validation

7.1 Labeled Corpus Construction

The confirmed genuine engagement class draws from students with documented engagement trajectories — LMS data showing characteristic temporal patterns, formative assessment sequences showing coherent error development, discussion records showing personal encounter texture. The confirmation criterion is convergent evidence across multiple components, not any single indicator.

The confirmed borrowed certainty class draws from documented cases including students who submitted AI-generated work later acknowledged in academic integrity proceedings, experimental conditions where students were explicitly instructed to use AI without engaging, and cases where the combination of high artifact quality and near-zero performance on unannounced transfer assessments is sufficiently extreme to constitute strong inferential evidence.

7.2 Calibration Assessment

A well-calibrated probabilistic model produces probability estimates that match empirical frequencies. Among cases scored at GLP = 0.7, approximately 70 percent should be confirmed genuine engagement. Miscalibration in the ambiguous range — cases scoring between 0.3 and 0.7 — is expected and is not a defect. It accurately represents genuine uncertainty about cases where some friction traces are present and others are absent.
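The calibration check described here amounts to binning scored cases and comparing the mean predicted probability in each bin with the empirical genuine-engagement rate. A minimal sketch (bin count and return shape are illustrative choices):

```python
def calibration_by_bin(scores, labels, n_bins=5):
    """Bin GLP scores and compare mean predicted probability with the
    empirical genuine-engagement frequency in each bin.

    scores: GLP scores in [0, 1]; labels: 1 = confirmed genuine, 0 = not.
    Returns (bin_mean_score, empirical_rate, count) per non-empty bin;
    a well-calibrated model has bin_mean_score close to empirical_rate."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # clamp s = 1.0 into top bin
        bins[idx].append((s, y))
    out = []
    for b in bins:
        if b:
            mean_s = sum(s for s, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            out.append((mean_s, rate, len(b)))
    return out
```

On the paper's example, cases scored near 0.7 should fall in a bin whose empirical rate is near 70 percent; large gaps in the extreme bins would indicate miscalibration worth correcting, while spread in the 0.3-0.7 bins reflects genuinely ambiguous cases.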

7.3 The Information Gain Test

The central empirical claim is that process measurement adds independent information about genuine learning beyond what the artifact provides. Let A denote artifact quality score and G denote GLP score. The information gain from adding process measurement:

ΔI = I(θ; A, G) − I(θ; A), where θ = confirmed genuine learning status from the labeled corpus.

Positive ΔI confirms that the process evidence stream is not redundant with the artifact evidence stream — it provides independent information about genuine learning.
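With discretized scores (continuous A and G would first be binned), ΔI can be estimated with plug-in mutual information over the labeled corpus. A minimal sketch; the plug-in estimator and the discretization are assumptions, and real use would want bias correction at small sample sizes:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete observations."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def information_gain(theta, artifact, glp):
    """Delta I = I(theta; A, G) - I(theta; A) over discretized scores.
    Positive values mean the process stream adds information about
    genuine learning beyond what the artifact provides."""
    joint_ag = list(zip(artifact, glp))  # treat (A, G) pairs as one variable
    return mutual_information(theta, joint_ag) - mutual_information(theta, artifact)
```

In the extreme illustrative case where artifact quality is uniform (AI makes every artifact "high") while the GLP score tracks confirmed status, the artifact carries zero bits and the process stream supplies the full information about θ.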

8. The Arms Race Problem and Why This Framework Is Different

Every AI detection methodology faces the objection that adversaries will adapt. This objection applies with different force to artifact-based detection versus process-based measurement.

For artifact-based detection the objection is decisive. The arms race between generation and detection has a structural winner: generation technology improves continuously, detection technology is always calibrated to past outputs. The window closes.

For process-based measurement the objection is substantially weaker for two reasons.

First, manufacturing convincing friction traces across all seven components simultaneously without performing the underlying cognitive work is essentially equivalent in cost to performing the underlying cognitive work. A student who spends genuine time on difficult material, makes and corrects errors in a conceptually coherent sequence, demonstrates transfer across novel contexts, maintains calibrated uncertainty, engages with genuine texture in discussion, shows the spacing effect in longitudinal assessment, and responds appropriately to scaffolding — has learned the material. At that point the gaming has become indistinguishable from learning in the only sense that matters.

Second, the framework is not trying to detect AI use. It is trying to measure genuine learning directly. An AI detector fails when AI outputs become indistinguishable from human outputs. A learning measure fails when borrowed certainty becomes indistinguishable from genuine learning — which requires the borrowed certainty to produce the same neurobiological events, the same schema formation, the same durable transfer. That is not AI defeating assessment. That is learning.

9. Discussion

9.1 What This Paper Is Not Claiming

The paper is not claiming that artifacts are worthless. A well-argued essay still demonstrates something. The artifact has not become zero evidence. It has become insufficient as the sole evidence in a way it was not before.

The paper is not claiming that all instructors must adopt the full GLP framework. The framework specifies what is possible. Practical implementation depends on institutional infrastructure, course design, and specific learning objectives.

The paper is not claiming that process measurement is always more informative than artifact measurement. In some contexts the artifact remains the most informative available evidence. The paper claims the process adds independent information — not that it always adds more information than the artifact.

The paper is not claiming that the GLP framework can replace the instructor's judgment. The instructor is the meta-model. The framework provides inputs. The weighting is a professional judgment that requires knowledge of the student, the course, the domain, and the stakes that no formal framework can substitute for.

9.2 The Institutional Design Implication

If process friction traces are independent evidence of genuine learning, then institutional assessment infrastructure should be designed to make those traces observable. Longitudinal process documentation — portfolios that capture the history of engagement — becomes primary rather than supplementary evidence. Embedded formative assessment — frequent low-stakes measurement of the GLP components — becomes the primary data source. Developmental trajectory as credential — evidence of growth from novice toward expertise — becomes the primary evidence that credentialing decisions rest on.

The argument of this paper is that the AI decoupling makes these urgently necessary rather than merely desirable — that the institutional cost of not building process-observable assessment infrastructure is now the progressive obsolescence of artifact-based credentialing.

9.3 The Ethics of Process Observation

Process observation for the purpose of supporting learning is categorically different from process observation for the purpose of surveillance. The distinction is in what the data is used for and who controls it.

Process measurement that is used to give students better feedback, to identify students who are struggling before they fail, and to inform instructional design is educationally defensible. Process measurement that is used to penalize students for AI use they may not have recognized as problematic, to build permanent behavioral records, or to make high-stakes credentialing decisions without human review is not. The GLP framework is designed to support the first use. Institutional implementation must actively guard against the second.

10. Conclusion

The artifact has been decoupled from the process that used to produce it. The decoupling is irreversible, accelerating, and domain-general. Any assessment infrastructure built solely on artifact analysis has a shrinking evidentiary lifespan.

Process friction traces — the behavioral signatures that genuine human cognitive engagement leaves in observable data — are an independent evidence stream that compensates for what the artifact can no longer reliably tell us alone. They exist because genuine learning is a biological event that produces behavioral consequences. They are measurable through learning management system data, formative assessment sequences, structured observation, and longitudinal performance tracking.

The Genuine Learning Probability framework formalizes this evidence stream as a probabilistic, tier-calibrated, ensemble-based measurement methodology. It does not replace artifact assessment. It gives artifact assessment a partner. The instructor determines the weighting.

The crisis of evidence facing educational institutions is not a technical problem requiring a better AI detector. It is an epistemological problem requiring a new evidence infrastructure — one built on the process of learning rather than its products.

The artifact used to be proof of the process. It no longer is. Now we must measure the struggle itself.

The GLP framework implementation, labeled corpus, and assessment design templates are published openly at irreduciblyhuman.com. The methodology is not a secret.

Acknowledgments

This research was conducted through Humanitarians AI (501(c)(3)). The Irreducibly Human curriculum series provided the seven-tier taxonomy of human cognitive capacities that grounds the tier calibration methodology. The Frictional.xyz theoretical framework, currently in development, will provide the unified cross-domain treatment of friction traces as evidence of authentic human cognitive engagement of which this paper is one instantiation.

References

  • Bjork, R.A., and Bjork, E.L. (1992). A new theory of disuse and an old theory of stimulus fluctuation. In A. Healy, S. Kosslyn, and R. Shiffrin (Eds.), From Learning Processes to Cognitive Processes: Essays in Honor of William K. Estes (Vol. 2, pp. 35–67). Erlbaum.
  • Bjork, R.A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe and A. Shimamura (Eds.), Metacognition: Knowing about Knowing (pp. 185–205). MIT Press.
  • Brown, N.B. (2026). Measuring the Friction: A Probabilistic Framework for Detecting Graph Contamination in Music Streaming Platforms. Musinique Research Trilogy, Paper III.
  • Craik, F.I.M., and Lockhart, R.S. (1972). Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11(6), 671–684.
  • Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285.
  • Vygotsky, L.S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.