The Mirror Test: What PredictabilityBERT Actually Reveals About Who Gets to Sound "Proficient"
When Measuring Conformity to Wikipedia Becomes "Proficiency Assessment"
The number is stark: 0.80. That’s how closely BERT—a neural network trained on Wikipedia—agrees with human experts rating spoken English proficiency. Not “somewhat aligned.” Not “moderately correlated.” Point-eight-zero. For context, that’s stronger than the correlation between SAT scores and first-year college GPA, stronger than the correlation between height and weight in adults, and roughly the correlation between what one identical twin scores on an IQ test and what the other scores.
When Langdon Holmes reports this finding in “The Hidden Link Between Language Learners and Language Models,” he frames it as validation. His measure, PredictabilityBERT, captures something real about language proficiency because it mirrors expert judgment. The logic appears sound: if the best human raters and the machine agree, both must be tracking truth.
But correlation is symmetric. Agreement alone can’t tell us whether both parties are tracking truth or merely tracking the same standard.
What if expert raters have spent years absorbing the same distributional patterns BERT learned from Wikipedia and books? What if institutional assessment has crystallized around specific language norms—formal, prestige, overwhelmingly reflecting published written English—to such a degree that human judgment and machine prediction converge not because both measure proficiency, but because both measure conformity to the same institutional standard?
This reading fits the data just as well. And it reframes everything.
What the Machine Actually Learned
BERT wasn’t trained on “English.” It was trained on Wikipedia and books—a specific corpus reflecting specific communities of practice. Wikipedia editors. Published authors. Academic writers. Journalists. The language in that corpus isn’t neutral or universal. It’s the language of people with access to publication, with education in Standard English conventions, with cultural proximity to institutions that define “correct” usage.
When BERT predicts “strongly” for “I ___ believe,” it’s not recognizing some Platonic ideal of English collocation. It’s reproducing the statistical patterns in texts written by people whose language already carried institutional validation. “Strongly believe” appears frequently in that corpus because people who write encyclopedia entries and publish books use that phrase. People who don’t—who might say “really believe” or “totally believe” or construct belief claims differently based on their dialect or register—don’t appear in BERT’s training data at the same rates.
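To make that mechanism concrete, here is a minimal sketch of how such a predictability score could be computed with an off-the-shelf masked language model. It is not Holmes’s implementation; the model name, sentence, and candidate words are illustrative.

```python
# A minimal sketch, not Holmes's implementation, of scoring how predictable
# a word is to BERT at a masked position. Model, sentence, and candidates
# are illustrative choices.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def predictability(sentence_with_mask: str, candidate: str) -> float:
    """Probability BERT assigns to `candidate` at the [MASK] position."""
    inputs = tokenizer(sentence_with_mask, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    return probs[0, tokenizer.convert_tokens_to_ids(candidate)].item()

# "strongly" tends to score high because it saturates BERT's training corpus;
# "totally" can express the same stance but is rarer in that register.
for word in ["strongly", "really", "totally"]:
    print(word, predictability("I [MASK] believe that this is true.", word))
```

Whatever the exact scoring scheme, the probabilities come from one place: the distribution of published, edited text the model was trained on.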
PredictabilityBERT thus measures proximity to a particular variety: formal, written, institutionally validated English. A student from a community where African American Vernacular English is spoken will score lower not because their language is less proficient but because their collocation patterns don’t match Wikipedia’s. A creative writer who deploys unconventional phrasing for rhetorical effect will score lower not because they lack competence but because BERT can’t distinguish skilled deviation from error.
The measure doesn’t ask: Can this student communicate effectively? It asks: Does this student sound like Wikipedia?
Holmes acknowledges this obliquely in the limitations section: “Any measure of ‘conventional’ language use depends on what community’s conventions we’re referencing.” Then he moves on. But this isn’t a caveat to note and set aside. It’s the entire game.
The Standardized Coefficient That Doesn’t Make Sense
Regression analysts will notice something peculiar in the results. PredictabilityBERT has a standardized coefficient of 1.142 in the multivariate model predicting TOEFL scores. For a standardized coefficient, this is strange. Standardization puts all predictors on the same scale—mean zero, standard deviation one. A coefficient above 1.0 means a one-standard-deviation increase in predictability corresponds, holding the other predictors constant, to more than a one-standard-deviation increase in proficiency score.
This can happen legitimately when one predictor captures variance that other predictors suppress. But it raises questions.
Look at what else is in the model: bigram frequency. PredictabilityBERT is partly derived from bigram patterns—that’s how BERT learns collocation probabilities. If both measures are in the model, how much multicollinearity exists? Are we counting the same variance twice with different labels?
More troubling: lemma bigram frequency has a negative coefficient (β = -0.098**). Higher bigram frequency predicts lower proficiency. Students who use more frequent word pairs score worse. This contradicts decades of research showing proficient learners rely on high-frequency multiword units. It contradicts the theoretical framework usage-based theory provides. Holmes offers no explanation.
Something in the model needs scrutiny. Either multicollinearity is distorting the coefficients, the measures capture overlapping constructs in ways the analysis doesn’t disentangle, or there’s a specification error. Without diagnostics—variance inflation factors, residual plots, sensitivity analyses testing alternative model structures—we can’t evaluate whether that impressive R² of 0.585 reflects construct validity or statistical artifact.
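Those diagnostics are not exotic. A sketch of the missing collinearity check, under assumed variable names for a hypothetical learner-level dataset, would look something like this:

```python
# A sketch of the multicollinearity check the paper does not report.
# The CSV file and column names are assumptions for illustration only.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("learner_measures.csv")  # hypothetical dataset
predictors = ["predictability_bert", "lemma_bigram_freq",
              "word_frequency", "lexical_diversity"]

X = sm.add_constant(df[predictors])
vif = pd.DataFrame({
    "predictor": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)
# VIFs well above 5-10 for predictability_bert and lemma_bigram_freq would
# suggest the coefficient above 1.0 and the sign flip are artifacts of
# overlapping predictors rather than evidence about proficiency.
```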
The paper presents these coefficients as evidence the measure captures something real. They might instead signal that something in the analysis has gone wrong.
The Non-Linearity That Invalidates the Statistics
Buried in the limitations section is an admission that should reshape the entire interpretation: predictability might relate non-linearly to proficiency. Beginners show high predictability because they rely on memorized chunks. Intermediates show lower predictability because they’re experimenting with language, making creative errors. Advanced learners show high predictability again because they’ve internalized conventional patterns.
If this is true, the statistical analysis is misspecified.
Correlation coefficients assume linear relationships. Regression models assume linear relationships. If the true relationship is U-shaped—high predictability at both ends of the proficiency spectrum for different reasons—then pooling all learners into a single correlation collapses meaningful variation. The r = 0.67-0.80 becomes an average across groups where the measure means different things.
This isn’t a “future research direction.” It’s a validity threat to the current findings. If predictability operates differently across proficiency levels, then:
Simple correlations mislead by averaging incompatible groups
The regression model pools students it shouldn’t pool
The interpretation “more proficient equals more predictable” is wrong
The appropriate analysis would stratify by proficiency level, test for interaction effects, and model non-linearity explicitly. The appropriate interpretation would acknowledge that whatever the measure captures, it captures different constructs at different stages of development.
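What that analysis might look like, with dataset and variable names that are assumptions rather than the paper’s:

```python
# A sketch of testing the U-shape the limitations section raises: compare a
# linear model to one with a quadratic term, then inspect correlations within
# proficiency bands. Dataset and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("learner_measures.csv")  # hypothetical dataset

linear = smf.ols("proficiency ~ predictability_bert", data=df).fit()
quadratic = smf.ols(
    "proficiency ~ predictability_bert + I(predictability_bert ** 2)", data=df
).fit()
print("AIC linear:", linear.aic, " AIC quadratic:", quadratic.aic)
print(quadratic.params)  # a significant quadratic term signals misspecification

# Stratified view: does the relationship change sign or size across bands?
df["band"] = pd.qcut(df["proficiency"], 3, labels=["low", "mid", "high"])
for band, group in df.groupby("band", observed=True):
    r = group["predictability_bert"].corr(group["proficiency"])
    print(band, round(r, 2))
```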
Holmes notes the non-linearity possibility, then proceeds as if the linear statistics are definitive. This is intellectually incoherent. You can’t report linear correlations, acknowledge they might not apply, and then base your educational applications on those correlations.
What Expert Raters Actually Learned
The essay presents the 0.67-0.80 correlation between PredictabilityBERT and expert ratings as validation: the measure captures what experts perceive as proficiency. But consider the reverse interpretation.
Expert raters don’t evaluate language in a vacuum. They’re trained. They read rubrics. They calibrate against exemplars. They discuss rating criteria until achieving inter-rater reliability. This training process teaches them—implicitly or explicitly—what institutional assessment values. They internalize standards that themselves reflect historical decisions about what counts as “good” academic English.
Those standards are grounded in the same texts BERT trained on. Writing handbooks cite published authors. Assessment rubrics reference “conventional usage.” Exemplar essays that score highly tend toward formal, edited prose—the register Wikipedia approximates. Through years of rating student writing against institutional norms, expert raters have absorbed distributional patterns similar to BERT’s.
The 0.80 correlation thus might not validate PredictabilityBERT as a proficiency measure. It might reveal that institutional assessment has converged on valuing the same distributional patterns Wikipedia encodes. Both the human raters and the machine have learned the same implicit standard.
This doesn’t make expert ratings “wrong.” Institutional assessment serves real gatekeeping functions—TOEFL scores determine university admissions, professional certifications, immigration eligibility. There are legitimate reasons to assess whether students can produce the language variety that institutions expect.
But we should call it what it is: assessing conformity to institutional language norms. Not assessing “proficiency” in some universal sense. Not measuring communicative competence. Not evaluating whether someone can achieve their communicative goals in diverse contexts. Measuring whether they sound like Wikipedia.
The Spoken/Written Asymmetry Nobody Explains
One of the most interesting findings gets the least scrutiny: PredictabilityBERT correlates more strongly with spoken proficiency (r = 0.80) than written (r = 0.67).
Holmes’s explanation: “When we speak, we rely more heavily on automatized, formulaic language patterns because we don’t have time to plan and revise.”
This sounds plausible. But it’s a just-so story—post-hoc reasoning that fits the data without independent support. Alternative explanations are equally compatible:
Transcription effects: Spoken data was transcribed before analysis. Transcription removes disfluencies, false starts, repairs, overlapping speech—all the messiness of actual spoken language. This “cleaned” speech might be artificially predictable compared to authentic spoken production. Written data preserves every cross-out, every revision visible in edit history, every awkward phrase the writer didn’t catch. The comparison isn’t spoken vs. written language. It’s transcribed-and-cleaned spoken vs. as-written texts.
Rating criteria differences: Judges rating spoken vs. written proficiency apply different rubrics. Spoken proficiency assessments might weight fluency, collocation knowledge, and automaticity more heavily. Written assessments might weight grammatical accuracy, lexical range, and organizational coherence. If predictability aligns more closely with spoken rubric criteria, that would produce stronger correlation without requiring different cognitive processes.
Register variation: The spoken corpus might contain more conversational language, where conventions are stricter and creative deviation less valued. The written corpus might contain more academic prose, where unconventional phrasing (within bounds) can signal sophisticated thinking. Different register expectations would produce different predictability profiles.
The essay picks the explanation that supports its theoretical narrative—speech requires automatized patterns, proficiency means having automatized patterns, therefore spoken language shows the relationship more clearly. But this is circular reasoning disguised as explanation.
If time pressure were the mechanism, we’d predict that timed writing tasks (in-class essays, standardized test prompts) would show higher predictability correlations than untimed tasks (revised portfolios, research papers with drafting time). Holmes provides no such analysis. The data might not support the mechanism he proposes.
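That prediction is checkable. A sketch of the missing comparison, assuming a hypothetical dataset in which each writing sample is flagged as timed or untimed:

```python
# A sketch of the timed vs. untimed comparison the time-pressure explanation
# implies. File and column names are assumptions for illustration.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("writing_tasks.csv")  # hypothetical dataset

for timed, group in df.groupby("timed"):
    r, p = pearsonr(group["predictability_bert"], group["writing_score"])
    print(f"timed={timed}: r={r:.2f} (p={p:.3f})")
# If time pressure is the mechanism, the timed condition should show the
# larger correlation. Absent this analysis, the explanation is untested.
```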
What TOEFL Correlation Actually Validates
The strongest empirical claim is that PredictabilityBERT explains 59% of variance in TOEFL writing scores, outperforming traditional lexical measures. This is framed as validation.
But TOEFL measures academic English for institutional gatekeeping. It assesses whether international students can produce the discourse style U.S. universities expect: formal register, explicit transitions, conventional argumentation patterns, edited prose. These are legitimate assessment targets for that specific purpose. They’re not coextensive with “proficiency” writ large.
A journalist writing compelling investigative reports might score poorly on TOEFL metrics. A novelist creating distinctive narrative voice might score poorly. A community organizer writing persuasive calls to action might score poorly. These are all forms of proficient language use deployed for different communicative purposes in different discourse communities.
When PredictabilityBERT predicts TOEFL scores, it demonstrates that Wikipedia/books corpus patterns align with TOEFL’s institutional norms. This is useful for understanding what TOEFL values. It doesn’t tell us about language proficiency beyond that institutional context.
The validation is circular: the measure aligns with an assessment that itself embodies specific institutional values about what counts as “good” academic English. If TOEFL has cultural biases—and decades of applied linguistics research suggests it does—PredictabilityBERT inherits those biases. If TOEFL rewards conformity to particular discourse norms while penalizing equally effective alternatives, the automated measure reproduces that reward structure.
High correlation with TOEFL validates that the measure captures whatever TOEFL captures. It doesn’t validate that TOEFL captures proficiency.
The Educational Applications That Assume Too Much
Holmes proposes four practical applications: rethinking vocabulary teaching, automated feedback tools, oral proficiency assessment, and holistic evaluation. These sound reasonable. But each rests on assumptions the research hasn’t validated.
Rethinking vocabulary teaching assumes students benefit from explicit instruction in conventional collocations. Usage-based theory—the very framework Holmes invokes—suggests patterns are best learned through exposure and meaningful use, not explicit metalinguistic teaching. If proficiency comes from internalizing distributional patterns through authentic language experience, teaching might be less effective than providing rich, varied input. No evidence is presented that predictability-focused instruction improves learning outcomes.
Automated feedback assumes students benefit from knowing their phrasing is unconventional. But this could discourage creative language use, reinforce institutional language ideologies, disadvantage students from non-mainstream communities, or reduce writing to pattern matching rather than meaning construction. A creative writer who uses “powerfully” instead of “strongly” for rhetorical effect shouldn’t receive feedback that their word choice is “wrong” just because it’s less predictable. Without controlled studies showing such feedback improves performance on external criteria—not just increasing predictability scores—the application is premature.
Oral proficiency assessment makes sense if the goal is measuring automaticity and collocation knowledge specifically. But reducing spoken proficiency to pattern matching ignores other crucial dimensions: pronunciation, pragmatic appropriateness, interactional competence, repair strategies, discourse management. The measure might complement existing assessments. It shouldn’t replace them.
Holistic assessment incorporating multiple dimensions is uncontroversial—all assessment experts already recognize proficiency is multidimensional. But the claim that predictability represents a dimension distinct from existing measures needs scrutiny. If expert raters already weight conventionality (correlation = 0.67-0.80), the automated measure adds efficiency, not validity. It quantifies what raters already value. It doesn’t capture something they’re missing.
The applications are presented as implications of the research. They’re actually aspirations requiring their own validation studies.
What Holmes Actually Discovered
Stripped of interpretive overreach, here’s what the research demonstrates:
A measure can be constructed that quantifies alignment between student language and BERT’s probability distributions learned from Wikipedia/books. This measure correlates 0.67-0.80 with expert ratings in institutional assessment contexts. In multivariate models, it explains variance beyond frequency-based metrics. The correlation is stronger for spoken than written language for reasons not yet explained.
These are legitimate contributions. They tell us that expert raters value conventional patterns, that BERT’s training corpus overlaps with what experts value, that automated measurement of corpus-alignment is feasible, and that the relationship differs across modalities.
What the research does not demonstrate:
That proficiency consists of pattern recognition
That human cognition operates through BERT-like mechanisms
That the measure assesses communicative competence rather than institutional conformity
That predictability-based instruction improves learning
That the measure is valid across registers, dialects, and communities
That correlation with institutional assessments validates the construct as “proficiency”
Holmes conflates these repeatedly. The shift from “correlates with expert ratings” to “measures proficiency” to “reveals something essential about language learning” happens without argument, carried by the authority of high correlations and impressive statistics.
The Question Nobody Asked
Who benefits from defining proficiency as conformity to Wikipedia patterns?
Not students from non-mainstream language communities whose dialects have different collocation norms. Not students whose cultural backgrounds value different rhetorical structures. Not students whose cognitive styles favor creative over conventional expression. Not students whose communicative goals require registers different from formal written prose.
The measure benefits students who have extensive exposure to prestige written English, who come from communities where Standard English is modeled, who have been trained in institutional discourse norms, who are comfortable reproducing conventional patterns.
This isn’t value-neutral assessment. It’s linguistic gatekeeping with better math.
Holmes presents the tool as democratizing (automated feedback for all students) and as scientifically rigorous (strong correlations with expert judgment). But scientific rigor in service of unexamined ideology is not neutrality. It’s ideology with empirical validation.
If Wikipedia patterns are the target, students must approximate Wikipedia style. Students whose language doesn’t approximate that style—because of dialect variation, creative innovation, register mismatch, or cultural difference—will be assessed as less proficient regardless of their communicative competence. The measure doesn’t ask whether students can achieve their communicative goals. It asks whether they sound like encyclopedia editors.
What the Correlations Actually Reveal
The deepest finding is one the essay never states explicitly: institutional assessment of language proficiency has crystallized around specific distributional patterns to such a degree that human expert judgment and machine statistical approximation now converge at r = 0.80.
This is profound. Not as validation of an assessment tool. As revelation of how institutions reproduce their own norms.
Expert raters have internalized patterns reflected in Wikipedia and books. TOEFL rewards alignment with those patterns at 59% explained variance. PredictabilityBERT quantifies that alignment automatically. The entire system—human raters, standardized tests, automated measures—has achieved remarkable consistency in valuing one particular variety of English: formal, written, institutionally validated.
This could be the starting point for critical assessment research. What ideologies do these patterns reflect? Which student populations are systematically advantaged or disadvantaged? How might assessment evolve to value linguistic diversity? What happens when the patterns in BERT’s training corpus shift as Wikipedia’s editor demographics change?
Instead, the essay takes institutional norms as given and builds a tool to measure alignment more efficiently. It automates without interrogating. It achieves precision without examining what precisely is being measured.
The research is rigorous within its assumptions. The assumptions are what need scrutiny.
When the machine becomes the mirror, we must ask: What exactly are we seeing reflected? BERT doesn’t show us language proficiency. It shows us which language patterns Wikipedia encodes. Expert raters don’t assess universal competence. They assess alignment with institutional expectations that themselves have history, culture, and politics.
The 0.80 correlation reveals consensus. What it doesn’t reveal is whether that consensus tracks something real about language ability or something real about how institutions define acceptability.
This is the research Holmes actually conducted. Not a study of proficiency. A study of how proficiency gets constructed through institutional assessment practices. Not a discovery about language learning. A discovery about language gatekeeping.
The difference matters. Because once we recognize that automated measures don’t assess proficiency but assess institutional conformity, we can ask better questions. Not “How can we measure proficiency more efficiently?” but “Whose language patterns are we valorizing and why?” Not “What does BERT reveal about learning?” but “What does BERT’s correlation with expert ratings reveal about what experts have been trained to value?”
The answers might be more interesting—and more honest—than the ones the essay provides.
Tags: PredictabilityBERT assessment critique, institutional language ideology, BERT corpus bias, second language proficiency validity, linguistic gatekeeping mechanisms
