When a Model Fails Across Populations, It Was Never Learning the Disease

In 2026 a group led by Rasmussen ran a clean version of an experiment the field mostly avoids running, and the result should change how you read any brain-model accuracy number.

They took five Parkinson’s disease EEG cohorts, each from a different clinical or research site in the United States: New Mexico, Iowa, San Diego, Portland, and South Dakota. Then they asked the question single-site papers never ask. If you train a detector on some of these sites and test it on the others, does what it learned actually transfer? They enumerated the ways to train on some cohorts and test on the rest, seventy-five directional evaluations in all, with a single model architecture held fixed so that any difference in the results came from the data and not from the model.
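The count of seventy-five is worth checking by hand. What follows is a minimal sketch, not the authors' code: it assumes that every non-empty proper subset of the five cohorts serves as a training set, and that each cohort left out is scored as its own test target. The scheme is my reading of "train on some, test on the rest". The function name and cohort labels are mine.

```python
from itertools import combinations

COHORTS = ["New Mexico", "Iowa", "San Diego", "Portland", "South Dakota"]


def directional_evaluations(cohorts):
    """Yield (training_cohorts, test_cohort) pairs.

    Assumed scheme: every non-empty proper subset of the cohorts is a
    training set, and each cohort left out of it is a separate test target.
    """
    for k in range(1, len(cohorts)):
        for train in combinations(cohorts, k):
            for test in cohorts:
                if test not in train:
                    yield train, test


evaluations = list(directional_evaluations(COHORTS))
print(len(evaluations))  # 75
```

Under that assumption the arithmetic is 5×4 + 10×3 + 10×2 + 5×1 = 75, which matches the figure above. Other enumeration schemes would give other totals, so the match is suggestive rather than proof that this is the protocol the paper used.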

Two findings do the work.

Transfer is asymmetric, and the direction is a clue

Cross-site transfer did not fail uniformly. It failed asymmetrically, and which way it failed depended on where a cohort sat in the geometry of the data rather than on anything about the disease.

One cohort, San Diego, occupied a structurally central position. Train on it and the model generalized well to the others, but train anywhere else and the model did poorly when tested on it. Another cohort, South Dakota, was the mirror image: nearly every model scored well when tested on it, yet a model trained on it generalized poorly everywhere else. Same disease, same kind of recording, opposite transfer behavior, set entirely by how each cohort sits relative to the rest. A model’s success was a fact about the arrangement of the datasets, not about Parkinson’s.
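One way to see this kind of asymmetry in your own results is to summarize a transfer matrix twice: once by row (how well a site's model does elsewhere) and once by column (how well other sites' models do on it). The sketch below assumes a nested dictionary of scores. The site names "A", "B", "C" and every number in `toy` are invented placeholders, not values from the study.

```python
from statistics import mean


def transfer_profile(scores):
    """Summarize a transfer matrix per site.

    scores[source][target] is the score of a model trained on `source`
    and tested on `target`. Returns {site: (as_source, as_target)}:
    the mean score the site's model earns on other sites, and the mean
    score other sites' models earn on it.
    """
    sites = list(scores)
    profile = {}
    for s in sites:
        as_source = mean(scores[s][t] for t in sites if t != s)
        as_target = mean(scores[t][s] for t in sites if t != s)
        profile[s] = (as_source, as_target)
    return profile


# Invented placeholder numbers, chosen only to show the two patterns.
toy = {
    "A": {"B": 0.80, "C": 0.85},  # good source...
    "B": {"A": 0.55, "C": 0.82},
    "C": {"A": 0.52, "B": 0.58},  # ...poor source, but an easy target
}

profile = transfer_profile(toy)
for site, (as_source, as_target) in profile.items():
    print(f"{site}: as source {as_source:.3f}, as target {as_target:.3f}")
```

In this toy matrix, "A" plays the San Diego role (strong as a source, hard as a target) and "C" plays the South Dakota role (weak as a source, easy as a target). A single averaged cross-site number would hide both.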

You can read the site off the model more clearly than the disease

That is suggestive. The next result settles it.

Take the trained model, freeze it, and pull out the internal representation it computes for each recording. Now train a plain classifier on those representations to predict two separate things: which site the recording came from, and whether the patient has the disease. Whatever that classifier can read off is information the representation actually encodes.

The classifier could identify the recording site with over ninety percent accuracy and an ROC-AUC above 0.97, the cohorts falling into compact, well-separated clusters. The site is written into the representation far more legibly than the diagnosis is. In the authors’ own words, a model trained across sites “can internalize cohort-specific structure, hardware fingerprints, preprocessing artifacts, or demographic confounds that overshadow disease-relevant neurophysiology.”
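The probing procedure itself is simple enough to sketch. This is an illustration under loud assumptions, not the paper's pipeline: the embeddings are synthetic three-number vectors in which I have deliberately made the site offset large and the disease offset small, and a nearest-centroid classifier stands in for whichever plain classifier the authors used. All names (`toy_embeddings`, `site_a`, and so on) are hypothetical.

```python
import random


def nearest_centroid_probe(train, test):
    """Fit one centroid per class on (vector, label) pairs; return test accuracy."""
    sums, counts = {}, {}
    for vec, label in train:
        total = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            total[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def predict(vec):
        # Assign the class whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
        )

    return sum(predict(vec) == label for vec, label in test) / len(test)


def toy_embeddings(rng, per_cell=100):
    """Synthetic stand-in for frozen embeddings: (vector, site, disease) rows.

    By construction the site shifts the first two coordinates strongly,
    while disease shifts the third only slightly.
    """
    site_offsets = {"site_a": (0.0, 0.0), "site_b": (6.0, 0.0), "site_c": (0.0, 6.0)}
    rows = []
    for site, (dx, dy) in site_offsets.items():
        for disease in (0, 1):
            for _ in range(per_cell):
                vec = [
                    dx + rng.gauss(0, 1),
                    dy + rng.gauss(0, 1),
                    0.5 * disease + rng.gauss(0, 1),
                ]
                rows.append((vec, site, disease))
    rng.shuffle(rows)
    return rows


rng = random.Random(0)
rows = toy_embeddings(rng)
half = len(rows) // 2
train_rows, test_rows = rows[:half], rows[half:]

# Same frozen vectors, two different questions asked of them.
site_acc = nearest_centroid_probe(
    [(v, s) for v, s, _ in train_rows], [(v, s) for v, s, _ in test_rows]
)
disease_acc = nearest_centroid_probe(
    [(v, d) for v, _, d in train_rows], [(v, d) for v, _, d in test_rows]
)
print(f"site probe accuracy:    {site_acc:.2f}")
print(f"disease probe accuracy: {disease_acc:.2f}")
```

The gap between the two printed numbers is baked in by how I generated the data, so it proves nothing about real EEG. The point is the shape of the test: the representation is held fixed, only the probe's target changes, and comparing the two accuracies tells you which variable the representation makes easier to read.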

Why this is a finding, not a bug

Put the two together and the argument closes.

Inside a single site, the disease and the recording context are perfectly confounded. Everyone diagnosed was recorded on the same amplifier, in the same clinic, under the same protocol, so a model can reach high accuracy by learning the amplifier and never touching the neurophysiology. Nothing inside that one dataset can tell you which it did. Cross-site testing is the only thing that pulls the two apart, and when the model falls over on a new site, that fall is the measurement. It is telling you the model learned the context, not the disease. As the paper puts it, high accuracy on one cohort “does not distinguish disease-relevant features from site-specific shortcuts.”
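A toy simulation makes the logic concrete. It rests on an assumption I am supplying, not one the paper spells out: at the training site some artifact feature happens to track the diagnosis, and at a new site it does not. The two features, the decision-stump learner, and all parameters below are hypothetical.

```python
import random


def make_site(rng, n, shortcut_tracks_label):
    """Return [(features, label)] with features = (signal, shortcut)."""
    rows = []
    for i in range(n):
        label = i % 2
        # Weak but genuine disease effect, present at every site.
        signal = 1.0 * label + rng.gauss(0, 1)
        if shortcut_tracks_label:
            # Hypothetical site artifact that lines up with the diagnosis.
            shortcut = 4.0 * label + rng.gauss(0, 0.5)
        else:
            # Same artifact, but unrelated to the diagnosis at this site.
            shortcut = 4.0 * rng.choice((0, 1)) + rng.gauss(0, 0.5)
        rows.append(((signal, shortcut), label))
    return rows


def accuracy(stump, rows):
    feature, threshold = stump
    return sum((x[feature] > threshold) == bool(y) for x, y in rows) / len(rows)


def fit_stump(rows):
    """Pick the single feature and threshold with the best training accuracy."""
    best_acc, best_stump = -1.0, None
    for feature in range(2):
        for feats, _ in rows:
            stump = (feature, feats[feature])
            acc = accuracy(stump, rows)
            if acc > best_acc:
                best_acc, best_stump = acc, stump
    return best_stump


rng = random.Random(0)
train = make_site(rng, 200, shortcut_tracks_label=True)
same_site_holdout = make_site(rng, 200, shortcut_tracks_label=True)
new_site = make_site(rng, 200, shortcut_tracks_label=False)

stump = fit_stump(train)
within = accuracy(stump, same_site_holdout)
across = accuracy(stump, new_site)
print("feature chosen:", "shortcut" if stump[0] == 1 else "signal")
print(f"within-site accuracy: {within:.2f}")
print(f"cross-site accuracy:  {across:.2f}")
```

The learner takes the shortcut because it is the easier feature, and held-out data from the same site rewards it for doing so. Only the new site, where the artifact and the diagnosis come apart, shows what was learned.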

This is why cross-population failure should not be filed as an engineering problem to be patched with more data or a better optimizer. It is a finding about what the model was ever looking at. A model that fails across populations may never have been learning the disease at all, and within-population accuracy cannot reassure you, because within-population accuracy is exactly the number the shortcut inflates.

The part the paper does not say, and I do

I want to be careful about the boundary here, because it is the difference between reporting and editorializing.

This was five sites in one country, with one model, and even that was enough to break transfer: a hospital in San Diego and a lab in South Dakota were far enough apart in signal space to defeat generalization. Now widen the gap. The large pretrained brain models, the ones the field calls foundation models, are trained overwhelmingly on data from a small number of wealthy countries. The populations furthest from that training distribution are the ones whose clinics were never in the corpus at all, and by the logic of this study those are exactly where the site-shortcut failure will be largest, and exactly where a confident wrong answer costs the most.

The paper does not make that claim. It studied Albuquerque and Iowa City, not the world. I make it, because it is why I keep working on non-Western brain data instead of the convenient benchmarks, and because the mechanism this study demonstrates cleanly at the scale of five American cities is the same mechanism that will decide whether these systems work for everyone or only for the people whose data built them.

When a model breaks on a population it was not trained on, the field’s reflex is to treat the break as a gap to engineer shut. I read it the other way. The break is the experiment reporting its result. It tells you what the model was actually looking at, and the honest response is to listen to it before you deploy.