Soft Contamination Means Benchmarks Test Shallow Generalization
This is study number couple-of-hundred by now showing that generative AI models are lossy storage systems and that benchmarks primarily measure storage-and-retrieval performance.
Which is nothing new: 2023's "Pretraining on the Test Set Is All You Need" [1] already explained this succinctly enough that I made it my LinkedIn banner.
These papers all say the same thing: benchmark performance increases as the test set is added to the training data, and the "progress" in intelligence hyped up by the benchmarks is primarily the progress of scraping, copying, and stealing the world's intellectual-property corpus.
Or, much simpler:
We're comparing 7W human brains solving problems they've never seen on CodeForces to gigawatt-datacenter AI models running an agentic while-loop to locate and reconstruct the closest matching solution from their training data with massive compute, only to then run around claiming "it's smarter than a human" in the same way a librarian with access to every book in the world "is smarter".
Soft Contamination Means Benchmarks Test Shallow Generalization
If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching, which fails to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the Olmo3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.
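To make the failure mode concrete, here is a minimal sketch of why n-gram decontamination misses semantic duplicates. The bag-of-words cosine below is a toy stand-in for the sentence-embedding similarity the paper's pipeline implies (the example sentences, the trigram size, and any thresholds are illustrative assumptions, not the authors' setup):

```python
import math
from collections import Counter

def ngrams(text, n=3):
    # set of word n-grams, the unit typical decontamination filters match on
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ngram_overlap(a, b, n=3):
    # fraction of shared n-grams relative to the smaller n-gram set
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / max(1, min(len(ga), len(gb)))

def embed(text):
    # toy "embedding": a bag-of-words count vector; a real pipeline
    # would use a sentence encoder here instead
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# a benchmark-style item and a paraphrase of it (hypothetical examples)
test_item  = "A frog climbs 3 meters up a well each day and slips back 2 meters each night"
paraphrase = "Every day a snail ascends 3 m of a shaft sliding down 2 m overnight"

# string-space matching sees nothing; vector-space similarity is clearly nonzero,
# so the paraphrase would survive an n-gram decontamination filter
print(ngram_overlap(test_item, paraphrase))          # 0.0
print(cosine(embed(test_item), embed(paraphrase)))   # > 0
```

The point is not the toy numbers but the asymmetry: a rewording that shares no 3-gram with the test item scores zero under string matching while remaining detectably close in an embedding space, which is exactly the gap the paper calls soft contamination.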