Georg's Blog

Technology, leadership, and the digital frontier

September 1, 2025

on Arxiv

Pretraining on the Test Set Is All You Need

One of the earlier papers that conclusively showed that AI benchmarks primarily measure memorisation/training data expansion, explaining benchmaxxing before it became super popular.

Pretraining on the Test Set Is All You Need

Inspired by recent work demonstrating the promise of smaller Transformer-based language models pretrained on carefully curated data, we supercharge such approaches by investing heavily in curating a novel, high quality, non-synthetic data mixture based solely on evaluation benchmarks. Using our novel dataset mixture consisting of less than 100 thousand tokens, we pretrain a 1 million parameter transformer-based LLM \textbf{phi-CTNL} (pronounced ``fictional”) that achieves perfect results across diverse academic benchmarks, strictly outperforming all known foundation models. \textbf{phi-CTNL} also beats power-law scaling and exhibits a never-before-seen grokking-like ability to accurately predict downstream evaluation benchmarks’ canaries.

arxiv.org

← Back to all posts