Alignment Whack-a-Mole : Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

An excellent paper to add to the pile of papers proving that generative AI is primarily lossy compression, in this case via a clever use of finetuning.

The whole AGI / Intelligence humbug has, from the start, been about hiding that in the end Google (and everyone else) is just doing Google Books again, a project squashed in court before, just with a “transformative” retrieval UX glaze of paint and terminator narrative…. and getting away with it.

Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures […] undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.

The transformative nature here is to “solve” the knowledge economy in the same way Google Books wanted to - show and ultimately sell you the answer to your questions from copyrighted material, depriving the creators of their livelihoods.

Most of the “economic impact” aka efficiency gain currently ascribed to AI in fact come from the legal engineering - the centralization of knowledge, formerly decentralized and democratized over many actors in the economy onto single platforms and the implicit, by abstention, permission to ignore existing law out of fear of missing out of the future.

Add to it the destruction of limiting regulation by the same narrative - sustainability, non discrimination, reliability requirements and you got yourself “efficiency growth” most governments see as desirable because it moves economic measurements, but in fact just reallocates value from workers, creatives and everyone who has to suffer through degraded, unfounded societal functions ¹ to tech shareholders.

Removal of friction and centralization can increase efficiency - any government could centralize everyones works and make the available via an API too, but we’d probably call that communism unless they charged for it,…

The Technical angle

The semantic retrieval capabilities of the transformer are genuinely novel and exciting, however they are balanced out by massive, at present unfixable architecture flaws - chiefly prompt injection and hallucinations - that preclude the use of the technology without degrading reliability … aka the majority of use cases for GenAI only work if we accept degradation of quality, often in areas where quality is life defining: Healthcare, Job seeking, Crime, Justice, Finance.

There are massive costs to the AI boom as a result:

For example, an intellectual property hub like Singapore, having spent decades building up IP industry and rule of law as a major competitive advantage in the SEA region, is now pushing heads first into GenAI across all businesses , creating a massive chasm that will, eventually, have to be bridged - either at the cost of the rule of law or technology stance.

The tough choice for people and business owner alike is that they either have to bet on the death of IP - passing any IP though a transformer at present effectively strips it - or on the failure of the AI narrative.
More than a trillion dollar are betting on the former for now and most, especially those located in more law abiding locations outside Silicon Valley, can’t or won’t choose, leaving them stranded and unable to act.

The value extraction from AI is degrading everyday life ↩

Georg's Blog

Alignment Whack-a-Mole : Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

The Technical angle

Just a moment...