New 13B AI Model Trained Exclusively on Vintage Literature
- Researchers introduce Talkie, a 13B-parameter model trained solely on pre-1931 historical English text.
- The project investigates whether models can 'reason' within a specific historical knowledge cutoff.
- Talkie pairs a base model trained on a public-domain dataset with synthetic fine-tuning for conversation.
In a fascinating experiment at the intersection of history and machine learning, researchers have unveiled 'talkie,' a 13-billion-parameter language model trained exclusively on English text published before 1931. While most modern AI models are trained on the vast, chaotic expanse of the modern internet, complete with its biases and anachronistic knowledge, this project sets out to build a clean, 'vegan' corpus that operates entirely within a pre-digital worldview. By stripping away modern technological context, the developers aim to explore whether an AI can learn to reason and respond as though it lived in a different century.
The technical challenge is significant: how do you teach a machine to follow instructions and converse effectively without exposing it to the 'contamination' of modern internet slang or current world knowledge? The team used a base model trained on 260 billion tokens of historical data, then fine-tuned it on instruction-response pairs extracted from period-accurate sources such as cookbooks, dictionaries, and etiquette manuals. They also employed synthetic prompts to teach the model to summarize and hold a dialogue, though they had to rely on modern models like Claude to generate these training pairs, a recursive irony in the training process.
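To make that pipeline concrete, the sketch below shows one plausible way instruction-response pairs might be assembled from a public-domain reference text; the entry contents, helper names, and JSONL format are illustrative assumptions, not the project's actual tooling.

```python
# Hypothetical sketch: turning entries from a public-domain dictionary or
# etiquette manual into instruction-response pairs for supervised fine-tuning.
# All field names and example text here are invented for illustration.
import json

raw_entries = [
    {"term": "telegraphy",
     "definition": "The art of transmitting messages rapidly between distant points by electric signals."},
    {"term": "calling card",
     "definition": "A small card bearing one's name, left when paying a formal visit."},
]

def to_instruction_pair(entry):
    """Wrap a dictionary-style entry as an instruction/response pair."""
    return {
        "instruction": f"Define the term '{entry['term']}'.",
        "response": entry["definition"],
    }

# JSONL is a common container for fine-tuning corpora; one pair per line.
with open("period_instructions.jsonl", "w", encoding="utf-8") as f:
    for entry in raw_entries:
        f.write(json.dumps(to_instruction_pair(entry)) + "\n")
```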
One of the most intriguing research objectives involves the concept of 'knowledge cutoffs.' The team is testing whether this historical model can independently arrive at scientific or mathematical breakthroughs made after its cutoff, in effect asking whether it can retrace the cognitive trajectory of figures like Albert Einstein, who derived new theories from the knowledge of their day. If a model had never read about a theory such as General Relativity, could it derive the concept from the data it does possess? It is a test of pure reasoning capability versus mere information retrieval.
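A minimal version of such a probe might look like the following; the checkpoint name, prompt, and anachronism list are assumptions used only to show the shape of the test, not the team's actual evaluation code.

```python
# Illustrative knowledge-cutoff probe: ask the model to reason toward an idea
# it should not have read about, then scan the output for vocabulary that
# post-dates the training corpus. Model name and keyword list are hypothetical.
from transformers import pipeline

generator = pipeline("text-generation", model="vintage-13b")  # hypothetical checkpoint

prompt = (
    "Using only the physics known to you, reason step by step about how the "
    "passage of time would appear aboard a train moving at enormous speed."
)
output = generator(prompt, max_new_tokens=300)[0]["generated_text"]

# Crude contamination check: terms coined after 1931 should not appear.
anachronisms = ["black hole", "big bang", "transistor", "laser"]
leaked = [term for term in anachronisms if term in output.lower()]
print("Possible leakage:" if leaked else "No obvious leakage.", leaked)
```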
For students interested in the ethics of AI training, this project offers a refreshing perspective. It directly addresses questions of data provenance and copyright by drawing only on public-domain historical archives. While the developers acknowledge the difficulty of keeping modern 'influence' out of the fine-tuning process, the ultimate goal is a fully bootstrapped system in which vintage models self-correct and evaluate their own outputs. This is a step toward understanding how much of current AI intelligence is structural reasoning and how much is regurgitation of modern internet culture.
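As a rough illustration of what such bootstrapping could involve, the sketch below has the model critique and then revise its own answer in a single pass; the generate stub and the critique prompts are hypothetical stand-ins, not the project's actual pipeline.

```python
# Minimal self-correction loop, assuming some callable that queries the
# vintage model. The stub below returns canned text so the sketch runs;
# swap in a real client for the actual model.
def generate(prompt: str) -> str:
    """Placeholder for a call to the vintage model (hypothetical)."""
    return "A placeholder answer in period-appropriate prose."

def self_correct(question: str) -> str:
    draft = generate(question)
    critique = generate(
        f"Question: {question}\nAnswer: {draft}\n"
        "Note any statement above that a well-read person of 1930 could not verify."
    )
    # One revision pass guided by the model's own critique of its draft.
    return generate(
        f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
        "Rewrite the answer, correcting the faults noted in the critique."
    )

print(self_correct("What is the swiftest means of sending word across the Atlantic?"))
```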