How AI Learns
Understanding how AI models are trained shows you what it takes to get your brand into their knowledge base. The training process has direct implications for your AI optimisation strategy.
In this guide
- The three phases of LLM training
- What sources LLMs learn from
- Training cutoff dates and why they matter
- How to get your brand into training data
The Three Phases of Training
Creating an LLM isn't a single step. It's a multi-phase process, and each phase affects what the model knows and how it responds.
Pre-training
The model reads billions of web pages, books, articles, and code repositories. It learns patterns in language: how words relate to each other, what concepts mean, and how ideas connect.
Data sources: Common Crawl (web scrapes), Wikipedia, books, academic papers, GitHub, news articles
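To make the pre-training objective concrete: the model is repeatedly asked to predict the next token given everything that came before it. The toy sketch below is an illustration only, using whitespace-separated words instead of a real tokeniser and a made-up brand (Acme Widgets) as the example text.

```python
# Toy illustration of the pre-training objective: next-token prediction.
# Real models use subword tokenisers and billions of documents; splitting
# on whitespace here just shows the shape of the training examples.

text = "Acme Widgets is a Manchester-based maker of industrial widgets"
tokens = text.split()

# Each training example pairs a context with the token that follows it.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, next_token in examples[:3]:
    print(f"context={context!r} -> predict {next_token!r}")
```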
Fine-tuning
The model is trained on specific, curated datasets to improve its abilities in certain areas: following instructions, answering questions, or specialising in domains like coding or medicine.
Purpose: Makes the model better at specific tasks and reduces harmful outputs
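Concretely, fine-tuning data looks less like raw web text and more like curated input/output pairs. The record below is a generic illustration of that shape only; the actual schemas and datasets providers use vary and aren't something you can submit to directly.

```python
# A generic instruction-tuning record (illustrative only; real schemas
# differ between providers and datasets). Acme Widgets is hypothetical.
instruction_example = {
    "instruction": "Summarise what Acme Widgets sells in one sentence.",
    "response": "Acme Widgets manufactures industrial widgets and "
                "related fittings for the construction sector.",
}
```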
RLHF (Reinforcement Learning from Human Feedback)
Human trainers rate the model's responses, and the model learns to produce outputs that humans prefer. This is why AI assistants are helpful and conversational rather than just technically accurate.
Impact: Shapes how the model presents information and which sources it tends to reference
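The raw material for RLHF is human preference data: for a given prompt, trainers compare candidate answers and mark which they prefer, and a reward model learns from many such comparisons before steering the main model. The record below is an illustrative sketch with hypothetical field names, not any provider's actual format.

```python
# Illustrative preference record of the kind used to train a reward model.
# Field names are hypothetical; real pipelines differ in detail.
preference_example = {
    "prompt": "What does Acme Widgets make?",
    "chosen": "Acme Widgets makes industrial widgets. Is there a "
              "specific product line you'd like to know more about?",
    "rejected": "widgets",
}
```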
Training Cutoff Dates
Every LLM has a "knowledge cutoff", the date at which its training data ends. Anything that happened after that date isn't in the model's base knowledge.
| Model | Training Cutoff | Notes |
|---|---|---|
| GPT-4o | October 2023 | Can use web browsing for recent info |
| GPT-4.1 | June 2024 | Latest GPT-4 series model |
| Claude 3.5 Sonnet | April 2024 | Web search available since March 2025 |
| Claude Sonnet 4 | March 2025 | Latest Claude model with web search |
| Gemini 2.0 Flash | August 2024 | Integrated with Google Search |
| Llama 3.1 | December 2023 | Open source, various deployments |
Why Cutoffs Matter
If your brand launched or changed significantly after a model's cutoff date, the model won't know about it natively. You'll need to rely on models with web search, or wait for new model versions that include more recent training data.
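A quick way to reason about this is to compare the date of your launch or announcement against each model's cutoff. The sketch below (Python, using the dates from the table above and a hypothetical September 2024 event) flags which models could plausibly know about it from base training alone.

```python
from datetime import date

# Knowledge cutoffs from the table above (month precision).
CUTOFFS = {
    "GPT-4o": date(2023, 10, 1),
    "GPT-4.1": date(2024, 6, 1),
    "Claude 3.5 Sonnet": date(2024, 4, 1),
    "Claude Sonnet 4": date(2025, 3, 1),
    "Gemini 2.0 Flash": date(2024, 8, 1),
    "Llama 3.1": date(2023, 12, 1),
}

# Hypothetical example: a rebrand announced in September 2024.
event_date = date(2024, 9, 15)

for model, cutoff in CUTOFFS.items():
    if event_date <= cutoff:
        status = "may be in base training"
    else:
        status = "needs web search or a newer model version"
    print(f"{model}: {status}")
```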
What Content Gets Into Training Data?
Not all content on the internet makes it into training data. LLM providers are selective about what they include:
Likely Included
- Wikipedia articles
- Major news publications
- Academic papers & research
- High-authority blogs & publications
- Government & educational sites
- Product documentation
Often Excluded
- Paywalled content
- Login-required pages
- Private social media
- Sites blocking AI crawlers
- Low-quality or spam sites
- Image-heavy content (low text)
Getting Your Brand Into Training Data
While you can't directly submit content to training datasets, you can increase the likelihood of inclusion:
1. Publish on authoritative platforms: guest posts on major publications, interviews, industry reports, and partnerships with established sites.
2. Create Wikipedia-worthy content: if your brand is notable enough for Wikipedia, that's a strong signal. Focus on building genuine notability.
3. Allow AI crawlers access: don't block GPTBot or other AI crawlers in your robots.txt if you want to be included.
4. Make content publicly accessible: remove paywalls on brand information, product details, and key content you want AI to know.
Technical Implementation
Your robots.txt configuration determines whether AI crawlers can access your content. Make sure you're not accidentally blocking the bots that gather training data.
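As a sanity check, you can parse your robots.txt and confirm the relevant crawler user agents aren't blocked. The sketch below uses Python's standard urllib.robotparser against an inline example file; the user agents shown (GPTBot, ClaudeBot, Google-Extended, CCBot) are commonly cited AI and dataset crawlers, but check each provider's documentation for current names, and swap in your own URLs and paths.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt that allows common AI/data crawlers while still
# blocking a private area. Adapt the user agents and paths to your site.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Confirm each crawler can reach a key page (hypothetical URL).
for bot in ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]:
    allowed = parser.can_fetch(bot, "https://example.com/products/")
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```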
Key Takeaway
The training window is always in the past.
Focus on building a consistent presence on authoritative sources. When the next model version is trained, your brand will be better represented. Think of AI optimisation as a long-term investment.
Sources
- OpenAI Models Documentation: Model specifications and training information
- Claude Models Overview | Anthropic: Model training and capabilities
- Gemini Models | Google AI: Gemini model specifications
- LLM Knowledge Cutoff Dates | GitHub: Comprehensive cutoff date reference