
How AI Learns

Understanding how AI models are trained shows you what it takes to get your brand into their knowledge base. The training process has direct implications for your AI optimisation strategy.

In this guide

  • The three phases of LLM training
  • What sources LLMs learn from
  • Training cutoff dates and why they matter
  • How to get your brand into training data
10 min read · Prerequisite: What is an LLM?

The Three Phases of Training

Creating an LLM isn't a single step. It's a multi-phase process, and each phase affects what the model knows and how it responds.

1. Pre-training

The model reads billions of web pages, books, articles, and code repositories. It learns patterns in language: how words relate to each other, what concepts mean, and how ideas connect.

Data sources: Common Crawl (web scrapes), Wikipedia, books, academic papers, GitHub, news articles
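Concretely, pre-training boils down to next-token prediction: given the text so far, the model guesses what comes next, and its parameters are adjusted toward what the corpus actually said. A toy sketch of that idea, using a hypothetical counting "model" rather than a neural network:

```python
# Toy illustration of the pre-training objective: next-token prediction.
# The corpus and counting "model" are illustrative stand-ins; real LLMs
# train neural networks over vast corpora and large token vocabularies.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat ate".split()

# "Training": count which token follows which (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most likely next token seen after `token` in the corpus."""
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" (follows "the" twice vs "mat" once)
print(predict_next("cat"))  # -> "sat" (tie with "ate"; first occurrence wins)
```

A real LLM replaces the counting table with billions of neural-network parameters, but the question it answers during training is the same: what token comes next?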

2. Fine-tuning

The model is trained on specific, curated datasets to improve its abilities in certain areas: following instructions, answering questions, or specialising in domains like coding or medicine.

Purpose: Makes the model better at specific tasks and reduces harmful outputs
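What "curated datasets" look like in practice: small collections of example prompts paired with ideal responses, which the model is then trained to imitate. A sketch with hypothetical records (the field names and examples are illustrative; real formats vary by provider):

```python
# Hypothetical instruction-tuning records in the common prompt/response shape.
# Field names and content are illustrative, not any provider's actual schema.
import json

examples = [
    {
        "prompt": "Summarise in one sentence: 'Great battery life, but the screen is dim.'",
        "response": "The reviewer likes the battery life but finds the screen too dim.",
    },
    {
        "prompt": "Is this email spam? 'You won a free cruise, click here now!'",
        "response": "Yes, this message shows classic signs of spam.",
    },
]

# Fine-tuning reuses the next-token objective from pre-training, but only on
# curated pairs like these, pulling the model toward the demonstrated behaviour.
print(json.dumps(examples[0], indent=2))
```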

3. RLHF (Reinforcement Learning from Human Feedback)

Human trainers rate the model's responses, and the model learns to produce outputs that humans prefer. This is why AI assistants are helpful and conversational rather than just technically accurate.

Impact: Shapes how the model presents information and which sources it tends to reference
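The raw material for RLHF is preference data: two candidate answers to the same prompt plus a human judgement about which is better. A reward model is trained on those comparisons, and the LLM is then optimised to score well against it. A toy sketch of the comparison format (the structure and scoring function are simplified stand-ins):

```python
# Toy preference record of the kind used to train a reward model in RLHF.
comparison = {
    "prompt": "Explain what a knowledge cutoff is.",
    "chosen": "It's the date after which events aren't in the model's training data.",
    "rejected": "Cutoff. Data stops.",
}

def toy_reward(text: str) -> float:
    """Stand-in scorer; a real reward model is a trained neural network."""
    return float(len(text.split()))  # pretend longer = more helpful

# The training signal: the human-chosen response must outscore the rejected one.
assert toy_reward(comparison["chosen"]) > toy_reward(comparison["rejected"])
print("reward model prefers the human-chosen answer")
```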

Training Cutoff Dates

Every LLM has a "knowledge cutoff", which is the date when training data stops. Anything that happened after this date isn't in the model's base knowledge.

Model               Training Cutoff   Notes
GPT-4o              October 2023      Can use web browsing for recent info
GPT-4.1             June 2024         Latest GPT-4 series model
Claude 3.5 Sonnet   April 2024        Web search available since March 2025
Claude Sonnet 4     March 2025        Latest Claude model with web search
Gemini 2.0 Flash    August 2024       Integrated with Google Search
Llama 3.1           December 2023     Open source, various deployments
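A practical way to use these dates: compare your brand's milestones against each cutoff to see which models could know about them natively. A small sketch (dates copied from the table above; treat real cutoffs as approximate, since they vary between model snapshots):

```python
# Which models could have a given brand event in their base knowledge?
# Cutoff dates are taken from the table above and are approximate.
from datetime import date

CUTOFFS = {
    "GPT-4o": date(2023, 10, 31),
    "GPT-4.1": date(2024, 6, 30),
    "Claude 3.5 Sonnet": date(2024, 4, 30),
    "Claude Sonnet 4": date(2025, 3, 31),
    "Gemini 2.0 Flash": date(2024, 8, 31),
    "Llama 3.1": date(2023, 12, 31),
}

brand_launch = date(2024, 9, 1)  # hypothetical: your rebrand or product launch

for model, cutoff in CUTOFFS.items():
    verdict = "in base knowledge" if brand_launch <= cutoff else "needs web search"
    print(f"{model:18} {verdict}")
```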

Why Cutoffs Matter

If your brand launched or changed significantly after a model's cutoff date, the model won't know about it natively. You'll need to rely on models with web search, or wait for new model versions that include more recent training data.

What Content Gets Into Training Data?

Not all content on the internet makes it into training data. LLM providers are selective about what they include:

Likely Included

  • Wikipedia articles
  • Major news publications
  • Academic papers & research
  • High-authority blogs & publications
  • Government & educational sites
  • Product documentation

Often Excluded

  • Paywalled content
  • Login-required pages
  • Private social media
  • Sites blocking AI crawlers
  • Low-quality or spam sites
  • Image-heavy content (low text)

Getting Your Brand Into Training Data

While you can't directly submit content to training datasets, you can increase the likelihood of inclusion:

  1. Publish on authoritative platforms

     Guest posts on major publications, interviews, industry reports, and partnerships with established sites.

  2. Create Wikipedia-worthy content

     If your brand is notable enough for Wikipedia, that's a strong signal. Focus on building genuine notability.

  3. Allow AI crawlers access

     Don't block GPTBot or other AI crawlers in your robots.txt if you want to be included (see the check sketched just after this list).

  4. Make content publicly accessible

     Remove paywalls on brand information, product details, and key content you want AI to know.
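You can verify the crawler-access step the same way a well-behaved bot would: by parsing your live robots.txt. A sketch using Python's standard urllib.robotparser (the domain is a placeholder; swap in your own site and the crawlers you care about):

```python
# Check whether common AI crawlers may fetch a given page, according to
# the same robots.txt rules a compliant bot applies. Placeholder domain.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetches and parses the live robots.txt

page = "https://example.com/products"
for agent in ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]:
    status = "allowed" if robots.can_fetch(agent, page) else "BLOCKED"
    print(f"{agent:16} {status} for {page}")
```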

Technical Implementation

Your robots.txt configuration determines whether AI crawlers can access your content. Make sure you're not accidentally blocking the bots that gather training data.
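For example, a minimal robots.txt that explicitly welcomes the major AI crawlers might look like the sketch below. The user-agent names are the ones these providers publish; the allow/disallow rules are illustrative, so adjust them to your own site:

```
# robots.txt — illustrative example, not a one-size-fits-all policy
User-agent: GPTBot           # OpenAI training crawler
Allow: /

User-agent: ClaudeBot        # Anthropic crawler
Allow: /

User-agent: Google-Extended  # Google's AI training control token
Allow: /

User-agent: CCBot            # Common Crawl (feeds many training sets)
Allow: /

User-agent: *
Allow: /
Disallow: /private/
```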

Related guide: Configure Robots.txt for AI

Key Takeaway

The training window is always in the past.

Focus on building a consistent presence on authoritative sources. When the next model version is trained, your brand will be better represented. Think of AI optimisation as a long-term investment.
