
How AI Learns

Understanding how AI models are trained shows you what it takes to get your brand into their knowledge base. The training process has direct implications for your AI optimisation strategy.

In this guide

  • The three phases of LLM training
  • What sources LLMs learn from
  • Training cutoff dates and why they matter
  • How to get your brand into training data
10 min read · Prerequisite: What is an LLM?

The Three Phases of Training

Creating an LLM isn't a single step. It's a multi-phase process, and each phase affects what the model knows and how it responds.

1. Pre-training

The model reads billions of web pages, books, articles, and code repositories. It learns patterns in language: how words relate to each other, what concepts mean, and how ideas connect.

Data sources: Common Crawl (web scrapes), Wikipedia, books, academic papers, GitHub, news articles
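Concretely, pre-training boils down to next-token prediction: given the text so far, the model guesses what comes next, and its parameters are adjusted toward what the corpus actually said. A toy sketch of that idea, using a hypothetical counting "model" rather than a neural network:

```python
# Toy illustration of the pre-training objective: next-token prediction.
# The corpus and counting "model" are illustrative stand-ins; real LLMs
# train neural networks over vast corpora and large token vocabularies.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat ate".split()

# "Training": count which token follows which (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most likely next token seen after `token` in the corpus."""
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" (follows "the" twice vs "mat" once)
print(predict_next("cat"))  # -> "sat" (tie with "ate"; first occurrence wins)
```

A real LLM replaces the counting table with billions of neural-network parameters, but the question it answers during training is the same: what token comes next?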

2. Fine-tuning

The model is trained on specific, curated datasets to improve its abilities in certain areas: following instructions, answering questions, or specialising in domains like coding or medicine.

Purpose: Makes the model better at specific tasks and reduces harmful outputs
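What "curated datasets" look like in practice: small collections of example prompts paired with ideal responses, which the model is then trained to imitate. A sketch with hypothetical records (the field names and examples are illustrative; real formats vary by provider):

```python
# Hypothetical instruction-tuning records in the common prompt/response shape.
# Field names and content are illustrative, not any provider's actual schema.
import json

examples = [
    {
        "prompt": "Summarise in one sentence: 'Great battery life, but the screen is dim.'",
        "response": "The reviewer likes the battery life but finds the screen too dim.",
    },
    {
        "prompt": "Is this email spam? 'You won a free cruise, click here now!'",
        "response": "Yes, this message shows classic signs of spam.",
    },
]

# Fine-tuning reuses the next-token objective from pre-training, but only on
# curated pairs like these, pulling the model toward the demonstrated behaviour.
print(json.dumps(examples[0], indent=2))
```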

3. RLHF (Reinforcement Learning from Human Feedback)

Human trainers rate the model's responses, and the model learns to produce outputs that humans prefer. This is why AI assistants are helpful and conversational rather than just technically accurate.

Impact: Shapes how the model presents information and which sources it tends to reference
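The raw material for RLHF is preference data: two candidate answers to the same prompt plus a human judgement about which is better. A reward model is trained on those comparisons, and the LLM is then optimised to score well against it. A toy sketch of the comparison format (the structure and scoring function are simplified stand-ins):

```python
# Toy preference record of the kind used to train a reward model in RLHF.
comparison = {
    "prompt": "Explain what a knowledge cutoff is.",
    "chosen": "It's the date after which events aren't in the model's training data.",
    "rejected": "Cutoff. Data stops.",
}

def toy_reward(text: str) -> float:
    """Stand-in scorer; a real reward model is a trained neural network."""
    return float(len(text.split()))  # pretend longer = more helpful

# The training signal: the human-chosen response must outscore the rejected one.
assert toy_reward(comparison["chosen"]) > toy_reward(comparison["rejected"])
print("reward model prefers the human-chosen answer")
```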

Training Cutoff Dates

Every LLM has a "knowledge cutoff", which is the date when training data stops. Anything that happened after this date isn't in the model's base knowledge.

Model               Training Cutoff   Notes
GPT-4o              October 2023      Can use web browsing for recent info
GPT-4.1             June 2024         Latest GPT-4 series model
Claude 3.5 Sonnet   April 2024        Web search available since March 2025
Claude Sonnet 4     March 2025        Latest Claude model with web search
Gemini 2.0 Flash    August 2024       Integrated with Google Search
Llama 3.1           December 2023     Open source, various deployments
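A practical way to use these dates: compare your brand's milestones against each cutoff to see which models could know about them natively. A small sketch (dates copied from the table above; treat real cutoffs as approximate, since they vary between model snapshots):

```python
# Which models could have a given brand event in their base knowledge?
# Cutoff dates are taken from the table above and are approximate.
from datetime import date

CUTOFFS = {
    "GPT-4o": date(2023, 10, 31),
    "GPT-4.1": date(2024, 6, 30),
    "Claude 3.5 Sonnet": date(2024, 4, 30),
    "Claude Sonnet 4": date(2025, 3, 31),
    "Gemini 2.0 Flash": date(2024, 8, 31),
    "Llama 3.1": date(2023, 12, 31),
}

brand_launch = date(2024, 9, 1)  # hypothetical: your rebrand or product launch

for model, cutoff in CUTOFFS.items():
    verdict = "in base knowledge" if brand_launch <= cutoff else "needs web search"
    print(f"{model:18} {verdict}")
```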

Why Cutoffs Matter

If your brand launched or changed significantly after a model's cutoff date, the model won't know about it natively. You'll need to rely on models with web search, or wait for new model versions that include more recent training data.

What Content Gets Into Training Data?

Not all content on the internet makes it into training data. LLM providers are selective about what they include:

Likely Included

  • Wikipedia articles
  • Major news publications
  • Academic papers & research
  • High-authority blogs & publications
  • Government & educational sites
  • Product documentation

Often Excluded

  • Paywalled content
  • Login-required pages
  • Private social media
  • Sites blocking AI crawlers
  • Low-quality or spam sites
  • Image-heavy content (low text)

Getting Your Brand Into Training Data

While you can't directly submit content to training datasets, you can increase the likelihood of inclusion:

  1. Publish on authoritative platforms

     Guest posts on major publications, interviews, industry reports, and partnerships with established sites.

  2. Create Wikipedia-worthy content

     If your brand is notable enough for Wikipedia, that's a strong signal. Focus on building genuine notability.

  3. Allow AI crawlers access

     Don't block GPTBot or other AI crawlers in your robots.txt if you want to be included (see the check sketched just after this list).

  4. Make content publicly accessible

     Remove paywalls on brand information, product details, and key content you want AI to know.
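You can verify the crawler-access step the same way a well-behaved bot would: by parsing your live robots.txt. A sketch using Python's standard urllib.robotparser (the domain is a placeholder; swap in your own site and the crawlers you care about):

```python
# Check whether common AI crawlers may fetch a given page, according to
# the same robots.txt rules a compliant bot applies. Placeholder domain.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetches and parses the live robots.txt

page = "https://example.com/products"
for agent in ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]:
    status = "allowed" if robots.can_fetch(agent, page) else "BLOCKED"
    print(f"{agent:16} {status} for {page}")
```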

Technical Implementation

Your robots.txt configuration determines whether AI crawlers can access your content. Make sure you're not accidentally blocking the bots that gather training data.
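For example, a minimal robots.txt that explicitly welcomes the major AI crawlers might look like the sketch below. The user-agent names are the ones these providers publish; the allow/disallow rules are illustrative, so adjust them to your own site:

```
# robots.txt — illustrative example, not a one-size-fits-all policy
User-agent: GPTBot           # OpenAI training crawler
Allow: /

User-agent: ClaudeBot        # Anthropic crawler
Allow: /

User-agent: Google-Extended  # Google's AI training control token
Allow: /

User-agent: CCBot            # Common Crawl (feeds many training sets)
Allow: /

User-agent: *
Allow: /
Disallow: /private/
```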

Related guide: Configure Robots.txt for AI

Key Takeaway

The training window is always in the past.

Focus on building a consistent presence on authoritative sources. When the next model version is trained, your brand will be better represented. Think of AI optimisation as a long-term investment.
