ChatGPT / GPT-4
ChatGPT is the most widely used AI assistant, with over 400 million weekly active users. Understanding how it retrieves information is essential for any AI optimisation strategy.
In this guide
- GPT-4 variants and their differences
- When ChatGPT uses web browsing
- Training data sources and cutoffs
- Optimisation strategies for ChatGPT
Understanding the GPT-4 Family
OpenAI offers several GPT-4 variants, each with different capabilities:
GPT-4o ("omni")
The default model for ChatGPT Plus users. Handles text, images, and audio. Training cutoff: June 2024.
Web browsing: Available, triggered automatically or on request
GPT-4 Turbo
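An older model that is no longer the default. Its training cutoff is December 2023 (April 2024 is the release date of its final snapshot), so it predates GPT-4o's June 2024 cutoff and is not the better choice for recent information when browsing is off.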
Web browsing: Available via API
GPT-4o-mini
A faster, cheaper variant. Often used in the free tier and in high-volume applications.
Web browsing: Limited in free tier
When Does ChatGPT Search the Web?
ChatGPT doesn't always use web search. Understanding when it does is crucial for your strategy:
Likely to Search
- Questions about recent events
- Current pricing or availability
- Queries with dates or recency terms ("2024", "latest")
- User explicitly requests search
- Unknown entities or niche topics
Likely Uses Training Data
- General knowledge questions
- Well-known brands and products
- Historical information
- How-to and educational content
- Coding and technical help
Key Takeaway
ChatGPT uses a hybrid approach.
Optimise for both scenarios: build a presence in authoritative sources so that your brand is included in training data, and maintain strong SEO for the cases where ChatGPT searches. The model decides which path to take based on the query.
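The signals listed above can be turned into a rough checklist for auditing your own target queries. The sketch below is purely illustrative: the pattern names and keyword lists are assumptions chosen for this example, and it does not reproduce how ChatGPT actually decides whether to search.

```python
import re

# Hypothetical keyword patterns, loosely based on the "Likely to Search" list.
# These are illustrative assumptions, not ChatGPT's real decision logic.
SEARCH_SIGNALS = {
    "explicit_request": re.compile(r"\b(search|look up|browse)\b", re.I),
    "recency": re.compile(r"\b(latest|today|current|currently|recent|news)\b", re.I),
    "year": re.compile(r"\b20\d{2}\b"),
    "pricing_or_availability": re.compile(
        r"\b(price|pricing|cost|in stock|availability|available)\b", re.I
    ),
}


def search_signals(query: str) -> list[str]:
    """Return the names of the search-trigger signals found in a query."""
    return [name for name, pattern in SEARCH_SIGNALS.items() if pattern.search(query)]


if __name__ == "__main__":
    for q in [
        "What is the latest pricing for Acme CRM in 2024?",
        "How do I reverse a list in Python?",
    ]:
        print(q, "->", search_signals(q))
```

A query that returns several signals is one where search visibility probably matters most; a query that returns none is more likely to be answered from training data.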
What Training Data Includes
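OpenAI has not published a detailed breakdown of GPT-4's training data, describing it only as a mix of publicly available and licensed data. Based on disclosures for earlier GPT models, it is generally understood to include sources such as: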
- Web pages from Common Crawl (filtered for quality)
- Wikipedia and other encyclopedic sources
- Books and published content
- Code repositories (GitHub, etc.)
- Scientific papers and research
Optimisation Strategies for ChatGPT
1. For Training Data Inclusion
- Get featured in major publications (TechCrunch, Forbes, industry publications)
- Maintain accurate Wikipedia presence if notable
- Publish on high-authority domains that allow GPTBot crawling
- Create comprehensive, factual content about your brand
2. For Web Search Retrieval
- Maintain strong search visibility on Bing as well as Google (ChatGPT's web search has relied on Bing as a search provider)
- Optimise for featured snippets and direct answers
- Keep content fresh with clear update dates
- Structure content with clear headings and FAQ sections
3. Technical Considerations
- Allow GPTBot in robots.txt to enable training data crawling
- Use schema markup for entity disambiguation
- Ensure fast page load times for search retrieval
- Make key information accessible without JavaScript
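As an illustration of the schema markup point, the following sketch builds a schema.org Organization block as JSON-LD. The helper name `organization_jsonld` and the brand name and URLs in the usage example are placeholders invented for this example; treat it as a starting point rather than a complete markup strategy.

```python
import json


def organization_jsonld(name: str, url: str, description: str, same_as: list[str]) -> str:
    """Build a JSON-LD <script> tag describing an organisation (hypothetical helper)."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "description": description,
        # sameAs links the entity to its other official profiles,
        # which helps distinguish it from similarly named brands.
        "sameAs": list(same_as),
    }
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(data, indent=2)
        + "\n</script>"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(
        organization_jsonld(
            name="Acme CRM",
            url="https://www.example.com",
            description="Customer relationship software for small teams.",
            same_as=["https://www.linkedin.com/company/example"],
        )
    )
```

Because the output is plain HTML in the page source, it also satisfies the point about keeping key information accessible without JavaScript.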
Technical Implementation
OpenAI's GPTBot crawler can be controlled via robots.txt. Learn how to configure it to allow training data crawling while protecting sensitive content.
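A minimal sketch of such a configuration, checked with Python's standard `urllib.robotparser`, is shown below. The `/private/` path and the example.com URLs are assumptions for illustration; `GPTBot` is the user-agent token OpenAI documents for its crawler. Python's parser applies rules in file order, so the `Disallow` line is placed before the broader `Allow`.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: let GPTBot crawl the site except a hypothetical /private/ area.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/
Allow: /

User-agent: *
Disallow: /private/
"""


def gptbot_can_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for url in [
        "https://example.com/blog/product-overview",
        "https://example.com/private/pricing-draft",
    ]:
        print(url, "->", gptbot_can_fetch(ROBOTS_TXT, url))
```

Running a check like this against your live robots.txt before deployment is a cheap way to confirm that public pages are open to GPTBot and sensitive paths are not.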
GPTBot Configuration
Common ChatGPT Issues
Outdated Information
When it answers from training data rather than a web search, ChatGPT may cite old pricing, discontinued products, or outdated company descriptions, because that data ends at its June 2024 cutoff.
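One way to gauge your exposure is to list the brand facts that changed after the cutoff, since those are the ones training data cannot reflect. The sketch below assumes the cutoff falls at the end of June 2024 and uses invented fact names and dates; substitute your own change log.

```python
from datetime import date

# Assumption: treat the end of June 2024 as the training cutoff.
TRAINING_CUTOFF = date(2024, 6, 30)


def changed_after_cutoff(
    facts: dict[str, date], cutoff: date = TRAINING_CUTOFF
) -> list[str]:
    """Return the names of facts last changed after the cutoff, sorted by name."""
    return sorted(name for name, changed in facts.items() if changed > cutoff)


if __name__ == "__main__":
    # Hypothetical change log: fact name -> date it last changed.
    brand_facts = {
        "pricing": date(2025, 1, 15),
        "company description": date(2023, 3, 1),
        "product lineup": date(2024, 9, 1),
    }
    print(changed_after_cutoff(brand_facts))
```

Facts flagged by this check are the ones to prioritise on pages that ChatGPT can reach through web search.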
Competitor Confusion
If your brand name is similar to others, ChatGPT may conflate information. Use distinctive brand language consistently.
Missing from Responses
If ChatGPT doesn't mention your brand, its training data likely contains too little about you. Focus on building an authoritative content presence.
Sources
- OpenAI GPT-4o Model Documentation: Official model specifications and capabilities
- Hello GPT-4o | OpenAI: Original GPT-4o announcement (May 2024)
- GPT-4o System Card: Safety evaluations and technical specifications
- GPTBot Documentation: OpenAI's web crawler for training data
- How People Are Using ChatGPT | OpenAI: Usage statistics and user data