AI Crawler Landscape
AI companies deploy crawlers to gather training data and enable real-time retrieval. Understanding these bots is the first step to controlling your AI visibility.
In this guide
- Identify all major AI crawler user agents
- Understand the difference between training and retrieval crawlers
- Know which companies use which bots
- Make informed decisions about crawler access
AI Crawler User Agents
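This table lists the major AI crawler user agents known as of 2025. Use this reference when configuring your robots.txt. Note that Google-Extended is a robots.txt control token rather than a separate crawler: Google fetches pages with its existing user agents, so this token will not appear in your server logs.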
| User Agent | Company | Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Training data collection | Yes |
| ChatGPT-User | OpenAI | Real-time browsing (ChatGPT) | Yes |
| OAI-SearchBot | OpenAI | SearchGPT retrieval | Yes |
| anthropic-ai | Anthropic | Training data collection | Yes |
| ClaudeBot | Anthropic | Training data collection | Yes |
| Google-Extended | Google | Gemini/Bard training | Yes |
| PerplexityBot | Perplexity | Real-time search retrieval | Yes |
| Bytespider | ByteDance | Training data collection | Partial |
| CCBot | Common Crawl | Open dataset (used by many) | Yes |
| FacebookBot | Meta | Training data (Llama) | Yes |
| Meta-ExternalAgent | Meta | External training data | Yes |
| cohere-ai | Cohere | Training data collection | Yes |
| Amazonbot | Amazon | Alexa/AI training | Yes |
Training vs. Retrieval Crawlers
AI crawlers serve two distinct purposes, and this distinction affects how you should think about them:
Training Crawlers
Collect content to train AI models. Once your content is part of a training set, the model can draw on it in later answers, and changes to your page are not reflected until the model is retrained.
Impact: Long-term visibility in AI responses
Retrieval Crawlers
Fetch content in real-time when users ask questions. Similar to traditional search engine behavior.
Impact: Immediate visibility in search-enabled AI
Key Takeaway
You may want to allow retrieval bots while blocking training bots, or the other way around.
Retrieval bots bring traffic (like search engines). Training bots use your content to train models without attribution. Your strategy depends on your goals.
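Before changing anything, it can help to check what your current robots.txt already allows. The following is a minimal sketch using Python's standard `urllib.robotparser`; the training/retrieval split is taken from the table in this guide, and `SAMPLE_ROBOTS_TXT`, `audit`, and the example.com URL are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed grouping, based on the "Purpose" column of the table in this guide.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
                 "CCBot", "Bytespider"]
RETRIEVAL_BOTS = ["ChatGPT-User", "OAI-SearchBot", "PerplexityBot"]

# Hypothetical robots.txt that blocks only two training bots.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

def audit(robots_txt, url="https://example.com/page"):
    """Return {user_agent: allowed?} for every bot in the two lists."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url)
            for agent in TRAINING_BOTS + RETRIEVAL_BOTS}

if __name__ == "__main__":
    for agent, allowed in audit(SAMPLE_ROBOTS_TXT).items():
        kind = "training" if agent in TRAINING_BOTS else "retrieval"
        print(f"{agent:16} {kind:10} {'allowed' if allowed else 'blocked'}")
```

With the sample file, only GPTBot and ClaudeBot come back as blocked, which shows how a partial block list leaves other training crawlers free to fetch the site. This only reports what the rules say; it does not prove that a given crawler obeys them.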
Crawler Behavior Differences
AI crawlers behave differently from traditional search engine bots:
Crawl Patterns
AI training crawlers often do deep, infrequent crawls rather than continuous monitoring. They may crawl your entire site in a short burst, then not return for months.
JavaScript Rendering
Most AI crawlers do NOT render JavaScript. They see only your initial HTML, so SPAs and other client-side rendered content may be invisible to them.
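One rough way to test this is to check whether your key phrases appear in the raw HTML, ignoring anything inside script and style tags. The sketch below uses only the standard `html.parser` module; `visible_without_js` and the two sample pages are hypothetical, and a real check would run against the HTML your server actually returns.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text outside <script>/<style>, roughly what a non-rendering bot reads."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def visible_without_js(raw_html, phrases):
    """Return {phrase: True/False} for whether each phrase is in the static text."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    return {phrase: phrase.lower() in text for phrase in phrases}

# Hypothetical pages: a client-rendered shell and a server-rendered equivalent.
SPA_SHELL = ('<html><body><div id="root"></div>'
             '<script>render("Pricing plans")</script></body></html>')
SSR_PAGE = '<html><body><h1>Pricing plans</h1></body></html>'

if __name__ == "__main__":
    print(visible_without_js(SPA_SHELL, ["Pricing plans"]))
    print(visible_without_js(SSR_PAGE, ["Pricing plans"]))
```

If a phrase is missing from the static text, a crawler that does not execute JavaScript is unlikely to see it, and server-side rendering or prerendering for that page is worth considering.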
Rate Limiting
AI crawlers generally respect crawl-delay directives, but the directive is not part of the robots.txt standard, and some crawlers (like Bytespider) have been reported to ignore limits. Monitor your logs to confirm actual request rates.
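To see whether a crawler stays within the rate you expect, you can measure its busiest window from log timestamps. This is a minimal sketch assuming you have already extracted one bot's request times as `datetime` objects; `peak_requests` and the 60-second window are illustrative choices, not a standard.

```python
from collections import deque
from datetime import datetime, timedelta

def peak_requests(timestamps, window=timedelta(seconds=60)):
    """Return the largest number of requests seen within any sliding `window`."""
    recent = deque()
    peak = 0
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop requests that fall outside the window ending at `ts`.
        while ts - recent[0] >= window:
            recent.popleft()
        peak = max(peak, len(recent))
    return peak

if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    # Hypothetical request times for one bot, in seconds after `start`.
    hits = [start + timedelta(seconds=s) for s in (0, 10, 20, 70, 75)]
    print(peak_requests(hits))
```

For example, if you set a crawl delay of 10 seconds, a compliant bot should peak at roughly 6 requests per minute; a much higher peak suggests the directive is being ignored.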
Authentication
AI crawlers won't log in or bypass paywalls. Only publicly accessible content is crawled.
Business Context
Understanding why AI companies crawl content helps you make better decisions about crawler access. Training data directly affects what AI models "know" about your brand.
Identifying AI Bots in Your Logs
Look for these patterns in your server logs:
# GPTBot example
198.51.100.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"
# ClaudeBot example
192.0.2.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "ClaudeBot/1.0; +https://www.anthropic.com/claude-bot"
# PerplexityBot example
203.0.113.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "PerplexityBot"
The IP addresses above are placeholders from documentation ranges, not real crawler addresses.
See the Log Analysis guide for detailed instructions on tracking AI crawler activity.
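The pattern matching described above can be sketched in a few lines of Python. This assumes the user agent is the last double-quoted field on each log line, as in the samples; `detect_bot`, `count_ai_hits`, and the token list (taken from the table in this guide) are illustrative names, and your log format may differ.

```python
import re
from collections import Counter

# Tokens from the user agent table. Google-Extended is left out because it is
# a robots.txt token and does not appear as a user agent in request logs.
AI_BOT_TOKENS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "anthropic-ai",
                 "ClaudeBot", "PerplexityBot", "Bytespider", "CCBot",
                 "FacebookBot", "Meta-ExternalAgent", "cohere-ai", "Amazonbot"]

# Assumption: the user agent is the final quoted field on the line.
_LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')

def detect_bot(log_line):
    """Return the matching AI bot token for a log line, or None."""
    match = _LAST_QUOTED.search(log_line)
    if not match:
        return None
    user_agent = match.group(1).lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in user_agent:
            return token
    return None

def count_ai_hits(log_lines):
    """Count requests per AI bot token across an iterable of log lines."""
    return Counter(bot for bot in map(detect_bot, log_lines) if bot)

if __name__ == "__main__":
    sample = [
        '198.51.100.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" '
        '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
        'GPTBot/1.0; +https://openai.com/gptbot)"',
        '203.0.113.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" '
        '"PerplexityBot"',
        '192.0.2.7 - - [01/Jan/2024:00:00:01] "GET / HTTP/1.1" 200 - "-" '
        '"Mozilla/5.0 (X11; Linux x86_64)"',
    ]
    print(count_ai_hits(sample))
```

User agent strings can be spoofed, so treat these counts as a first pass and verify suspicious traffic against the IP ranges each vendor publishes where available.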
Quick Reference: Allow or Block?
| Goal | Recommendation |
|---|---|
| Maximum AI visibility | Allow all AI crawlers |
| Search-only visibility | Block training bots, allow retrieval bots |
| Protect content from training | Block all training crawlers |
| No AI presence | Block all AI crawlers |
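The table above can be turned into robots.txt rules mechanically. The sketch below is one possible mapping, not an official recommendation: the goal names, `robots_txt_for`, and the bot grouping (taken from the user agent table) are assumptions, and the trailing catch-all group assumes you have no other rules to preserve.

```python
# Assumed grouping, based on the "Purpose" column of the user agent table.
TRAINING_BOTS = ("GPTBot", "anthropic-ai", "ClaudeBot", "Google-Extended",
                 "Bytespider", "CCBot", "FacebookBot", "Meta-ExternalAgent",
                 "cohere-ai", "Amazonbot")
RETRIEVAL_BOTS = ("ChatGPT-User", "OAI-SearchBot", "PerplexityBot")

# Hypothetical goal keys, one per row of the quick reference table.
# "search_only" and "protect_from_training" end up with the same block list.
BLOCKED_BY_GOAL = {
    "maximum_visibility": (),
    "search_only": TRAINING_BOTS,
    "protect_from_training": TRAINING_BOTS,
    "no_ai_presence": TRAINING_BOTS + RETRIEVAL_BOTS,
}

def robots_txt_for(goal):
    """Build robots.txt text that blocks the crawlers implied by `goal`."""
    blocked = BLOCKED_BY_GOAL[goal]
    lines = [f"User-agent: {agent}" for agent in blocked]
    if blocked:
        # One shared rule for the whole group of user agents listed above.
        lines += ["Disallow: /", ""]
    # Assumption: everything else (including regular search bots) stays allowed.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(robots_txt_for("search_only"))
```

Merge the generated groups into your existing robots.txt rather than overwriting it, and remember that these rules only affect crawlers that choose to honor them.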
Sources
- OpenAI Bots Documentation: GPTBot, ChatGPT-User, and OAI-SearchBot specifications
- Anthropic Crawler Documentation: ClaudeBot and anthropic-ai specifications
- From Googlebot to GPTBot | Cloudflare: AI crawler traffic analysis 2025
- Dark Visitors AI Agent Directory: Comprehensive AI crawler reference