
AI Crawler Landscape

AI companies deploy crawlers to gather training data and enable real-time retrieval. Understanding these bots is the first step to controlling your AI visibility.

In this guide

  • Identify all major AI crawler user agents
  • Understand the difference between training and retrieval crawlers
  • Know which companies use which bots
  • Make informed decisions about crawler access

AI Crawler User Agents

This table lists the major AI crawler user agents known as of 2025. It is not exhaustive, since new bots appear regularly. Use it as a reference when configuring your robots.txt.

User Agent | Company | Purpose | Respects robots.txt
GPTBot | OpenAI | Training data collection | Yes
ChatGPT-User | OpenAI | Real-time browsing (ChatGPT) | Yes
OAI-SearchBot | OpenAI | ChatGPT search retrieval (formerly SearchGPT) | Yes
anthropic-ai | Anthropic | Training data collection | Yes
ClaudeBot | Anthropic | Training data collection | Yes
Google-Extended | Google | Gemini (formerly Bard) training | Yes
PerplexityBot | Perplexity | Real-time search retrieval | Yes
Bytespider | ByteDance | Training data collection | Partial
CCBot | Common Crawl | Open dataset (used by many) | Yes
FacebookBot | Meta | Training data (Llama) | Yes
Meta-ExternalAgent | Meta | External training data | Yes
cohere-ai | Cohere | Training data collection | Yes
Amazonbot | Amazon | Alexa/AI training | Yes

Note: Google-Extended is a robots.txt control token rather than a separate crawler. Google fetches pages with its existing user agents, so this name does not appear in server logs.

Training vs. Retrieval Crawlers

AI crawlers serve two distinct purposes, and this distinction affects how you should think about them:

Training Crawlers

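Collect content to train AI models. Once your content is in a training set, it can shape the model's answers until the model is retrained, and blocking the crawler later does not remove what was already collected.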

Examples: GPTBot, ClaudeBot, Google-Extended

Impact: Long-term visibility in AI responses

Retrieval Crawlers

Fetch content in real time when users ask questions. Similar to traditional search engine behavior.

Examples: ChatGPT-User, PerplexityBot

Impact: Immediate visibility in search-enabled AI

Key Takeaway

You may want to allow retrieval bots while blocking training bots, or the other way around.

Retrieval bots can send referral traffic, much as search engines do. Training bots use your content to train models, typically without attribution or a link back. Your strategy depends on your goals.
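One way to sanity-check a "block training, allow retrieval" policy is to parse it with Python's standard `urllib.robotparser` and ask which bots may fetch a page. This is a minimal sketch: the robots.txt text, the `access_report` helper, and the example.com URL are illustrative assumptions, and real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the training crawlers named in this guide,
# leave everything else (including retrieval crawlers) allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def access_report(robots_txt, agents, url="https://example.com/page"):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = access_report(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "Google-Extended", "ChatGPT-User", "PerplexityBot"],
)
for agent, allowed in report.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Running the same check against your live robots.txt before deploying a change helps catch a rule that blocks more, or less, than intended.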

Crawler Behavior Differences

AI crawlers behave differently from traditional search engine bots:

Crawl Patterns

AI training crawlers often do deep, infrequent crawls rather than continuous monitoring. They may crawl your entire site in a short burst, then not return for months.

JavaScript Rendering

Most AI crawlers do NOT render JavaScript. They see only your initial HTML. SPAs and client-side rendered content may be invisible to training bots.
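To estimate what a non-rendering crawler sees, you can check whether a key phrase appears in the raw HTML outside of script and style tags. The sketch below uses only the standard library. The `StaticTextExtractor` class, the `visible_without_js` helper, and both sample pages are hypothetical, and a real audit would fetch your actual pages rather than inline strings.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects text that is readable in raw HTML without executing scripts."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_without_js(html, phrase):
    """True if phrase appears in the static text of html (case-insensitive)."""
    extractor = StaticTextExtractor()
    extractor.feed(html)
    extractor.close()
    return phrase.lower() in " ".join(extractor.chunks).lower()


# Illustrative pages: one server-rendered, one that needs JavaScript to show content.
SERVER_RENDERED = "<html><body><h1>Pricing plans</h1><p>Starts at $10.</p></body></html>"
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>render("Pricing plans")</script></body></html>'
)

print(visible_without_js(SERVER_RENDERED, "pricing plans"))
print(visible_without_js(CLIENT_RENDERED, "pricing plans"))
```

If an important phrase is missing from the static text, server-side rendering or prerendering that page is worth considering.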

Rate Limiting

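Crawl-delay is a non-standard robots.txt directive, and support for it varies from crawler to crawler. Some bots honor it, while others (like Bytespider) have been reported to ignore limits. Monitor your logs to see how each bot actually behaves.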
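A simple way to monitor this is to count each bot's requests per minute and flag any bot whose busiest minute exceeds a threshold you choose. The sketch below assumes the hits have already been extracted from your logs as (bot name, timestamp) pairs. The helper names, the bot names `BurstyBot` and `PoliteBot`, and the sample data are all synthetic.

```python
from collections import Counter
from datetime import datetime


def peak_requests_per_minute(hits):
    """hits: iterable of (bot_name, datetime). Returns {bot: busiest-minute count}."""
    per_minute = Counter()
    for bot, ts in hits:
        # Truncate each timestamp to the start of its minute.
        per_minute[(bot, ts.replace(second=0, microsecond=0))] += 1
    peaks = {}
    for (bot, _minute), count in per_minute.items():
        peaks[bot] = max(peaks.get(bot, 0), count)
    return peaks


def over_limit(hits, max_per_minute):
    """Return the sorted names of bots whose busiest minute exceeds max_per_minute."""
    peaks = peak_requests_per_minute(hits)
    return sorted(bot for bot, peak in peaks.items() if peak > max_per_minute)


# Synthetic sample: one bot sends 5 requests in a single minute,
# the other sends 1 request per minute for 3 minutes.
sample_hits = [("BurstyBot", datetime(2024, 1, 1, 0, 0, s)) for s in range(0, 50, 10)]
sample_hits += [("PoliteBot", datetime(2024, 1, 1, 0, m, 0)) for m in range(3)]

print(peak_requests_per_minute(sample_hits))
print(over_limit(sample_hits, max_per_minute=3))
```

The threshold is a site-specific choice; a bot that repeatedly exceeds it is a candidate for server-level rate limiting rather than a robots.txt rule.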

Authentication

AI crawlers won't log in or bypass paywalls. Only publicly accessible content is crawled.

Business Context

Understanding why AI companies crawl content helps you make better decisions about crawler access. Training data directly affects what AI models "know" about your brand.


Identifying AI Bots in Your Logs

Look for these patterns in your server logs:

# GPTBot example
198.51.100.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"

# ClaudeBot example
192.0.2.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "ClaudeBot/1.0; +https://www.anthropic.com/claude-bot"

# PerplexityBot example
203.0.113.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "PerplexityBot"

See the Log Analysis guide for detailed instructions on tracking AI crawler activity.
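As a starting point, the patterns above can be matched programmatically. The sketch below assumes the user agent is the last double-quoted field on each log line, as in the examples. The token list comes from the user-agent table in this guide, minus Google-Extended, which is a robots.txt token and does not show up in request headers. The function names are illustrative. Keep in mind that user-agent strings can be spoofed, so a match is a hint rather than proof.

```python
import re
from collections import Counter

# Tokens from the user-agent table in this guide; matching is case-insensitive.
AI_BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "anthropic-ai", "ClaudeBot",
    "PerplexityBot", "Bytespider", "CCBot", "FacebookBot",
    "Meta-ExternalAgent", "cohere-ai", "Amazonbot",
]

# Assumption: the user agent is the last double-quoted field on the line.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def detect_ai_bot(log_line):
    """Return the first matching AI bot token for a log line, or None."""
    match = LAST_QUOTED.search(log_line)
    if not match:
        return None
    user_agent = match.group(1).lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in user_agent:
            return token
    return None


def count_ai_hits(lines):
    """Count hits per AI bot across an iterable of log lines."""
    return Counter(bot for bot in map(detect_ai_bot, lines) if bot)


sample_lines = [
    '198.51.100.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '192.0.2.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" '
    '"ClaudeBot/1.0; +https://www.anthropic.com/claude-bot"',
    '203.0.113.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "PerplexityBot"',
    '203.0.113.9 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(count_ai_hits(sample_lines))
```

In practice you would pass an open log file to `count_ai_hits`, since it accepts any iterable of lines.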

Quick Reference: Allow or Block?

Goal | Recommendation
Maximum AI visibility | Allow all AI crawlers
Search-only visibility | Block training bots, allow retrieval bots
Protect content from training | Block all training crawlers
No AI presence | Block all AI crawlers
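The recommendations above can be turned into robots.txt text with a small generator. This is a sketch under stated assumptions: the goal keys, the `build_robots_txt` function, and the two short bot lists (taken from the training and retrieval examples earlier in this guide) are illustrative and far from exhaustive. robots.txt is also advisory, so a bot that ignores it needs server-level blocking instead.

```python
# Short example lists from this guide; extend them from the full user-agent table.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"]
RETRIEVAL_BOTS = ["ChatGPT-User", "PerplexityBot"]

# Hypothetical goal keys mapped to the bots each goal blocks.
# "search_only" and "protect_from_training" block the same set here;
# they differ in intent, not in the rules they produce.
BLOCKED_BY_GOAL = {
    "maximum_visibility": [],
    "search_only": TRAINING_BOTS,
    "protect_from_training": TRAINING_BOTS,
    "no_ai_presence": TRAINING_BOTS + RETRIEVAL_BOTS,
}


def build_robots_txt(goal):
    """Return robots.txt text that blocks the bots associated with goal."""
    blocked = BLOCKED_BY_GOAL[goal]
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in blocked]
    # Everything not explicitly blocked stays allowed.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


print(build_robots_txt("search_only"))
```

Keeping the bot lists in one place makes it easier to update the policy as new crawlers appear.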
