AI Crawler Landscape

AI companies deploy crawlers to gather training data and enable real-time retrieval. Understanding these bots is the first step to controlling your AI visibility.

In this guide

Identify all major AI crawler user agents
Understand the difference between training and retrieval crawlers
Know which companies use which bots
Make informed decisions about crawler access


AI Crawler User Agents

This table lists the major known AI crawler user agents as of 2025. Use it as a reference when configuring your robots.txt.

User Agent	Company	Purpose	Respects robots.txt
GPTBot	OpenAI	Training data collection	Yes
ChatGPT-User	OpenAI	Real-time browsing (ChatGPT)	Yes
OAI-SearchBot	OpenAI	ChatGPT search retrieval (formerly SearchGPT)	Yes
anthropic-ai	Anthropic	Training data collection	Yes
ClaudeBot	Anthropic	Training data collection	Yes
Google-Extended	Google	Gemini (formerly Bard) training control; a robots.txt token with no separate user-agent string	Yes
PerplexityBot	Perplexity	Real-time search retrieval	Yes
Bytespider	ByteDance	Training data collection	Partial
CCBot	Common Crawl	Open dataset (used by many)	Yes
FacebookBot	Meta	Training data (Llama)	Yes
Meta-ExternalAgent	Meta	External training data	Yes
cohere-ai	Cohere	Training data collection	Yes
Amazonbot	Amazon	Alexa/AI training	Yes

Training vs. Retrieval Crawlers

AI crawlers serve two distinct purposes, and this distinction affects how you should think about them:

Training Crawlers

Collect content to train AI models. Once your content is part of a training dataset, the model can draw on it in its responses, and blocking the crawler later does not remove what was already collected.

GPTBot • ClaudeBot • Google-Extended

Impact: Long-term visibility in AI responses

Retrieval Crawlers

Fetch content in real time when users ask questions, much as traditional search engines do.

ChatGPT-User • PerplexityBot

Impact: Immediate visibility in search-enabled AI

Key Takeaway

You may want to allow retrieval bots while blocking training bots, or the other way around.

Retrieval bots can send referral traffic, much as search engines do. Training bots feed your content into models that typically do not attribute it. Which bots you allow depends on your goals.
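To confirm that a robots.txt policy treats training and retrieval bots the way you intend, you can test it offline with Python's standard `urllib.robotparser`. This is a minimal sketch: the policy text, the `audit` helper, and the example.com URL are illustrative assumptions, and the parser's matching rules may differ in detail from those of any particular crawler.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block two training crawlers, leave everything else open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def audit(robots_txt: str, agents: list[str],
          url: str = "https://example.com/") -> dict[str, bool]:
    """Return {agent: True if the agent may fetch url} for a robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = audit(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "ChatGPT-User", "PerplexityBot"])
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this policy, the two training bots are reported as blocked and the two retrieval bots as allowed.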

Crawler Behavior Differences

AI crawlers behave differently from traditional search engine bots:

Crawl Patterns

AI training crawlers often do deep, infrequent crawls rather than continuous monitoring. They may crawl your entire site in a short burst, then not return for months.

JavaScript Rendering

Most AI crawlers do NOT render JavaScript. They see only the HTML your server returns, so content rendered client-side (for example, in a single-page application) may be invisible to them.
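A rough way to approximate what a non-rendering crawler sees is to check whether a key phrase appears in the text of the raw HTML, ignoring anything inside script and style tags. The sketch below uses only the standard library; `visible_without_js` and the two sample pages are hypothetical, and in practice you would pass in the HTML your server actually returns.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_without_js(raw_html: str, phrase: str) -> bool:
    """True if phrase occurs in the page text without executing any JavaScript."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    text = " ".join(" ".join(extractor.chunks).split())  # normalize whitespace
    return phrase.lower() in text.lower()


# Hypothetical pages: one server-rendered, one that builds its content in JS.
SERVER_RENDERED = "<html><body><h1>Pricing plans</h1><p>Starter tier</p></body></html>"
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>render("Pricing plans")</script></body></html>'
)

print(visible_without_js(SERVER_RENDERED, "Pricing plans"))
print(visible_without_js(CLIENT_RENDERED, "Pricing plans"))
```

The check is deliberately simple: a phrase split across inline tags would be missed, so treat a negative result as a prompt to inspect the raw HTML by hand.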

Rate Limiting

Crawl-delay is a non-standard robots.txt directive, so support varies by crawler. AI crawlers generally respect it, but some (like Bytespider) have been reported to ignore limits. Monitor your logs.
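One way to check whether a crawler is honoring a delay is to compare the gaps between its consecutive requests against the delay you set. This is a minimal sketch, assuming you have already extracted the request timestamps for a single bot from your logs; `delay_violations` is a hypothetical helper name and the sample timestamps are made up.

```python
from datetime import datetime, timedelta


def delay_violations(timestamps: list[datetime], crawl_delay_s: float) -> int:
    """Count requests that arrived sooner than crawl_delay_s after the previous one."""
    ordered = sorted(timestamps)
    min_gap = timedelta(seconds=crawl_delay_s)
    return sum(1 for prev, cur in zip(ordered, ordered[1:]) if cur - prev < min_gap)


# Made-up request times for one bot: gaps of 2 s, 10 s, 1 s, and 17 s.
base = datetime(2025, 1, 1, 0, 0, 0)
hits = [base + timedelta(seconds=s) for s in (0, 2, 12, 13, 30)]

print(delay_violations(hits, crawl_delay_s=10))
```

A consistently non-zero count for a bot suggests it is not honoring the delay, at which point server-side rate limiting may be the more reliable control.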

Authentication

AI crawlers won't log in or bypass paywalls. Only publicly accessible content is crawled.

Business Context

Understanding why AI companies crawl content helps you make better decisions about crawler access. Training data directly affects what AI models "know" about your brand.


Identifying AI Bots in Your Logs

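Look for these user-agent tokens in your server logs. The lines below are illustrative: the IP addresses are reserved documentation addresses, and exact user-agent strings vary by crawler version, so match on the bot name rather than the full string.

# GPTBot example
198.51.100.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"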

# ClaudeBot example
192.0.2.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "ClaudeBot/1.0; +https://www.anthropic.com/claude-bot"

# PerplexityBot example
203.0.113.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "PerplexityBot"

See the Log Analysis guide for detailed instructions on tracking AI crawler activity.
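As a starting point, the matching can be sketched in a few lines of Python. The sketch assumes the user agent is the last quoted field on each log line, as in the examples above, and matches on the bot tokens from the user agent table. The function names and sample lines are illustrative. Because user-agent strings can be spoofed, a match shows what a client claims to be, not who it is.

```python
import re
from collections import Counter
from typing import Optional

# Tokens from the user agent table. Google-Extended is left out because it is
# a robots.txt control token and does not appear in request user-agent strings.
AI_AGENT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "anthropic-ai", "ClaudeBot",
    "PerplexityBot", "Bytespider", "CCBot", "FacebookBot",
    "Meta-ExternalAgent", "cohere-ai", "Amazonbot",
]

# Assumption: the user agent is the final double-quoted field on the line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def identify_ai_bot(log_line: str) -> Optional[str]:
    """Return the first AI crawler token found in the line's user agent, if any."""
    match = UA_FIELD.search(log_line)
    if not match:
        return None
    user_agent = match.group(1).lower()
    for token in AI_AGENT_TOKENS:
        if token.lower() in user_agent:
            return token
    return None


def count_ai_hits(lines: list[str]) -> Counter:
    """Count requests per AI crawler token across the given log lines."""
    counts: Counter = Counter()
    for line in lines:
        bot = identify_ai_bot(line)
        if bot:
            counts[bot] += 1
    return counts


SAMPLE_LINES = [
    '198.51.100.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '192.0.2.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "ClaudeBot/1.0"',
    '203.0.113.1 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "PerplexityBot"',
    '203.0.113.9 - - [01/Jan/2024:00:00:00] "GET /page HTTP/1.1" 200 - "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

counts = count_ai_hits(SAMPLE_LINES)
for bot, hits in counts.items():
    print(f"{bot}: {hits}")
```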

Quick Reference: Allow or Block?

Goal	Recommendation
Maximum AI visibility	Allow all AI crawlers
Search-only visibility	Block training bots, allow retrieval bots
Protect content from training	Block all training crawlers
No AI presence	Block all AI crawlers
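The table can be turned into a small generator that emits robots.txt rules for a chosen goal. This is a sketch under stated assumptions: the goal keys and `build_robots_txt` are hypothetical names, and the split of bots into training and retrieval groups follows the Purpose column of the user agent table (with CCBot and Amazonbot treated as training crawlers), which you may want to adjust.

```python
# Grouping assumed from the Purpose column of the user agent table.
TRAINING_BOTS = [
    "GPTBot", "anthropic-ai", "ClaudeBot", "Google-Extended", "Bytespider",
    "CCBot", "FacebookBot", "Meta-ExternalAgent", "cohere-ai", "Amazonbot",
]
RETRIEVAL_BOTS = ["ChatGPT-User", "OAI-SearchBot", "PerplexityBot"]

# Hypothetical goal keys mapped to the bots each goal blocks.
# The two middle goals block the same set, as in the table.
GOALS = {
    "maximum_visibility": [],
    "search_only": TRAINING_BOTS,
    "protect_from_training": TRAINING_BOTS,
    "no_ai_presence": TRAINING_BOTS + RETRIEVAL_BOTS,
}


def build_robots_txt(goal: str) -> str:
    """Return robots.txt text that blocks the bots associated with the goal."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in GOALS[goal]]
    groups.append("User-agent: *\nAllow: /")  # everyone else stays allowed
    return "\n\n".join(groups) + "\n"


print(build_robots_txt("search_only"))
```

Keep in mind that robots.txt is advisory: it only affects crawlers that choose to honor it.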

Sources

OpenAI Bots Documentation: GPTBot, ChatGPT-User, and OAI-SearchBot specifications
Anthropic Crawler Documentation: ClaudeBot and anthropic-ai specifications
From Googlebot to GPTBot | Cloudflare: AI crawler traffic analysis 2025
Dark Visitors AI Agent Directory: Comprehensive AI crawler reference