Robots.txt for AI Crawlers
Your robots.txt file is your first line of control over AI crawler access. Most major AI companies respect these directives, so configure them thoughtfully.
In this guide
- Configure robots.txt for specific AI crawlers
- Allow or block training data collection
- Protect sensitive sections while allowing others
- Common configuration patterns and examples
Basic Syntax
Robots.txt uses simple directives. Each AI crawler has a specific user-agent you can target:
# Target a specific AI crawler
User-agent: GPTBot
Disallow: /private/
Allow: /
# Block all AI crawlers from a section
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /members/
Configuration Examples
Allow All AI Crawlers (Maximum Visibility)
# Allow all AI crawlers full access
# This maximises your AI visibility
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: CCBot
Allow: /
Block Training, Allow Search (Balanced Approach)
# Block training crawlers (your content won't be used for model training)
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
# Allow retrieval crawlers (AI can still cite you in real-time)
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
Block All AI Crawlers
# Complete AI crawler block
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Amazonbot
Disallow: /
Selective Access Patterns
You can fine-tune access to allow crawlers on some sections but not others:
Allow Blog, Block Products
# AI crawlers can use your blog but not your product or pricing pages
User-agent: GPTBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/
User-agent: ClaudeBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/
Protect User-Generated Content
# Block areas with user content (privacy protection)
User-agent: GPTBot
Allow: /
Disallow: /forum/
Disallow: /comments/
Disallow: /user/
Disallow: /profile/
Key Takeaway
Remember: robots.txt is a request, not a lock.
Reputable AI companies (OpenAI, Anthropic, Google) respect robots.txt. However, there's no technical enforcement, so bad actors can ignore it. Don't rely on robots.txt alone for truly sensitive content.
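Because robots.txt has no enforcement mechanism of its own, real blocking has to happen at the server, CDN, or application layer. Below is a minimal Python sketch of that idea as a WSGI middleware; the bot list is an illustrative assumption meant to mirror your robots.txt policy, and user-agent strings can be spoofed, so treat this as a deterrent rather than a guarantee.
# Minimal WSGI middleware that returns 403 Forbidden to selected AI crawlers.
# BLOCKED_AI_BOTS is illustrative; keep it in sync with your robots.txt policy.
BLOCKED_AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

class BlockAICrawlers:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Compare the request's User-Agent header against the block list.
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot.lower() in user_agent for bot in BLOCKED_AI_BOTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)
Wrap your existing WSGI application (for example, app.wsgi_app = BlockAICrawlers(app.wsgi_app) in Flask), or implement the equivalent user-agent rule in your web server or CDN.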
Crawl-Delay for AI Bots
If AI crawlers are hitting your server too aggressively, you can request a delay:
# Request 10 seconds between requests
User-agent: GPTBot
Crawl-delay: 10
Allow: /
# Note: Not all bots respect Crawl-delay
Crawl-delay Support
Supported: GPTBot, PerplexityBot, CCBot
Not officially supported: Google-Extended (Google ignores crawl-delay)
Testing Your Configuration
After updating robots.txt, verify it works as expected:
1. Check syntax: Use Google's robots.txt Tester in Search Console to validate the syntax (a quick programmatic check is sketched after this list).
2. Verify accessibility: Ensure your robots.txt is reachable at yourdomain.com/robots.txt.
3. Monitor logs: Watch your server logs to confirm crawlers are respecting the rules.
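For the programmatic check, Python's standard-library urllib.robotparser can fetch your live robots.txt and evaluate it against specific user-agents and paths. The domain and sample paths below are placeholders, and Python's parser is stricter than some crawlers, so treat the output as a sanity check rather than a guarantee of crawler behaviour.
# Fetch the live robots.txt and report which paths each AI crawler may access.
# Replace the domain and sample paths with your own.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()

for agent in ("GPTBot", "ClaudeBot", "PerplexityBot"):
    for path in ("https://yourdomain.com/blog/", "https://yourdomain.com/members/"):
        verdict = "allowed" if rp.can_fetch(agent, path) else "blocked"
        print(f"{agent:15} {path} -> {verdict}")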
Business Context
Allowing AI crawlers is just the first step. To maximize the benefit, you need to ensure your content is structured in a way that AI can understand and accurately represent.
For more on structuring content for AI, see Building AI Authority.
Common Mistakes
Accidentally blocking everything
Disallow: / under User-agent: * will block ALL crawlers, including Google.
Typos in user-agent names
GPT-Bot ≠ GPTBot. A mistyped user-agent token matches nothing, so the rule is silently ignored; copy names exactly from each crawler's documentation.
Forgetting the trailing slash
Disallow: /blog matches every path that starts with /blog, including /blog-archive and /blogging. Use Disallow: /blog/ to limit the rule to the blog directory itself (see the check below).
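You can verify the prefix-matching behaviour yourself with the same urllib.robotparser module, which can parse a rule set from a string instead of a live site; the example below is a self-contained illustration.
# Demonstrates that "Disallow: /blog" is a prefix match, not a literal path.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("User-agent: GPTBot\nDisallow: /blog".splitlines())

# Both URLs are blocked because their paths start with /blog.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog/post"))          # False
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog-archive/post"))  # False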
Complete Reference Template
Copy this template and modify based on your needs:
# ======================
# AI Crawler Configuration
# ======================
# OpenAI
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Anthropic
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
# Google
User-agent: Google-Extended
Allow: /
# Perplexity
User-agent: PerplexityBot
Allow: /
# Common Crawl (used by many AI companies)
User-agent: CCBot
Allow: /
# Meta
User-agent: FacebookBot
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
# ByteDance
User-agent: Bytespider
Disallow: /
# Other AI
User-agent: cohere-ai
Allow: /
User-agent: Amazonbot
Allow: /
# ======================
# Standard Search Crawlers
# ======================
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Sitemap
Sitemap: https://yourdomain.com/sitemap.xml
Sources
- GPTBot Documentation | OpenAI: Official robots.txt configuration for GPTBot
- ClaudeBot Configuration | Anthropic: Official robots.txt guidance for Anthropic crawlers
- Robots.txt Introduction | Google: Robots.txt specification and best practices
- Robots.txt Standard | robotstxt.org: Official robots exclusion protocol