Robots.txt for AI Crawlers
Your robots.txt file is your first line of control over AI crawler access. Most major AI companies respect these directives, so configure them thoughtfully.
In this guide
- Configure robots.txt for specific AI crawlers
- Allow or block training data collection
- Protect sensitive sections while allowing others
- Common configuration patterns and examples
Basic Syntax
Robots.txt uses simple directives. Each AI crawler has a specific user-agent you can target:
# Target a specific AI crawler
User-agent: GPTBot
Disallow: /private/
Allow: /
# Block all AI crawlers from a section
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /members/
Configuration Examples
Allow All AI Crawlers (Maximum Visibility)
# Allow all AI crawlers full access
# This maximises your AI visibility
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: CCBot
Allow: /
Block Training, Allow Search (Balanced Approach)
# Block training crawlers (your content won't be used for model training)
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
# Allow retrieval crawlers (AI can still cite you in real-time)
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
Block All AI Crawlers
# Complete AI crawler block
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Amazonbot
Disallow: /
Selective Access Patterns
You can fine-tune access to allow crawlers on some sections but not others:
Allow Blog, Block Products
# AI crawlers can use your blog but not your product or pricing pages
User-agent: GPTBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/
User-agent: ClaudeBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/
Protect User-Generated Content
# Block areas with user content (privacy protection)
User-agent: GPTBot
Allow: /
Disallow: /forum/
Disallow: /comments/
Disallow: /user/
Disallow: /profile/
Key Takeaway
Remember: robots.txt is a request, not a lock.
Reputable AI companies (OpenAI, Anthropic, Google) respect robots.txt. However, there's no technical enforcement, so bad actors can ignore it. Don't rely on robots.txt alone for truly sensitive content.
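Because robots.txt has no enforcement mechanism of its own, real blocking has to happen at the server, CDN, or application layer. Below is a minimal Python sketch of that idea as a WSGI middleware; the bot list is an illustrative assumption meant to mirror your robots.txt policy, and user-agent strings can be spoofed, so treat this as a deterrent rather than a guarantee.
# Minimal WSGI middleware that returns 403 Forbidden to selected AI crawlers.
# BLOCKED_AI_BOTS is illustrative; keep it in sync with your robots.txt policy.
BLOCKED_AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

class BlockAICrawlers:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Compare the request's User-Agent header against the block list.
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot.lower() in user_agent for bot in BLOCKED_AI_BOTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)
Wrap your existing WSGI application (for example, app.wsgi_app = BlockAICrawlers(app.wsgi_app) in Flask), or implement the equivalent user-agent rule in your web server or CDN.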
Crawl-Delay for AI Bots
If AI crawlers are hitting your server too aggressively, you can request a delay:
# Request 10 seconds between requests
User-agent: GPTBot
Crawl-delay: 10
Allow: /
# Note: Not all bots respect Crawl-delay
Crawl-delay Support
Supported: GPTBot, PerplexityBot, CCBot
Not officially supported: Google-Extended (Google ignores crawl-delay)
Testing Your Configuration
After updating robots.txt, verify it works as expected:
1. Check syntax: Use Google's robots.txt Tester in Search Console to validate the syntax (a quick programmatic check is sketched after this list).
2. Verify accessibility: Ensure your robots.txt is reachable at yourdomain.com/robots.txt.
3. Monitor logs: Watch your server logs to confirm crawlers are respecting the rules.
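For the programmatic check, Python's standard-library urllib.robotparser can fetch your live robots.txt and evaluate it against specific user-agents and paths. The domain and sample paths below are placeholders, and Python's parser is stricter than some crawlers, so treat the output as a sanity check rather than a guarantee of crawler behaviour.
# Fetch the live robots.txt and report which paths each AI crawler may access.
# Replace the domain and sample paths with your own.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()

for agent in ("GPTBot", "ClaudeBot", "PerplexityBot"):
    for path in ("https://yourdomain.com/blog/", "https://yourdomain.com/members/"):
        verdict = "allowed" if rp.can_fetch(agent, path) else "blocked"
        print(f"{agent:15} {path} -> {verdict}")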
Business Context
Allowing AI crawlers is just the first step. To maximize the benefit, you need to ensure your content is structured in a way that AI can understand and accurately represent.
For more on structuring content for AI, see Building AI Authority.
Common Mistakes
Accidentally blocking everything
Disallow: / under User-agent: * will block ALL crawlers, including Google.
Typos in user-agent names
GPT-Bot ≠ GPTBot. A mistyped user-agent token matches nothing, so the rule is silently ignored; copy names exactly from each crawler's documentation.
Forgetting the trailing slash
Disallow: /blog matches every path that starts with /blog, including /blog-archive and /blogging. Use Disallow: /blog/ to limit the rule to the blog directory itself (see the check below).
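You can verify the prefix-matching behaviour yourself with the same urllib.robotparser module, which can parse a rule set from a string instead of a live site; the example below is a self-contained illustration.
# Demonstrates that "Disallow: /blog" is a prefix match, not a literal path.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("User-agent: GPTBot\nDisallow: /blog".splitlines())

# Both URLs are blocked because their paths start with /blog.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog/post"))          # False
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog-archive/post"))  # False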
Complete Reference Template
Copy this template and modify based on your needs:
# ======================
# AI Crawler Configuration
# ======================
# OpenAI
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Anthropic
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
# Google
User-agent: Google-Extended
Allow: /
# Perplexity
User-agent: PerplexityBot
Allow: /
# Common Crawl (used by many AI companies)
User-agent: CCBot
Allow: /
# Meta
User-agent: FacebookBot
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
# ByteDance
User-agent: Bytespider
Disallow: /
# Other AI
User-agent: cohere-ai
Allow: /
User-agent: Amazonbot
Allow: /
# ======================
# Standard Search Crawlers
# ======================
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Sitemap
Sitemap: https://yourdomain.com/sitemap.xml
Sources
- GPTBot Documentation | OpenAI: Official robots.txt configuration for GPTBot
- ClaudeBot Configuration | Anthropic: Official robots.txt guidance for Anthropic crawlers
- Robots.txt Introduction | Google: Robots.txt specification and best practices
- Robots.txt Standard | robotstxt.org: Official robots exclusion protocol