AI Crawlers & robots.txt Guide 2026

Sometime in 2023, "block the AI bots" became reflex advice. Publishers added a wall of Disallow: / rules, felt protected, and moved on. Three years later, many of those same sites are asking why they never show up in ChatGPT or Perplexity answers.

Here's the uncomfortable answer: they blocked the wrong bots. As of June 2026, every major AI company runs separate crawlers for model training, search indexing, and live user requests — and they honor separate robots.txt entries. Block the training bot and you keep your content out of the next model. Block the search bot and you erase yourself from AI answers entirely.

This guide covers every AI crawler that matters in 2026, what each one actually does, how to verify the real ones, and three copy-paste robots.txt templates depending on where you stand.

The three types of AI bots (and why the difference is everything)

Treating "AI crawlers" as one category is the single most common robots.txt mistake we see in site audits. There are three distinct types:

1. Training crawlers collect content to train future foundation models. Examples: GPTBot, ClaudeBot, meta-externalagent, CCBot. Blocking these keeps your content out of training datasets. It does not remove you from AI search results.

2. Search-index crawlers build the retrieval indexes that AI assistants cite when answering questions. Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot. Blocking these makes you invisible when ChatGPT, Claude, or Perplexity search the web. If AI visibility matters to your business, these bots are as important as Googlebot.

3. User-action fetchers retrieve a specific page because a human asked for it — someone pasted your URL into ChatGPT, or an assistant fetched your pricing page mid-conversation. Examples: ChatGPT-User, Claude-User, Perplexity-User. These are the closest thing AI has to a "click." Both OpenAI and Perplexity note that robots.txt rules may not apply to these user-initiated requests, since a person — not a scheduler — triggered them.
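The three categories above can be turned into a small lookup for sorting log traffic. This is a minimal sketch, not an official taxonomy: the mapping only covers the example bots named in the list, and `BOT_TYPES` and `classify_user_agent` are illustrative names. The User-Agent strings in the usage lines are simplified stand-ins, not exact copies of what each vendor sends.

```python
# Tokens are checked in insertion order; user-action and search tokens come
# first so a more specific bot is not mistaken for its training sibling.
BOT_TYPES = {
    "ChatGPT-User": "user-action",
    "Claude-User": "user-action",
    "Perplexity-User": "user-action",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "meta-externalagent": "training",
    "CCBot": "training",
}


def classify_user_agent(user_agent):
    """Return 'training', 'search', 'user-action', or 'other' for a raw header."""
    ua = user_agent.lower()
    for token, bot_type in BOT_TYPES.items():
        if token.lower() in ua:
            return bot_type
    return "other"


# Simplified example headers (assumed shapes, for illustration only).
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(classify_user_agent("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
```

A match here only says what a request claims to be. The header is self-reported, so the result is a label for analysis, not proof of identity.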

One more wrinkle: Google-Extended and Applebot-Extended are not crawlers at all. They are robots.txt tokens. They never make HTTP requests and will never appear in your server logs. Google-Extended controls whether content already crawled by Googlebot can be used for Gemini training and grounding; Applebot-Extended does the same for Apple's foundation models. Note what they don't control: blocking Google-Extended does not remove you from Google's AI Overviews. Those are fed by regular Googlebot, so the only robots.txt-level way out is blocking Googlebot and leaving Google Search entirely. Page-level snippet controls such as nosnippet are a separate mechanism that Google documents for limiting what its Search features show.

The complete AI crawler table (June 2026)

Bot	Company	Type	Respects robots.txt?	Should you allow it?
GPTBot	OpenAI	Training	Yes	Your call — training only, no visibility impact
OAI-SearchBot	OpenAI	Search index	Yes	Yes — powers ChatGPT search citations
ChatGPT-User	OpenAI	User action	Partially (user-initiated)	Yes — these are real humans reading your page
ClaudeBot	Anthropic	Training	Yes	Your call — training only
Claude-SearchBot	Anthropic	Search index	Yes	Yes — powers Claude's web search results
Claude-User	Anthropic	User action	Yes (per Anthropic)	Yes
PerplexityBot	Perplexity	Search index	Yes	Yes — explicitly not used for training
Perplexity-User	Perplexity	User action	Generally no for user-requested fetches (per Perplexity's own docs)	Yes — blocking it barely works anyway
Google-Extended	Google	Training token (no crawler)	n/a — robots.txt directive only	Your call — affects Gemini training/grounding, not Search or AI Overviews
Applebot-Extended	Apple	Training token (no crawler)	n/a — robots.txt directive only	Your call — Applebot itself still powers Siri/Spotlight
CCBot	Common Crawl	Training (dataset feeds many models)	Yes	Your call — blocking exits many datasets at once, including research corpora
Bytespider	ByteDance	Training	No — widely reported to ignore robots.txt and spoof UAs	Block at firewall level if you care; robots.txt alone won't stop it
meta-externalagent	Meta	Training	Yes	Your call — feeds Llama and Meta AI
Amazonbot	Amazon	Training + Alexa answers	Yes	Lean yes if Alexa/Rufus answers matter to you

Three honest footnotes to this table:

1. "Your call" is genuinely your call. Whether your content trains future models is a licensing and philosophy question, not a technical one. There's a reasonable argument that being in training data makes models more likely to know your brand exists, but as of June 2026 nobody has published convincing causal evidence either way. We won't pretend otherwise.

2. Bytespider is the outlier. ByteDance publishes no IP ranges, and the bot is widely reported in industry analyses to ignore robots.txt, though there is no first-party confirmation from ByteDance either way. If blocking it matters to you, do it in your WAF or CDN, not in a text file it doesn't read.

3. xAI's Grok has no reliably documented crawler with published IP ranges, which is why it's absent from the table. Industry reporting suggests its fetching is effectively unverifiable.

Verify before you trust (or block)

Any scraper can call itself GPTBot. The user-agent string is just a header, so verification means checking the source IP. The good news: OpenAI, Anthropic, Perplexity, and Common Crawl all publish machine-readable IP ranges as of June 2026 (Anthropic added theirs after previously declining to publish ranges):

Provider	IP range list
OpenAI (GPTBot)	`https://openai.com/gptbot.json`
OpenAI (OAI-SearchBot)	`https://openai.com/searchbot.json`
OpenAI (ChatGPT-User)	`https://openai.com/chatgpt-user.json`
Anthropic (all bots)	`https://claude.com/crawling/bots.json`
Perplexity (PerplexityBot)	`https://www.perplexity.com/perplexitybot.json`
Perplexity (Perplexity-User)	`https://www.perplexity.com/perplexity-user.json`
Common Crawl (CCBot)	`https://index.commoncrawl.org/ccbot.json`

A quick check from your terminal:

curl -s https://openai.com/gptbot.json | jq -r '.prefixes[] | .ipv4Prefix // .ipv6Prefix'

If a request claims to be GPTBot from an IP outside those ranges, it's an impostor — block it without guilt. This also means you should never block AI crawlers by IP guesswork: a misfired IP block can prevent the legitimate bot from even reading your robots.txt.
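The same check can be scripted for use against log entries. A minimal sketch, assuming the range files share the shape implied by the jq command above (a top-level `prefixes` list whose items carry `ipv4Prefix` or `ipv6Prefix`); other providers may use a different layout. The function names are illustrative, and the sample ranges are reserved documentation addresses, not real crawler ranges.

```python
import ipaddress
import json

# Placeholder data in the assumed file shape. Replace with the downloaded
# contents of a provider's published range file.
SAMPLE_RANGES = """
{"prefixes": [
  {"ipv4Prefix": "192.0.2.0/24"},
  {"ipv6Prefix": "2001:db8::/32"}
]}
"""


def load_prefixes(range_json):
    """Parse a range file into a list of ip_network objects."""
    data = json.loads(range_json)
    networks = []
    for item in data.get("prefixes", []):
        cidr = item.get("ipv4Prefix") or item.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def is_published_ip(ip, networks):
    """True if the address falls inside any published prefix."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network and vice versa,
    # so mixed lists are safe to scan.
    return any(addr in net for net in networks)


networks = load_prefixes(SAMPLE_RANGES)
print(is_published_ip("192.0.2.10", networks))
print(is_published_ip("203.0.113.5", networks))
```

In practice the JSON would be fetched from the provider's URL and cached, since the ranges change over time. A request that fails this check while carrying a crawler's user-agent is the impostor case described above.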

Three robots.txt templates

Pick the stance that matches your business, then copy, paste, and adjust the Disallow paths for your own private routes.

Stance 1: Maximum visibility

For brands, SaaS, publishers monetizing attention — anyone whose problem is not enough AI visibility. This is the policy we run on hejgeo.com.

# Search & user-action bots: these put you in AI answers
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
Disallow: /api/
Disallow: /app/
Allow: /

# Training bots: allowed, we want models to know us
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: meta-externalagent
User-agent: Amazonbot
User-agent: CCBot
Disallow: /api/
Disallow: /app/
Allow: /

# Everyone else. A crawler follows only the most specific group that names it,
# so the private paths are repeated in the named groups above.
User-agent: *
Disallow: /api/
Disallow: /app/

Sitemap: https://www.example.com/sitemap.xml

Stance 2: Balanced — visible in AI answers, out of training data

The pragmatic middle ground: AI assistants can cite you, but your content doesn't train the next model generation.

# Allow: search & user-action bots (AI visibility)
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
Disallow: /api/
Allow: /

# Block: training bots and training tokens
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: meta-externalagent
User-agent: Amazonbot
User-agent: CCBot
User-agent: Bytespider
Disallow: /

# Everyone else. Named groups do not inherit these rules,
# which is why /api/ is repeated in the allow group above.
User-agent: *
Disallow: /api/

Sitemap: https://www.example.com/sitemap.xml

Stance 3: Restrictive — block everything AI

For paywalled or licensed content where any AI reuse is off the table. Go in with open eyes: you will not appear in ChatGPT, Claude, or Perplexity answers, and the bots that ignore robots.txt need a WAF rule on top.

User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: meta-externalagent
User-agent: Amazonbot
User-agent: CCBot
User-agent: Bytespider
Disallow: /

Sitemap: https://www.example.com/sitemap.xml

Remember: robots.txt is a request, not a lock. Compliant companies honor it; Bytespider is widely reported not to, and user-action fetchers operate under different rules because a human initiated the request. If "block everything" is a legal requirement rather than a preference, enforce it at the CDN level.
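Before deploying a policy, it can help to see how a parser reads it. A minimal sketch using Python's standard-library `urllib.robotparser`; the policy string is a trimmed, hypothetical variant of the balanced stance, and `check_policy` and `SomeOtherBot` are illustrative names. One assumption to flag: the standard-library parser applies the first matching rule in file order, while crawlers following RFC 9309 use the longest match, so the sketch lists the narrower Disallow before `Allow: /` to get the same answer under both.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy for the example; paste your own file contents here.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Disallow: /api/
Allow: /

User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /api/
"""


def check_policy(robots_txt, agents, paths):
    """Return {(agent, path): allowed} for every agent/path combination."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, path): parser.can_fetch(agent, path)
        for agent in agents
        for path in paths
    }


result = check_policy(
    ROBOTS_TXT,
    agents=["OAI-SearchBot", "GPTBot", "SomeOtherBot"],
    paths=["/blog/post", "/api/data"],
)
for (agent, path), allowed in result.items():
    print(f"{agent:15} {path:12} {'allow' if allowed else 'block'}")
```

Note that a crawler matching a named group ignores the `User-agent: *` group altogether, so any path that must stay private has to be listed in every group that applies. The check only models compliant parsers; it says nothing about bots that skip robots.txt.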

How to detect AI crawler traffic

You can't manage what you don't measure. Two practical routes:

Server logs. Grep your access logs for the user-agent strings:

grep -ciE "GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-SearchBot|Claude-User|PerplexityBot|Perplexity-User|meta-externalagent|Amazonbot|Bytespider|CCBot" access.log

Then split by bot to see which kind of attention you're getting. Pay special attention to ChatGPT-User and Perplexity-User hits: each one means an AI assistant fetched your page live, mid-conversation, for a real person. That's the nearest thing to an "AI referral" metric most sites have, since assistants send little conventional referrer traffic.
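The per-bot split can be sketched in a few lines. This assumes one request per log line with the user-agent somewhere in the line, as in the common combined log format; `count_bot_hits` is an illustrative name, and the sample lines use made-up requests from reserved documentation IPs.

```python
import re
from collections import Counter

# Same tokens as the grep pattern above.
BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "meta-externalagent", "Amazonbot", "Bytespider", "CCBot",
]
BOT_PATTERN = re.compile("|".join(re.escape(t) for t in BOT_TOKENS), re.IGNORECASE)
CANONICAL = {t.lower(): t for t in BOT_TOKENS}


def count_bot_hits(lines):
    """Count log lines per bot, using the first bot token found on each line."""
    counts = Counter()
    for line in lines:
        match = BOT_PATTERN.search(line)
        if match:
            counts[CANONICAL[match.group(0).lower()]] += 1
    return counts


# Fabricated sample lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.2 - - [01/Jun/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 840 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '192.0.2.1 - - [01/Jun/2026:10:00:09 +0000] "GET /blog HTTP/1.1" 200 733 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.3 - - [01/Jun/2026:10:00:12 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 Firefox/120.0"',
]

for bot, hits in count_bot_hits(SAMPLE_LOG).most_common():
    print(f"{bot:20} {hits}")
```

To run it on a real file, pass an open file object instead of the sample list, for example `count_bot_hits(open("access.log", encoding="utf-8", errors="replace"))`. Like the grep, this counts claimed identities only.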

Cloudflare AI Crawl Control. If you're behind Cloudflare, AI Crawl Control (formerly AI Audit) gives you per-crawler analytics and one-click allow/block rules — user-agent-based detection on free plans, fingerprint-based identification with Bot Management on paid plans. As of June 2026 their Pay-per-Crawl program (charging AI companies for access) is still in closed beta. It's the fastest way to get visibility without touching log files, and the per-crawler toggles enforce your policy even against UA spoofing on paid tiers.

Whichever route you choose, cross-check suspicious traffic against the published IP ranges above. In our experience, a noticeable share of "GPTBot" traffic on smaller sites is third-party scrapers borrowing the name.

The takeaway

Your robots.txt is now a strategic document. It decides whether AI assistants can recommend you, whether your content trains future models, and whether you can tell legitimate crawlers from impostors. The worst position is the accidental one — a blanket block from 2023 quietly deleting you from the fastest-growing discovery channel, or a default-open config you've never reviewed.

Set a deliberate policy. Verify who's actually crawling. Re-check quarterly — this list looked different a year ago and will look different next year.

If you want to see what your robots.txt is actually doing to your AI visibility: HejGeo audits your AI-bot policy as part of its SEO crawl and tracks how often ChatGPT, Claude, Perplexity, and Gemini actually mention and cite your site. The free plan includes the full audit and weekly visibility tracking — no credit card, no per-engine add-ons. Run a free check at hejgeo.com.
