2026-06-10 · 8 min read · HejGeo Team
Every AI Crawler in 2026: The Complete robots.txt Guide
Sometime in 2023, "block the AI bots" became reflex advice. Publishers added a wall of Disallow: / rules, felt protected, and moved on. Three years later, many of those same sites are asking why they never show up in ChatGPT or Perplexity answers.
Here's the uncomfortable answer: they blocked the wrong bots. As of June 2026, the major AI companies split their crawling across separate bots for model training, search indexing, and live user requests, and each bot answers to its own robots.txt user-agent entry. Block the training bot and you keep your content out of the next model. Block the search bot and you erase yourself from AI answers entirely.
This guide covers every AI crawler that matters in 2026, what each one actually does, how to verify the real ones, and three copy-paste robots.txt templates depending on where you stand.
The three types of AI bots (and why the difference is everything)
Treating "AI crawlers" as one category is the single most common robots.txt mistake we see in site audits. There are three distinct types:
1. Training crawlers collect content to train future foundation models. Examples: GPTBot, ClaudeBot, meta-externalagent, CCBot. Blocking these keeps your content out of training datasets. It does not remove you from AI search results.
2. Search-index crawlers build the retrieval indexes that AI assistants cite when answering questions. Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot. Blocking these makes you invisible when ChatGPT, Claude, or Perplexity search the web. If AI visibility matters to your business, these bots are as important as Googlebot.
3. User-action fetchers retrieve a specific page because a human asked for it — someone pasted your URL into ChatGPT, or an assistant fetched your pricing page mid-conversation. Examples: ChatGPT-User, Claude-User, Perplexity-User. These are the closest thing AI has to a "click." Both OpenAI and Perplexity note that robots.txt rules may not apply to these user-initiated requests, since a person — not a scheduler — triggered them.
One more wrinkle: Google-Extended and Applebot-Extended are not crawlers at all. They are robots.txt tokens. They never make HTTP requests and will never appear in your server logs. Google-Extended controls whether content already crawled by Googlebot can be used for Gemini training and grounding; Applebot-Extended does the same for Apple's foundation models. Note what they don't control: blocking Google-Extended does not remove you from Google's AI Overviews — those are fed by regular Googlebot, and the only way out is leaving Google Search entirely.
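To make the three-way split concrete, here is a minimal Python sketch that maps a User-Agent header to one of the types above. The `BOT_TYPES` dictionary and the `classify_user_agent` function are our own illustrative names, and the plain substring match is a simplifying assumption: it tells you what a request claims to be, not whether the claim is genuine.

```python
from typing import Optional

# Bot-to-type mapping taken from the three categories described above.
# Google-Extended and Applebot-Extended are left out on purpose: they are
# robots.txt tokens, not crawlers, so they never appear in a User-Agent header.
BOT_TYPES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "meta-externalagent": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search-index",
    "Claude-SearchBot": "search-index",
    "PerplexityBot": "search-index",
    "ChatGPT-User": "user-action",
    "Claude-User": "user-action",
    "Perplexity-User": "user-action",
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the bot type claimed by a User-Agent header, or None if unknown."""
    ua = user_agent.lower()
    for token, bot_type in BOT_TYPES.items():
        # Case-insensitive substring match on the bot token.
        if token.lower() in ua:
            return bot_type
    return None


# Illustrative header values, not copied from any vendor's documentation.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(classify_user_agent("Mozilla/5.0 (compatible; Perplexity-User/1.0)"))
```

A lookup like this is enough to label log lines by type; it should not be used on its own to grant or deny access, since the header is trivially spoofed.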
The complete AI crawler table (June 2026)
| Bot | Company | Type | Respects robots.txt? | Should you allow it? |
|---|---|---|---|---|
| GPTBot | OpenAI | Training | Yes | Your call — training only, no visibility impact |
| OAI-SearchBot | OpenAI | Search index | Yes | Yes — powers ChatGPT search citations |
| ChatGPT-User | OpenAI | User action | Partially (user-initiated) | Yes — these are real humans reading your page |
| ClaudeBot | Anthropic | Training | Yes | Your call — training only |
| Claude-SearchBot | Anthropic | Search index | Yes | Yes — powers Claude's web search results |
| Claude-User | Anthropic | User action | Yes (per Anthropic) | Yes |
| PerplexityBot | Perplexity | Search index | Yes | Yes — explicitly not used for training |
| Perplexity-User | Perplexity | User action | Generally no for user-requested fetches (per Perplexity's own docs) | Yes — blocking it barely works anyway |
| Google-Extended | Google | Training token (no crawler) | n/a (robots.txt directive only) | Your call: affects Gemini training/grounding, not Search or AI Overviews |
| Applebot-Extended | Apple | Training token (no crawler) | n/a — robots.txt directive only | Your call — Applebot itself still powers Siri/Spotlight |
| CCBot | Common Crawl | Training (dataset feeds many models) | Yes | Your call — blocking exits many datasets at once, including research corpora |
| Bytespider | ByteDance | Training | No — widely reported to ignore robots.txt and spoof UAs | Block at firewall level if you care; robots.txt alone won't stop it |
| meta-externalagent | Meta | Training | Yes | Your call — feeds Llama and Meta AI |
| Amazonbot | Amazon | Training + Alexa answers | Yes | Lean yes if Alexa/Rufus answers matter to you |
Three honest footnotes to this table:
- "Your call" is genuinely your call. Whether your content trains future models is a licensing and philosophy question, not a technical one. There's a reasonable argument that being in training data makes models more likely to know your brand exists — but as of June 2026 nobody has published convincing causal evidence either way. We won't pretend otherwise.
- Bytespider is the outlier. ByteDance publishes no IP ranges, and the bot is widely reported in industry analyses to ignore robots.txt — there is no first-party confirmation from ByteDance either way. If blocking it matters to you, do it in your WAF or CDN, not in a text file it doesn't read.
- xAI's Grok has no reliably documented crawler with published IP ranges, which is why it's absent from the table. Industry reporting suggests its fetching is effectively unverifiable.
Verify before you trust (or block)
Any scraper can call itself GPTBot. The user-agent string is just a header, so verification means checking the source IP. The good news: OpenAI, Anthropic, Perplexity, and Common Crawl all publish machine-readable IP ranges as of June 2026 (Anthropic added theirs after previously declining to publish ranges):
| Provider | IP range list |
|---|---|
| OpenAI (GPTBot) | https://openai.com/gptbot.json |
| OpenAI (OAI-SearchBot) | https://openai.com/searchbot.json |
| OpenAI (ChatGPT-User) | https://openai.com/chatgpt-user.json |
| Anthropic (all bots) | https://claude.com/crawling/bots.json |
| Perplexity (PerplexityBot) | https://www.perplexity.com/perplexitybot.json |
| Perplexity (Perplexity-User) | https://www.perplexity.com/perplexity-user.json |
| Common Crawl (CCBot) | https://index.commoncrawl.org/ccbot.json |
A quick check from your terminal:
curl -s https://openai.com/gptbot.json | jq -r '.prefixes[] | .ipv4Prefix // .ipv6Prefix'
If a request claims to be GPTBot from an IP outside those ranges, it's an impostor — block it without guilt. This also means you should never block AI crawlers by IP guesswork: a misfired IP block can prevent the legitimate bot from even reading your robots.txt.
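For checking many requests at once, the same test can be scripted. The sketch below uses Python's standard `ipaddress` module and assumes the JSON shape implied by the jq filter above (a `prefixes` list whose entries carry either `ipv4Prefix` or `ipv6Prefix`). The ranges in `SAMPLE_RANGES` are documentation-only placeholder addresses, not real crawler ranges, and the function names are ours.

```python
import ipaddress
import json

# Placeholder document in the shape the jq filter above expects.
# These are RFC 5737 / RFC 3849 documentation ranges, NOT real crawler
# ranges; in practice, load the JSON published by the provider instead.
SAMPLE_RANGES = json.loads("""
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
""")


def load_networks(doc):
    """Turn the 'prefixes' list into ip_network objects."""
    networks = []
    for entry in doc.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def is_published_ip(ip, networks):
    """Return True if the address falls inside any of the given ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


networks = load_networks(SAMPLE_RANGES)
print(is_published_ip("192.0.2.10", networks))   # inside the placeholder range
print(is_published_ip("203.0.113.5", networks))  # outside it
```

To use it for real, fetch the relevant list from the table above, pass the parsed JSON to `load_networks`, and refresh it on a schedule, since published ranges change.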
Three robots.txt templates
Pick the stance that matches your business, copy it, paste it, and adjust the Disallow paths to match your own private routes.
Stance 1: Maximum visibility
For brands, SaaS, publishers monetizing attention — anyone whose problem is not enough AI visibility. This is the policy we run on hejgeo.com.
# Search & user-action bots — these put you in AI answers
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
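Disallow: /api/
Disallow: /app/
Allow: /
# Training bots, allowed: we want models to know us
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: meta-externalagent
User-agent: Amazonbot
User-agent: CCBot
Disallow: /api/
Disallow: /app/
Allow: /
# Everyone else. A crawler follows only the most specific group that names it,
# so the private paths are repeated in the named groups above.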
User-agent: *
Disallow: /api/
Disallow: /app/
Sitemap: https://www.example.com/sitemap.xml
Stance 2: Balanced — visible in AI answers, out of training data
The pragmatic middle ground: AI assistants can cite you, but your content doesn't train the next model generation.
# Allow: search & user-action bots (AI visibility)
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
# Named groups do not inherit the * rules, so the private path is repeated here
Disallow: /api/
Allow: /
# Block: training bots and training tokens
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: meta-externalagent
User-agent: Amazonbot
User-agent: CCBot
User-agent: Bytespider
Disallow: /
User-agent: *
Disallow: /api/
Sitemap: https://www.example.com/sitemap.xml
Stance 3: Restrictive — block everything AI
For paywalled or licensed content where any AI reuse is off the table. Go in with open eyes: you will not appear in ChatGPT, Claude, or Perplexity answers, and the bots that ignore robots.txt need a WAF rule on top.
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: meta-externalagent
User-agent: Amazonbot
User-agent: CCBot
User-agent: Bytespider
Disallow: /
Sitemap: https://www.example.com/sitemap.xml
Remember: robots.txt is a request, not a lock. Compliant companies honor it; Bytespider is widely reported not to, and user-action fetchers operate under different rules because a human initiated the request. If "block everything" is a legal requirement rather than a preference, enforce it at the CDN level.
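Before deploying a policy, it is worth checking that it parses the way you expect. Here is a minimal sketch using Python's standard `urllib.robotparser`. The `POLICY` string is a trimmed, hypothetical policy in the spirit of the balanced stance (one search bot, one training bot, one catch-all group), and `www.example.com` is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

# Trimmed, hypothetical policy: one search bot allowed, one training bot
# blocked, and /api/ closed to everyone. Disallow is listed before Allow
# because Python's parser applies the first matching rule in file order.
POLICY = """\
User-agent: OAI-SearchBot
Disallow: /api/
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /api/
"""


def build_parser(robots_txt):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def can_fetch(parser, bot, path):
    """Ask whether the named bot may fetch the path on a placeholder host."""
    return parser.can_fetch(bot, "https://www.example.com" + path)


parser = build_parser(POLICY)
for bot in ("OAI-SearchBot", "GPTBot", "SomeOtherBot"):
    for path in ("/pricing", "/api/users"):
        print(f"{bot:15} {path:12} allowed={can_fetch(parser, bot, path)}")
```

With this policy the sketch reports OAI-SearchBot allowed on /pricing and blocked on /api/users, GPTBot blocked on both, and an unnamed bot blocked only on /api/users. One caveat: Python's parser takes the first matching rule in a group, whereas RFC 9309 tells crawlers to use the longest match, so treat this as a sanity check and keep overlapping rules ordered from most to least specific.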
How to detect AI crawler traffic
You can't manage what you don't measure. Two practical routes:
Server logs. Grep your access logs for the user-agent strings:
grep -ciE "GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-SearchBot|Claude-User|PerplexityBot|Perplexity-User|meta-externalagent|Amazonbot|Bytespider|CCBot" access.log
Then split by bot to see which kind of attention you're getting. Pay special attention to ChatGPT-User and Perplexity-User hits: each one means an AI assistant fetched your page live, mid-conversation, for a real person. That's the nearest thing to an "AI referral" metric most sites have, since assistants send little conventional referrer traffic.
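The per-bot split can be sketched in a few lines of Python. This version reuses the user-agent tokens from the grep pattern above and counts matching lines per bot; the sample log lines are invented for illustration, and `count_bot_hits` is our own name. As with grep, it matches the token anywhere in the line, so it counts claimed identities, not verified ones.

```python
import re
from collections import Counter

# Same user-agent tokens as the grep pattern above.
BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "meta-externalagent", "Amazonbot", "Bytespider", "CCBot",
]
BOT_PATTERN = re.compile(
    "|".join(re.escape(token) for token in BOT_TOKENS), re.IGNORECASE
)
# Map any casing back to the canonical spelling for reporting.
CANONICAL = {token.lower(): token for token in BOT_TOKENS}


def count_bot_hits(lines):
    """Count log lines per bot, using the first bot token found on each line."""
    hits = Counter()
    for line in lines:
        match = BOT_PATTERN.search(line)
        if match:
            hits[CANONICAL[match.group(0).lower()]] += 1
    return hits


# Invented sample lines; real access-log formats vary by server.
SAMPLE_LINES = [
    '203.0.113.7 - - "GET /pricing HTTP/1.1" 200 "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '203.0.113.8 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.9 - - "GET /docs HTTP/1.1" 200 "Mozilla/5.0 (compatible; chatgpt-user/1.0)"',
    '203.0.113.10 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for bot, count in count_bot_hits(SAMPLE_LINES).most_common():
    print(bot, count)
```

To run it against a real log, pass an open file object instead of `SAMPLE_LINES`, for example `count_bot_hits(open("access.log", errors="replace"))`.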
Cloudflare AI Crawl Control. If you're behind Cloudflare, AI Crawl Control (formerly AI Audit) gives you per-crawler analytics and one-click allow/block rules — user-agent-based detection on free plans, fingerprint-based identification with Bot Management on paid plans. As of June 2026 their Pay-per-Crawl program (charging AI companies for access) is still in closed beta. It's the fastest way to get visibility without touching log files, and the per-crawler toggles enforce your policy even against UA spoofing on paid tiers.
Whichever route you choose, cross-check suspicious traffic against the published IP ranges above. In our experience, a noticeable share of "GPTBot" traffic on smaller sites is third-party scrapers borrowing the name.
The takeaway
Your robots.txt is now a strategic document. It decides whether AI assistants can recommend you, whether your content trains future models, and whether you can tell legitimate crawlers from impostors. The worst position is the accidental one — a blanket block from 2023 quietly deleting you from the fastest-growing discovery channel, or a default-open config you've never reviewed.
Set a deliberate policy. Verify who's actually crawling. Re-check quarterly — this list looked different a year ago and will look different next year.
If you want to see what your robots.txt is actually doing to your AI visibility: HejGeo audits your AI-bot policy as part of its SEO crawl and tracks how often ChatGPT, Claude, Perplexity, and Gemini actually mention and cite your site. The free plan includes the full audit and weekly visibility tracking — no credit card, no per-engine add-ons. Run a free check at hejgeo.com.