Robots.txt and AI Crawlers: Are You Blocking ChatGPT?
Robots.txt is a simple text file at the root of your website that tells crawlers which pages they can and cannot access. Most site owners set it up once and forget it. But the AI crawler landscape has changed dramatically in the past two years, and many sites are now blocking AI crawlers that did not exist when their robots.txt was last updated. If ChatGPT, Claude, or Perplexity cannot access your site, they cannot cite you.
The AI crawlers that matter
The primary AI crawlers you need to allow are:

- GPTBot (OpenAI's crawler for ChatGPT's underlying models; the browsing and search features use separate agents, covered below)
- ClaudeBot (Anthropic, powers Claude AI)
- PerplexityBot (Perplexity AI)
- Google-Extended (Google's product token for Gemini AI training and grounding; the crawling itself is done by Googlebot, and blocking the token does not affect normal Search indexing)
- Amazonbot (Amazon, used for Alexa and Amazon's AI features)
- Meta-ExternalAgent (Meta AI)
- YouBot (You.com)

Each of these is documented as respecting robots.txt instructions. Any that are blocked in your robots.txt will not be able to access your site for AI purposes.
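If the rest of your robots.txt is restrictive, you can allow these crawlers explicitly with a shared group. A minimal sketch, using the user agent tokens each vendor currently documents (double-check the exact names in the vendors' crawler docs, since they do change):

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: Amazonbot
User-agent: Meta-ExternalAgent
User-agent: YouBot
Allow: /

If nothing in robots.txt matches a crawler, it is allowed by default, so a group like this only changes anything when other rules would otherwise block these bots.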
How to check your current robots.txt
Visit yourdomain.com/robots.txt in a browser. Look for any lines that say 'User-agent: GPTBot' or 'User-agent: *' followed by 'Disallow: /'. A wildcard block (User-agent: * / Disallow: /) shuts out every crawler that does not have its own, more specific group in the file, AI bots included. A specific block (User-agent: GPTBot / Disallow: /) blocks only that AI crawler. Also check for security-layer blocks in Cloudflare, Nginx, or other infrastructure that return 403 or 404 responses to known AI bot user agents; these have the same practical effect as a robots.txt block.
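You can also test this programmatically. Below is a minimal sketch using Python's standard-library robotparser; the domain and page URL are placeholders, and the stdlib parser's matching logic is simpler than Google's RFC 9309 rules, so treat borderline results as approximate:

from urllib.robotparser import RobotFileParser

# Placeholder URLs: substitute your own robots.txt and a real page on your site.
ROBOTS_URL = "https://yourdomain.com/robots.txt"
PAGE_URL = "https://yourdomain.com/blog/example-post"

# User agent tokens for the AI crawlers discussed above.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
           "Amazonbot", "Meta-ExternalAgent", "YouBot"]

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse robots.txt

for bot in AI_BOTS:
    allowed = parser.can_fetch(bot, PAGE_URL)
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")

Note that this only reads robots.txt; it will not reveal Cloudflare or server-level blocks, which you would have to test by sending requests with the bot's user agent string.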
Fixing accidental AI crawler blocks
If you have a wildcard block and want to allow AI crawlers while keeping other restrictions, give each AI crawler its own group with an explicit allow rule, as in the sketch below. Order is not what decides the outcome: under the Robots Exclusion Protocol (RFC 9309), a crawler follows the group whose User-agent line matches it most specifically, so a dedicated GPTBot group overrides the wildcard group for GPTBot entirely, and within a group the most specific path rule wins. If you use Cloudflare Bot Management or similar tools, also check whether they have rules targeting AI crawlers by user agent string. These infrastructure-level blocks are invisible in robots.txt but have the same effect.
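A sketch of that pattern, assuming you want to keep the site closed to everything else but open it to GPTBot and ClaudeBot (extend the second group with whichever other crawlers you want to admit):

User-agent: *
Disallow: /

User-agent: GPTBot
User-agent: ClaudeBot
Allow: /

Because those two bots now match their own group, the wildcard group no longer applies to them at all; if there are paths you still want to keep off-limits to them (an admin area, say), repeat those Disallow lines inside their group.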
When blocking is intentional
There are legitimate reasons to block some AI crawlers. If your site contains proprietary content you do not want used in AI training, blocking training crawlers (as distinct from retrieval crawlers) is a reasonable choice. Note that some AI systems use separate user agents for training versus real-time retrieval. If you want to be cited in AI answers without contributing to training data, check whether each vendor offers separate opt-outs for the two. OpenAI, for instance, documents GPTBot as its training crawler and uses the separate OAI-SearchBot and ChatGPT-User agents for search and user-initiated browsing.
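A sketch of that split for OpenAI's documented agents, assuming you want to opt out of training while remaining retrievable and citable:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Allow: /

Other vendors draw the training/retrieval line differently or not at all, so check each crawler's documentation before assuming the same split applies.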
Recommended robots.txt configuration
A minimal AEO-friendly robots.txt allows all crawlers by default, since most AI crawlers are benign. The standard minimum is: User-agent: * / Allow: / with a Sitemap: line pointing to your sitemap.xml. If you need to block specific paths (e.g. admin areas, staging pages, duplicate content), use targeted Disallow rules on those specific paths rather than broad rules that might catch AI crawlers. After updating robots.txt, run an AEO audit at /aeo/scores to confirm crawler access is no longer a failing signal.
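Put together, a minimal AEO-friendly robots.txt along those lines might look like this (the blocked paths and the sitemap URL are placeholders for your own):

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /staging/

Sitemap: https://yourdomain.com/sitemap.xml

Under the most-specific-match rule, the two Disallow lines win for those directories because their paths are longer than the bare Allow: /, and everything else stays open to all crawlers, AI bots included.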
Ready to improve your AI visibility?
Run a free audit and get your score across 6 AEO categories.
Check if AI crawlers can access your site