What Is GPTBot? OpenAI's Web Crawler Explained

GPTBot is the web crawler operated by OpenAI, the company behind ChatGPT. It visits websites to collect content that may be used to train OpenAI's models; real-time page retrieval for ChatGPT's browsing feature is performed under a separate user agent, ChatGPT-User. Understanding how GPTBot works, what it can access, and how to configure your site for it is a core AEO (answer engine optimization) task for any brand that wants to appear in ChatGPT's answers.

Two functions, two user agents

OpenAI's web access serves two distinct purposes, and each is handled by a different user agent. First, GPTBot crawls the web for content that may be used to train future versions of OpenAI's language models. Content collected this way can only influence the model's knowledge at the next training cycle. Second, when a ChatGPT user triggers web browsing, pages are fetched under the ChatGPT-User agent rather than GPTBot, and the retrieved content supplements that specific response. Allowing both agents maximizes your brand's presence in both ChatGPT's training knowledge and its live web responses.

How to identify GPTBot in your server logs

GPTBot identifies itself with a user agent string containing the token 'GPTBot' followed by a version number, for example: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot). The version number can change, so match on the 'GPTBot' token rather than on the full string. Because any client can send this header, the user agent alone does not prove a request is genuine. OpenAI also publishes the IP address ranges that GPTBot requests originate from, so you can verify a request claiming to be GPTBot by checking whether its source IP falls within those ranges. This verification step is useful if you see unusual traffic claiming to be GPTBot or want to confirm that the bot visiting your site is the real one.
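The two checks described above (user agent token, then source IP) can be sketched in Python with the standard library's ipaddress module. This is a minimal illustration, not an official verification method. The ranges in EXAMPLE_GPTBOT_RANGES are documentation placeholder addresses and are NOT OpenAI's real ranges; the function names are hypothetical, and you would substitute the ranges OpenAI currently publishes.

```python
import ipaddress

# ASSUMPTION: placeholder CIDR blocks reserved for documentation (RFC 5737).
# These are NOT OpenAI's ranges; replace them with the published list.
EXAMPLE_GPTBOT_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def claims_gptbot(user_agent: str) -> bool:
    """True if the User-Agent header contains the GPTBot token (any version)."""
    return "gptbot" in user_agent.lower()


def ip_in_ranges(ip: str, cidr_ranges) -> bool:
    """True if the IP address falls inside at least one CIDR range."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


def is_verified_gptbot(user_agent: str, ip: str, cidr_ranges) -> bool:
    """A request counts as verified only if both the token and the IP match."""
    return claims_gptbot(user_agent) and ip_in_ranges(ip, cidr_ranges)


if __name__ == "__main__":
    ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
          "compatible; GPTBot/1.0; +https://openai.com/gptbot)")
    # Same user agent, two different source IPs from a hypothetical log.
    for source_ip in ("192.0.2.15", "203.0.113.9"):
        print(source_ip, is_verified_gptbot(ua, source_ip, EXAMPLE_GPTBOT_RANGES))
```

A request that carries the GPTBot token but comes from an address outside the published ranges is the case worth investigating, since the header itself is trivial to spoof.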

Configuring robots.txt for GPTBot

GPTBot respects robots.txt directives. To allow GPTBot full access, make sure no Disallow rule targets GPTBot and that no wildcard block (User-agent: * with Disallow: /) catches it by default. The explicit allow configuration is: User-agent: GPTBot followed by Allow: /. If you want to allow GPTBot but exclude specific sections (e.g. private content, premium content, or sections with legal restrictions on AI use), add targeted Disallow rules for those paths in the same GPTBot group. Site owners can also disallow training data collection while still permitting real-time retrieval. This is done with separate per-agent rules rather than a single setting, so check OpenAI's publisher guidance for the current list of user agents before relying on it.
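As a rough sketch of the rules described above, the following Python snippet embeds two example robots.txt files and checks them with the standard library's urllib.robotparser. The paths /premium/ and /account/ and the example.com URLs are hypothetical placeholders. One assumption worth flagging: Python's parser applies the first matching rule within a group, so the Disallow lines are listed before Allow: /. How GPTBot itself resolves overlapping rules is not something this sketch verifies.

```python
from urllib.robotparser import RobotFileParser

# Allow GPTBot everywhere except two hypothetical restricted sections.
# Disallow lines come first because urllib.robotparser uses first-match order.
SELECTIVE_ALLOW = """\
User-agent: GPTBot
Disallow: /premium/
Disallow: /account/
Allow: /
"""

# A wildcard block with no GPTBot group: GPTBot falls back to the * rules.
WILDCARD_BLOCK = """\
User-agent: *
Disallow: /
"""


def gptbot_can_fetch(robots_txt: str, url: str) -> bool:
    """Parse a robots.txt string and report whether GPTBot may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for url in ("https://example.com/blog/post",
                "https://example.com/premium/report"):
        print(url, gptbot_can_fetch(SELECTIVE_ALLOW, url))
    print("wildcard block:",
          gptbot_can_fetch(WILDCARD_BLOCK, "https://example.com/blog/post"))
```

Running a check like this against your live robots.txt before deploying a change is a cheap way to catch a wildcard rule that blocks GPTBot unintentionally.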

GPTBot vs ChatGPT-User

OpenAI uses related but distinct user agents. GPTBot is the crawler for training data collection. ChatGPT-User is the user agent used when ChatGPT's web browsing tool fetches a page in response to a user request. OpenAI also documents a separate search crawler, OAI-SearchBot, so check its current crawler documentation for the full list. For maximum ChatGPT citation coverage, allow all of these agents. If you want to restrict training data collection but still appear in ChatGPT web browsing results, you can block GPTBot while allowing ChatGPT-User. This is a legitimate choice for sites with proprietary content that want citation access without contributing to AI training.
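The split policy described above (block GPTBot, allow ChatGPT-User) might look like the robots.txt embedded in this Python sketch, which uses the standard library's urllib.robotparser to confirm that each agent gets a different answer. The example.com URL and the helper name access_by_agent are hypothetical, and the check reflects how Python's parser reads the file, not a guarantee about OpenAI's crawlers.

```python
from urllib.robotparser import RobotFileParser

# One group per agent: training crawler blocked, browsing agent allowed.
SPLIT_POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""


def access_by_agent(robots_txt: str, url: str,
                    agents=("GPTBot", "ChatGPT-User")) -> dict:
    """Return a mapping of user agent name to whether it may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    print(access_by_agent(SPLIT_POLICY, "https://example.com/pricing"))
```

Keeping each agent in its own group makes the intent explicit and avoids depending on how a wildcard group would be inherited.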

What GPTBot cannot access

GPTBot cannot access pages behind authentication, paywalls, or login walls. It should not be relied on to execute JavaScript beyond basic rendering, and it cannot reach content that appears only after a user interaction (clicking a button, submitting a form). It follows robots.txt restrictions and will not fetch blocked paths. A noindex meta tag is typically treated as a signal not to use that page's content. If your most important content sits behind any of these barriers, consider publishing a publicly accessible version or summary for AI crawlers, such as a landing page that describes the protected content in plain terms.
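One way to approximate what a crawler without JavaScript execution would see is to inspect the raw HTML your server returns. The sketch below uses Python's built-in html.parser to report two things for an HTML string: whether a robots meta tag contains noindex, and whether a key phrase appears in the static text outside script and style elements. It is an approximation under those assumptions, not a description of how GPTBot processes pages, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class StaticPageAudit(HTMLParser):
    """Collects static text and detects a robots noindex meta tag."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            if "noindex" in (attr.get("content") or "").lower():
                self.noindex = True
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies: they are not rendered static text.
        if not self._skip_depth:
            self.text_parts.append(data)


def audit_static_html(html: str, key_phrase: str) -> dict:
    """Report noindex status and whether key_phrase is in the static text."""
    parser = StaticPageAudit()
    parser.feed(html)
    parser.close()
    static_text = " ".join(" ".join(parser.text_parts).split()).lower()
    return {
        "noindex": parser.noindex,
        "phrase_in_static_html": key_phrase.lower() in static_text,
    }


if __name__ == "__main__":
    sample = (
        '<html><head><meta name="robots" content="noindex, nofollow"></head>'
        "<body><h1>Pricing</h1>"
        '<script>document.write("Enterprise plan")</script></body></html>'
    )
    print(audit_static_html(sample, "Pricing"))
    print(audit_static_html(sample, "Enterprise plan"))
```

If a phrase that matters to you only shows up after scripts run, that is a hint the content may be invisible to crawlers that read the initial HTML alone.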
