AI Crawlers 101: Is Your Site Even Letting GPTBot and ClaudeBot In?

If you want to show up in ChatGPT, Claude, or Perplexity, two things have to be true: the AI crawlers must be allowed in your robots.txt, and your content must exist in the raw HTML they fetch. The major AI crawlers do not execute JavaScript, so client-rendered pages are effectively invisible to them. And llms.txt, despite the hype, is an unofficial proposal that Google has publicly said it does not use.

If you want your brand to show up when someone asks ChatGPT, Claude, or Perplexity a question, two things have to be true. First, the AI crawlers have to be allowed into your site in your robots.txt file. Second, your actual content has to exist in the raw HTML those crawlers fetch, because the major AI crawlers do not run JavaScript. Get either one wrong and you are invisible to AI search, no matter how good your content is.

That is the whole game, and most sites get at least one of those two things wrong without realizing it. Let's walk through what these bots are, how they behave, and the robots.txt rules and rendering fixes that decide whether you exist in AI answers.

Who are the major AI crawlers, and what do they actually do?

There is no single "AI bot." Each major vendor runs several crawlers with different jobs, and they are controlled separately. Lumping them together is the first mistake teams make.

OpenAI runs three. GPTBot collects content that may be used to train future models. OAI-SearchBot discovers and indexes pages so they can surface in ChatGPT search results. ChatGPT-User fires when a person asks ChatGPT to fetch a specific URL during a session. OpenAI is explicit that each is independent: you can allow OAI-SearchBot so you appear in search results while disallowing GPTBot so your content is not used for training.

Anthropic mirrors this structure. Per Anthropic's own documentation, ClaudeBot is the training crawler, Claude-User fetches pages when a Claude user asks a question, and Claude-SearchBot indexes content for search-style answers. Blocking ClaudeBot stops training collection but does nothing about the two crawlers that put you in front of actual users.

Perplexity runs PerplexityBot for indexing and Perplexity-User for live, user-initiated fetches.

The practical takeaway: the crawlers that matter most for visibility are the search and user-facing ones (OAI-SearchBot, Claude-SearchBot, PerplexityBot, and the user fetchers), not the training bots. If your goal is to be cited and recommended, those are the ones you want in the door. This is the foundation of any serious AI search optimization program.
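One way to keep these roles straight when reading server logs is a small lookup table. The Python sketch below encodes only the crawler names and purposes described above. The names `AI_CRAWLERS`, `classify_crawler`, and `matters_for_visibility` are illustrative assumptions, and the exact user-agent tokens should be checked against each vendor's current documentation.

```python
from typing import Optional, Tuple

# Crawler token -> (vendor, purpose), as described in this guide.
# Purposes: "training", "search" (indexing), "user" (live, user-initiated fetch).
AI_CRAWLERS = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "search"),
    "ChatGPT-User": ("OpenAI", "user"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-SearchBot": ("Anthropic", "search"),
    "Claude-User": ("Anthropic", "user"),
    "PerplexityBot": ("Perplexity", "search"),
    "Perplexity-User": ("Perplexity", "user"),
}


def classify_crawler(user_agent: str) -> Optional[Tuple[str, str, str]]:
    """Return (token, vendor, purpose) if a known crawler token appears in the header."""
    ua = user_agent.lower()
    for token, (vendor, purpose) in AI_CRAWLERS.items():
        if token.lower() in ua:
            return token, vendor, purpose
    return None


def matters_for_visibility(user_agent: str) -> bool:
    """True for the search and user-facing crawlers, False for training bots and unknowns."""
    match = classify_crawler(user_agent)
    return match is not None and match[2] in ("search", "user")
```

A user-agent string alone can be spoofed, so treat this as a first-pass label for log lines, not as proof of identity.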

Why does server-rendered content matter so much for AI?

Here is the part that quietly breaks the most sites. Googlebot has a full rendering service that executes JavaScript, waits for frameworks to hydrate, and indexes content that only appears after the browser builds it. The major AI crawlers do not.

When Vercel and Merj analyzed roughly a billion crawler requests, they found that none of the major AI crawlers rendered JavaScript. GPTBot and ClaudeBot do fetch some JavaScript files, but the data showed no evidence they execute them. A separate analysis of over 500 million GPTBot fetches reached the same conclusion: zero evidence of JavaScript execution. They request the raw HTML, extract what is there, and leave.

The consequence is blunt. If your headlines, body copy, product details, or pricing only appear after client-side rendering, an AI crawler sees an empty shell. Your most important content is, for these bots, simply not there.

This is why the architecture of your site is an AI-visibility decision, not just an engineering one. Server-side rendering (SSR), static generation, or prerendering with frameworks like Next.js or Nuxt puts your content in the initial HTML response, where every crawler, AI or otherwise, can read it. If you are running a single-page app that assembles everything in the browser, fixing how your content renders is the highest-leverage thing you can do for AI search, and it is squarely a web development problem. A quick test: view the page source (not the inspector) or fetch the URL with curl. If your core content is not in that raw response, neither AI crawlers nor your future AI traffic will find it.
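The quick test above can also be scripted. The Python sketch below fetches a page with a single GET and no JavaScript execution, then reports which key phrases are missing from the raw response. The function names, the `raw-html-check/0.1` user-agent, and the sample pages are hypothetical and only illustrate the idea.

```python
import urllib.request
from typing import List


def fetch_raw_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL with one plain GET; nothing is rendered or executed."""
    request = urllib.request.Request(url, headers={"User-Agent": "raw-html-check/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_phrases(html: str, phrases: List[str]) -> List[str]:
    """Return the phrases that do not appear in the raw HTML (case-insensitive)."""
    haystack = html.lower()
    return [phrase for phrase in phrases if phrase.lower() not in haystack]


# Hypothetical responses: a client-rendered shell and a server-rendered page.
SPA_SHELL = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
SSR_PAGE = "<html><body><h1>Acme Pricing</h1><p>Starter plan: $29/month</p></body></html>"
KEY_PHRASES = ["Acme Pricing", "$29/month"]

if __name__ == "__main__":
    print("SPA shell is missing:", missing_phrases(SPA_SHELL, KEY_PHRASES))
    print("SSR page is missing:", missing_phrases(SSR_PAGE, KEY_PHRASES))
```

To check a live page, pass the result of `fetch_raw_html("https://your-site.example/pricing")` to `missing_phrases` along with a few phrases a visitor would see on that page. Any phrase that comes back as missing only exists after client-side rendering.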

How do you allow-list AI crawlers in robots.txt?

Once your content is server-rendered, you control access with robots.txt at the root of each domain. Allowing a crawler is the default; you only need explicit rules when you want to permit or block specific bots. Each crawler gets its own group, a User-agent line followed by that crawler's Allow and Disallow rules, which is exactly why allowing one crawler does not touch the others.

A configuration that welcomes AI search while declining training use might look like this:

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

That setup says: index me for AI search and let assistants fetch my pages for users, but do not use my content for model training. Plenty of brands make the opposite call and allow everything. There is no universally correct answer; the point is that it is a deliberate choice you control per user-agent, and you should make it on purpose rather than by accident.
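Before deploying a file like this, it can help to confirm that it says what you think it says. The Python sketch below feeds the configuration above to the standard library's `urllib.robotparser` and reports which crawlers may fetch a given URL. The `access_report` helper and the `example.com` URL are illustrative assumptions, and Python's parser is a simplified matcher, so a vendor's own parser may treat edge cases differently.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser

# The configuration from this guide: allow search and user fetchers, decline training.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def access_report(robots_txt: str, agents: List[str], url: str) -> Dict[str, bool]:
    """Map each user-agent to whether the given robots.txt lets it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "Claude-User",
              "PerplexityBot", "GPTBot", "ClaudeBot"]
    for agent, allowed in access_report(ROBOTS_TXT, agents, "https://example.com/pricing").items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A crawler with no matching group and no wildcard group is reported as allowed, which matches the default described above.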

Two cautions. Do not try to enforce these decisions by blocking IP addresses, because that can stop a crawler from even reading your robots.txt, which defeats the purpose. And remember robots.txt is a per-host file, so you need it on every subdomain you care about.
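Because the file is per-host, an audit has to cover every hostname you serve content from. As a minimal Python sketch, the helper below derives the robots.txt location for each distinct host in a list of page URLs. The function names and the `example.com` hostnames are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def robots_urls_to_audit(page_urls: Iterable[str]) -> List[str]:
    """One robots.txt URL per distinct scheme and host, sorted for stable output."""
    return sorted({robots_txt_url(url) for url in page_urls})


if __name__ == "__main__":
    pages = [
        "https://www.example.com/pricing",
        "https://shop.example.com/products/1",
        "https://www.example.com/blog/post",
    ]
    for url in robots_urls_to_audit(pages):
        print(url)
```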

How do you verify a crawler is genuine?

Bad actors spoof user-agent strings, so a string alone is not proof. For higher-confidence identification, compare the visiting IP against the vendor's published IP ranges where they exist. OpenAI and Perplexity publish machine-readable JSON range files for exactly this purpose; Anthropic's documentation has historically said it does not publish ranges, so check its current guidance. If the user-agent says GPTBot but the IP is not in OpenAI's list, treat it as fake. This is the same verification logic you would apply to Googlebot, and it belongs in your log analysis and analytics setup.
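The comparison itself is simple once you have the ranges. The Python sketch below assumes you have already downloaded a vendor's published range file and extracted its CIDR strings into a list; the JSON layout differs by vendor and is not shown here. The function names are illustrative, and `PLACEHOLDER_RANGES` uses reserved documentation addresses, not real vendor ranges.

```python
import ipaddress
from typing import Iterable

# Placeholder CIDRs from the reserved documentation blocks. These are NOT real
# crawler ranges; substitute the CIDRs from the vendor's published file.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """True if the IP address falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs)


def is_genuine_crawler(user_agent: str, ip: str, token: str, cidrs: Iterable[str]) -> bool:
    """True only if the header claims the crawler token AND the IP is in the vendor's ranges."""
    return token.lower() in user_agent.lower() and ip_in_ranges(ip, cidrs)


if __name__ == "__main__":
    claimed = "Mozilla/5.0 (compatible; GPTBot/1.0)"  # illustrative header, not an exact vendor string
    print(is_genuine_crawler(claimed, "192.0.2.44", "GPTBot", PLACEHOLDER_RANGES))
    print(is_genuine_crawler(claimed, "198.51.100.7", "GPTBot", PLACEHOLDER_RANGES))
```

Range files change over time, so a check like this is only as good as the freshness of the list it is given.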

What about llms.txt? Is it real?

You have probably heard you need an llms.txt file. The honest answer in mid-2026: it is an unofficial proposal, not a standard, and the major AI companies have not adopted it. When asked whether the presence of llms.txt files on some Google properties amounted to an endorsement, Google's John Mueller answered directly that it did not, and Google has repeatedly said it does not use the file. Unlike robots.txt, llms.txt has no enforcement and no meaningful uptake across OpenAI, Anthropic, Google, or Perplexity.

So should you create one? Adding it will not hurt, and if a future tool adopts it you are ready. But do not mistake it for the work. The two levers that actually determine whether AI systems can see and cite you are the ones above: server-rendered HTML so the content exists, and correct robots.txt rules so the right crawlers are allowed in. Everything else is optional polish on top of those fundamentals.

The short version

Letting GPTBot, ClaudeBot, and the rest "in" is really two questions. Can they read your content? That depends on server-side rendering, because they do not execute JavaScript. Are they allowed to? That depends on a handful of robots.txt lines you control per crawler. Fix those two things and you are genuinely accessible to AI search. Skip them and the most polished content in your industry never makes it into a single AI answer.

If you want help auditing what AI crawlers can actually see on your site, that is exactly what our AI search and web development teams do.

FAQ

At minimum, the ones tied to AI search and live answers: OpenAI's OAI-SearchBot and ChatGPT-User, Anthropic's Claude-SearchBot and Claude-User, and PerplexityBot. Training crawlers like GPTBot and ClaudeBot are a separate, optional decision. Each user-agent is controlled independently, so allowing one does not allow the others.
