What is the difference between robots.txt, llms.txt and ai.txt?

They solve three different problems. robots.txt controls crawler ACCESS — which paths a bot may fetch (a signal, not enforcement). llms.txt is a content MAP — a Markdown file that tells AI models what your site is and where the important content lives, so assistants can summarize and cite you accurately; it does not restrict anything. ai.txt is a training OPT-OUT — a machine-readable declaration (popularized alongside Spawning's Do-Not-Train registry) that says whether your text and media may be used to train AI models. Access vs. discovery vs. consent.

Does llms.txt actually work in 2026?

Mostly as a developer-docs convenience, not as an SEO or AI-ranking lever. An SE Ranking study of ~300,000 domains found roughly 10% adoption after about 18 months. Google has said no on the record: Gary Illyes confirmed in July 2025 that Google doesn't support llms.txt and isn't planning to, and John Mueller likened it to the long-discredited keywords meta tag. In one analysis of 500M+ AI-bot visits over 90 days, only a few hundred actually requested llms.txt. Where it genuinely shines: API and dev-tool companies (Stripe, Vercel, Cloudflare, Anthropic, Cursor) ship llms.txt so AI coding assistants generate correct integration code from a clean, curated source.

Do AI crawlers obey robots.txt?

Some do, some don't. OpenAI's GPTBot and OAI-SearchBot honor robots.txt, and OpenAI explicitly recommends robots.txt (not llms.txt) for crawler control. Google-Extended lets you opt out of Gemini/Vertex training while keeping Search indexing. But robots.txt is fundamentally a request, and a meaningful share of AI-related traffic ignores it — which is exactly why publishers moved enforcement to the WAF/network layer (Cloudflare's default block, Pay-Per-Crawl 402).

What is ai.txt and who honors it?

ai.txt is a proposed root-level file that allows or denies AI developers the right to use a domain's text and media for training. It's closely associated with Spawning.ai's Do-Not-Train (DNT) registry — and major AI developers including Stability AI and Hugging Face have agreed to honor Spawning's DNT opt-outs. Like robots.txt, it's a consent signal that depends on the crawler choosing to respect it; it is not technically enforced.

If these files are weak, what actually stops AI crawlers?

Enforcement at the network layer. Bot categorization + WAF rules (Cloudflare, DataDome, Akamai) identify crawlers by user-agent and IP range and then allow, block (403), or charge (402 via Cloudflare Pay-Per-Crawl). That's the layer that has teeth — text files express intent, the WAF imposes it. See our pillar on the closing web for the full picture.

Should I ship all three files?

For most sites: keep a clean robots.txt with explicit AI-bot rules (GPTBot, Google-Extended, CCBot, etc.); add ai.txt / a Spawning DNT entry if you want a training opt-out signal; ship llms.txt only if you have docs or an API that AI coding assistants consume — otherwise it's low ROI. And remember none of them is a lock: if you need actual control, configure your CDN/WAF.

Web Scraping & AI · Standards · May 2026 · 11-min read

robots.txt vs llms.txt vs ai.txt in 2026: What Actually Controls AI Crawlers (and What Doesn't)

Three files, three jobs, and an enormous amount of confusion. One controls access, one is a content map, one is a training opt-out — and none of them is a lock. Here's the researched, no-hype breakdown of what each does, what it can't, and where real enforcement actually happens.

Coronium Technical Team

Published May 27, 2026

Verified 2026-05-27

• robots.txt: Access (a signal)
• llms.txt: Discovery (a map)
• ai.txt: Training (consent)

TL;DR

robots.txt controls crawler access: GPTBot and Google-Extended honor it, but it's a request, not a wall. llms.txt is a Markdown content map with ~10% adoption across ~300K domains; Google said no on the record and almost no crawler reads it, so its real value is feeding AI coding assistants clean docs. ai.txt is a training opt-out tied to Spawning's Do-Not-Train registry (honored by Stability AI and Hugging Face). None is enforced; actual control lives at the WAF / network layer.

robots.txt — access control, by request

The oldest of the three (the Robots Exclusion Protocol dates to 1994) tells crawlers which paths they may fetch. In the AI era it grew new user-agents. The honest ones obey it:

# Block OpenAI's training + search crawlers
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

# Opt out of Google Gemini/Vertex training,
# while KEEPING normal Search indexing
User-agent: Google-Extended
Disallow: /

# Common Crawl (feeds many AI datasets)
User-agent: CCBot
Disallow: /

GPTBot and OAI-SearchBot honor robots.txt — OpenAI even recommends robots.txt (not llms.txt) as the way to control its crawlers. Google-Extended is the useful nuance: it opts you out of Gemini/Vertex AI training without hurting Search rankings. But robots.txt is a signal — a polite request a non-compliant bot can ignore. That gap is why the rest of this article exists.
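To see how a compliant crawler would read the rules above, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `example.com` URL and the helper name `allowed_agents` are illustrative assumptions. The sketch only shows what a well-behaved bot would conclude; it says nothing about bots that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the sample file above, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, per user-agent, whether robots.txt permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real target.
    report = allowed_agents(
        ROBOTS_TXT,
        ["GPTBot", "OAI-SearchBot", "Google-Extended", "CCBot", "Googlebot"],
        "https://example.com/pricing",
    )
    for agent, ok in report.items():
        print(f"{agent:16} {'allowed' if ok else 'disallowed'}")
```

Googlebot has no matching group and there is no `*` group, so the parser treats it as allowed. That is the behavior that keeps normal Search indexing intact while the AI-specific agents are disallowed.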

llms.txt — a content map, not a control

Proposed in 2024, llms.txt is a Markdown file at your root: an H1 with your site name, a blockquote summary, and curated links to your most important pages. The idea: hand an AI model a clean, distilled map so it summarizes and cites you accurately instead of guessing from messy HTML. Crucially, it restricts nothing — it's a menu, not a lock.
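As a rough illustration of that structure, the sketch below renders an llms.txt from a site name, a summary, and groups of links. The function name `render_llms_txt`, the section heading, and every `example.com` URL are hypothetical. Grouping links under H2 headings with a short note after each link follows the public llms.txt proposal as we understand it, so treat the exact layout as an assumption rather than a validated format.

```python
def render_llms_txt(
    site_name: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render a minimal llms.txt: H1, blockquote summary, then link lists."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            # One Markdown link per line, followed by a short description.
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # All names and URLs here are placeholders.
    print(
        render_llms_txt(
            "Example Docs",
            "Payments API for developers.",
            {
                "Docs": [
                    ("Quickstart", "https://example.com/docs/quickstart.md", "First API call"),
                    ("Webhooks", "https://example.com/docs/webhooks.md", "Event delivery and retries"),
                ]
            },
        )
    )
```

Generating the file from the same source as your docs navigation keeps the map from drifting out of date, which matters more than the exact formatting.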

The adoption reality (2026)

• ~10% adoption — an SE Ranking study of ~300,000 domains, ~18 months after launch.
• Google said no — Gary Illyes (July 2025) confirmed Google doesn't support it; John Mueller compared it to the discredited keywords meta tag.
• Crawlers barely fetch it — in one analysis of 500M+ AI-bot visits over 90 days, only a few hundred requested llms.txt.
• OpenAI points to robots.txt, not llms.txt, for crawler control.

So is it useless? No — it's just misunderstood. Its real, working use case is developer documentation: Stripe, Vercel, Cloudflare, Anthropic, Coinbase, Pinecone and Cursor ship llms.txt because their users build with AI coding assistants right now, and a curated file is the difference between Cursor or Claude Code generating a working integration and hallucinating one. If you don't have docs or an API that assistants consume, llms.txt is low-ROI SEO theater.

ai.txt — a training opt-out

ai.txt is a proposed root-level file that allows or denies AI developers the right to use a domain's text and media for training — a consent layer distinct from access (robots.txt) and discovery (llms.txt). It's most closely associated with Spawning.ai's Do-Not-Train (DNT) registry, which provides machine-readable opt-out tooling for content owners.

The meaningful detail: major AI developers including Stability AI and Hugging Face have agreed to honor Spawning's DNT opt-outs. But like robots.txt, ai.txt is a consent signal: it works only when the crawler chooses to respect it, and it is not technically enforced. It expresses a wish; it doesn't impose one.

What actually enforces anything: the network layer

All three files share a weakness: they depend on the crawler's goodwill. The only layer with teeth is the WAF / CDN. Bot categorization plus WAF rules (Cloudflare, DataDome, Akamai) identify crawlers by user-agent and IP range, then do one of three things:

• Allow: opt-in, with analytics
• Block (403): Cloudflare default-deny
• Charge (402): Pay-Per-Crawl

The text files express intent; the WAF imposes it. This is the central thesis of The Closing Web in 2026 — and the reason a declared AI bot from a datacenter range is the easiest thing on the internet to block or bill, while a real visitor on a residential/mobile IP is not.
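The allow / block / charge decision can be sketched in a few lines. This is a toy model, not how Cloudflare, DataDome, or Akamai implement it: the policy table, the function name `decide`, and the IP ranges (reserved documentation ranges, not any crawler's real addresses) are all assumptions. Blocking a request that claims a bot user-agent from outside the declared range is also our own illustrative choice.

```python
from ipaddress import ip_address, ip_network

# Hypothetical policy: bot token -> declared IP ranges and the chosen action.
# The networks are RFC 5737 documentation ranges, NOT real crawler ranges.
DECLARED_BOTS = {
    "GPTBot": {"networks": [ip_network("192.0.2.0/24")], "action": "charge"},
    "CCBot": {"networks": [ip_network("198.51.100.0/24")], "action": "block"},
}

STATUS = {"allow": 200, "block": 403, "charge": 402}


def decide(user_agent: str, client_ip: str) -> int:
    """Return the HTTP status a toy edge rule would answer with."""
    ip = ip_address(client_ip)
    ua = user_agent.lower()
    for token, rule in DECLARED_BOTS.items():
        if token.lower() in ua:
            if any(ip in net for net in rule["networks"]):
                return STATUS[rule["action"]]
            # Claims to be a known bot but comes from an undeclared range:
            # treated as spoofed and blocked (an assumption of this sketch).
            return STATUS["block"]
    # No declared bot token matched: treat as an ordinary visitor.
    return STATUS["allow"]


if __name__ == "__main__":
    print(decide("Mozilla/5.0 (compatible; GPTBot/1.1)", "192.0.2.10"))
    print(decide("CCBot/2.0", "198.51.100.7"))
    print(decide("Mozilla/5.0 (Windows NT 10.0)", "203.0.113.5"))
```

The sketch makes the asymmetry concrete: a crawler that announces itself and arrives from a known range is trivial to match, while a request with an ordinary browser user-agent falls through to the default branch.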

What to actually ship

• robots.txt with explicit AI-bot rules (GPTBot, Google-Extended, CCBot…). Use Google-Extended to opt out of Gemini training without losing Search.
• ai.txt / Spawning DNT if you want a training-consent signal that at least some big labs honor.
• llms.txt only if you publish docs or an API that AI coding assistants consume; otherwise skip it.
• CDN/WAF rules if you need real control; that's the only layer that actually enforces.

On the collection side: if you operate on the data-gathering end, the same logic runs in reverse — a declared bot gets filtered, a real visitor on a residential/mobile IP doesn't. Want to see how a given site treats AI bots? Run our free AI Crawler Checker.

Related resources

• The Closing Web in 2026 (pillar): AI crawler blocking, Pay-Per-Crawl, and the data wars in full.
• Is web scraping legal in 2026?: hiQ, Meta v Bright Data, Reddit v Perplexity & DMCA §1201.
• AI Crawler Checker (free tool): See which AI bots a domain allows or blocks in its robots.txt.
• Cloudflare Pay-Per-Crawl deep-dive: The 402 paywall, the enforcement layer with teeth.
• How websites detect proxies in 2026: The 7-layer detection stack you must pass as a real visitor.
• Web scraping proxies: Real 4G/5G carrier IPs for legitimate public-data collection.
