Coronium Mobile Proxies
Web Scraping & AI · Standards · May 2026 · 11-min read

robots.txt vs llms.txt vs ai.txt in 2026: What Actually Controls AI Crawlers (and What Doesn't)

Three files, three jobs, and an enormous amount of confusion. One controls access, one is a content map, one is a training opt-out — and none of them is a lock. Here's the researched, no-hype breakdown of what each does, what it can't, and where real enforcement actually happens.

Coronium Technical Team
Published May 27, 2026
Verified 2026-05-27
  • robots.txt: access (a signal)
  • llms.txt: discovery (a map)
  • ai.txt: training (consent)

TL;DR

robots.txt controls crawler access — GPTBot & Google-Extended honor it, but it's a request, not a wall. llms.txt is a Markdown content map; ~10% adoption across 300K domains, Google said no on the record, almost no crawler reads it — its real value is feeding AI coding assistants clean docs. ai.txt is a training opt-out tied to Spawning's Do-Not-Train registry (honored by Stability, Hugging Face). None is enforced — actual control lives at the WAF / network layer.

robots.txt — access control, by request

The oldest of the three (the Robots Exclusion Protocol dates to 1994), robots.txt tells crawlers which paths they may fetch. In the AI era, AI vendors added their own user-agent tokens that you can target with separate rules. The compliant crawlers obey those rules:

# Block OpenAI's training + search crawlers
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

# Opt out of Google Gemini/Vertex training,
# while KEEPING normal Search indexing
User-agent: Google-Extended
Disallow: /

# Common Crawl (feeds many AI datasets)
User-agent: CCBot
Disallow: /

GPTBot and OAI-SearchBot honor robots.txt — OpenAI even recommends robots.txt (not llms.txt) as the way to control its crawlers. Google-Extended is the useful nuance: it opts you out of Gemini/Vertex AI training without hurting Search rankings. But robots.txt is a signal — a polite request a non-compliant bot can ignore. That gap is why the rest of this article exists.
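To make the "polite request" point concrete, here is a minimal sketch of what a compliant crawler does before fetching a page, using Python's standard `urllib.robotparser`. The `allowed` helper and the `example.com` URLs are illustrative assumptions, not any vendor's actual implementation; the rules mirror the snippet above.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the robots.txt snippet above (comments omitted).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: a well-behaved crawler runs a check like this
    voluntarily. Nothing on the server side forces it to.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        print(agent, allowed(agent, "https://example.com/docs/"))
```

Note that the check runs entirely inside the crawler. A bot that never calls it sees no difference, which is exactly the gap described above. Googlebot stays allowed here because no group names it and there is no catch-all `*` rule.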

llms.txt — a content map, not a control

Proposed in 2024, llms.txt is a Markdown file at your root: an H1 with your site name, a blockquote summary, and curated links to your most important pages. The idea: hand an AI model a clean, distilled map so it summarizes and cites you accurately instead of guessing from messy HTML. Crucially, it restricts nothing — it's a menu, not a lock.
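As a rough illustration of that layout, the sketch below assembles an llms.txt body from a site name, a summary, and a list of links. The function name, the "Docs" section heading, and the example URLs are assumptions made for this example; the `- [title](url): note` link style follows our reading of the public proposal, so check it against the current spec before shipping.

```python
def build_llms_txt(
    site_name: str,
    summary: str,
    links: list[tuple[str, str, str]],
    section: str = "Docs",
) -> str:
    """Build an llms.txt body: H1, blockquote summary, then curated links.

    links holds (title, url, note) tuples. Hypothetical helper for
    illustration only; it restricts nothing and controls no crawler.
    """
    lines = [f"# {site_name}", "", f"> {summary}", "", f"## {section}", ""]
    for title, url, note in links:
        lines.append(f"- [{title}]({url}): {note}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        build_llms_txt(
            "Example Docs",
            "API reference and guides for the Example platform.",
            [
                ("Quickstart", "https://example.com/docs/quickstart.md", "First API call"),
                ("Auth", "https://example.com/docs/auth.md", "Keys and tokens"),
            ],
        )
    )
```

The output is plain Markdown served at the site root, which is why it helps a coding assistant read your docs but cannot stop anything from crawling them.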

The adoption reality (2026)

  • ~10% adoption — an SE Ranking study of ~300,000 domains, ~18 months after launch.
  • Google said no — Gary Illyes (July 2025) confirmed Google doesn't support it; John Mueller compared it to the discredited keywords meta tag.
  • Crawlers barely fetch it — in one analysis of 500M+ AI-bot visits over 90 days, only a few hundred requested llms.txt.
  • OpenAI points to robots.txt, not llms.txt, for crawler control.

So is it useless? No — it's just misunderstood. Its real, working use case is developer documentation: Stripe, Vercel, Cloudflare, Anthropic, Coinbase, Pinecone and Cursor ship llms.txt because their users build with AI coding assistants right now, and a curated file is the difference between Cursor or Claude Code generating a working integration and hallucinating one. If you don't have docs or an API that assistants consume, llms.txt is low-ROI SEO theater.

ai.txt — a training opt-out

ai.txt is a proposed root-level file that allows or denies AI developers the right to use a domain's text and media for training — a consent layer distinct from access (robots.txt) and discovery (llms.txt). It's most closely associated with Spawning.ai's Do-Not-Train (DNT) registry, which provides machine-readable opt-out tooling for content owners.

The meaningful detail: major AI developers, including Stability AI and Hugging Face, have agreed to honor Spawning's DNT opt-outs. But like robots.txt, ai.txt is a consent signal: it works only when the crawler chooses to respect it, and it is not technically enforced. It expresses a wish; it doesn't impose one.

What actually enforces anything: the network layer

All three files share a weakness: they depend on the crawler's goodwill. The only layer with teeth is the WAF / CDN. Bot categorization plus WAF rules (Cloudflare, DataDome, Akamai) identify crawlers by user-agent and IP range, then do one of three things:

  • Allow: opt-in, with analytics
  • Block (403): Cloudflare default-deny
  • Charge (402): Pay-Per-Crawl

The text files express intent; the WAF imposes it. This is the central thesis of The Closing Web in 2026 — and the reason a declared AI bot from a datacenter range is the easiest thing on the internet to block or bill, while a real visitor on a residential/mobile IP is not.
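The allow / block / charge decision can be sketched as a small function that matches on user-agent and source IP. Everything specific here is a labeled assumption: the `POLICY` table, the `DECLARED_RANGES` (documentation-reserved addresses, not any vendor's real ranges), and the choice to block a request that claims a bot name from an unexpected IP. Real WAF products expose this through their own rule languages, not through code like this.

```python
import ipaddress
from enum import Enum


class Action(Enum):
    ALLOW = 200   # serve the request
    BLOCK = 403   # refuse
    CHARGE = 402  # payment required


# Hypothetical site policy per declared bot token.
POLICY = {
    "GPTBot": Action.CHARGE,
    "CCBot": Action.BLOCK,
    "OAI-SearchBot": Action.ALLOW,
}

# Hypothetical published IP ranges (RFC 5737 documentation addresses).
DECLARED_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],
    "CCBot": [ipaddress.ip_network("198.51.100.0/24")],
    "OAI-SearchBot": [ipaddress.ip_network("192.0.2.0/24")],
}


def decide(user_agent: str, client_ip: str) -> Action:
    """Pick an action from the user-agent token and the source IP."""
    ip = ipaddress.ip_address(client_ip)
    ua = user_agent.lower()
    for token, action in POLICY.items():
        if token.lower() in ua:
            if any(ip in net for net in DECLARED_RANGES.get(token, [])):
                return action
            # Assumption: a bot name from an undeclared IP is treated as spoofed.
            return Action.BLOCK
    # No declared bot token: treated as an ordinary visitor in this sketch.
    return Action.ALLOW


if __name__ == "__main__":
    print(decide("Mozilla/5.0 (compatible; GPTBot/1.2)", "192.0.2.10"))
    print(decide("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0", "203.0.113.5"))
```

Unlike the robots.txt check, this runs on the server side of the connection, so it applies whether or not the crawler cooperates.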

What to actually ship

  • robots.txt with explicit AI-bot rules (GPTBot, Google-Extended, CCBot…). Use Google-Extended to opt out of Gemini training without losing Search.
  • ai.txt / Spawning DNT if you want a training-consent signal that at least some big labs honor.
  • llms.txt only if you publish docs or an API that AI coding assistants consume — otherwise skip it.
  • CDN/WAF rules if you need real control — that's the only layer that actually enforces.

On the collection side: if you operate on the data-gathering end, the same logic runs in reverse — a declared bot gets filtered, a real visitor on a residential/mobile IP doesn't. Want to see how a given site treats AI bots? Run our free AI Crawler Checker.

Text files express intent — IPs decide outcomes

When you need to collect public data as a real visitor, real residential/mobile carrier IPs are the difference between answered and blocked. Dedicated 4G/5G across 20+ countries.