Three files, three jobs, and an enormous amount of confusion. One controls access, one is a content map, one is a training opt-out — and none of them is a lock. Here's the researched, no-hype breakdown of what each does, what it can't, and where real enforcement actually happens.
robots.txt controls crawler access — GPTBot & Google-Extended honor it, but it's a request, not a wall. llms.txt is a Markdown content map; ~10% adoption across 300K domains, Google said no on the record, almost no crawler reads it — its real value is feeding AI coding assistants clean docs. ai.txt is a training opt-out tied to Spawning's Do-Not-Train registry (honored by Stability, Hugging Face). None is enforced — actual control lives at the WAF / network layer.
The oldest of the three (the Robots Exclusion Protocol dates to 1994) tells crawlers which paths they may fetch. In the AI era it grew new user-agents. The honest ones obey it:
# Block OpenAI's training + search crawlers
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Opt out of Google Gemini/Vertex training,
# while KEEPING normal Search indexing
User-agent: Google-Extended
Disallow: /
# Common Crawl (feeds many AI datasets)
User-agent: CCBot
Disallow: /
GPTBot and OAI-SearchBot honor robots.txt, and OpenAI even recommends robots.txt (not llms.txt) as the way to control its crawlers. Google-Extended is the useful nuance: it opts you out of Gemini/Vertex AI training without hurting Search rankings. But robots.txt is a signal, a polite request that a non-compliant bot can ignore. That gap is why the rest of this article exists.
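To see what a compliant crawler would conclude from rules like these, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `is_allowed` helper and the `example.com` URL are illustrative assumptions, not part of any crawler's real code. It only models a bot that chooses to obey the file.

```python
from urllib.robotparser import RobotFileParser

# The same rule groups as the robots.txt sample (comments omitted).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: would a rule-following bot fetch this URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network call
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/pricing"  # placeholder URL
for agent in ("GPTBot", "Google-Extended", "CCBot", "Googlebot"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, URL) else "disallowed"
    print(f"{agent}: {verdict}")
```

Googlebot comes back allowed because no group names it and there is no `*` fallback. That is the "keep normal Search indexing" behavior the sample aims for.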
Proposed in 2024, llms.txt is a Markdown file at your root: an H1 with your site name, a blockquote summary, and curated links to your most important pages. The idea: hand an AI model a clean, distilled map so it summarizes and cites you accurately instead of guessing from messy HTML. Crucially, it restricts nothing — it's a menu, not a lock.
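The structure is simple enough to generate from a list of pages. The sketch below assumes a hypothetical `build_llms_txt` helper and placeholder `example.com` links. Grouping links under H2 headings follows the public llms.txt proposal as we understand it, so treat the exact layout as an assumption, not a validated spec implementation.

```python
def build_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Hypothetical helper: render an llms.txt-style Markdown document.

    `sections` maps a heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


doc = build_llms_txt(
    "Example Docs",  # placeholder site name
    "API reference and quickstart guides for the Example platform.",
    {
        "Docs": [
            ("Quickstart", "https://example.com/docs/quickstart.md", "first API call"),
            ("Authentication", "https://example.com/docs/auth.md", "keys and tokens"),
        ],
    },
)
print(doc)
```

Nothing in the output restricts anything: it is only a list of links, which is why the file works as a menu and not as a lock.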
So is it useless? No — it's just misunderstood. Its real, working use case is developer documentation: Stripe, Vercel, Cloudflare, Anthropic, Coinbase, Pinecone and Cursor ship llms.txt because their users build with AI coding assistants right now, and a curated file is the difference between Cursor or Claude Code generating a working integration and hallucinating one. If you don't have docs or an API that assistants consume, llms.txt is low-ROI SEO theater.
ai.txt is a proposed root-level file that allows or denies AI developers the right to use a domain's text and media for training — a consent layer distinct from access (robots.txt) and discovery (llms.txt). It's most closely associated with Spawning.ai's Do-Not-Train (DNT) registry, which provides machine-readable opt-out tooling for content owners.
The meaningful detail: major AI developers including Stability AI and Hugging Face have agreed to honor Spawning's DNT opt-outs. But like robots.txt, ai.txt is a consent signal: it works only when the crawler chooses to respect it, and it is not technically enforced. It expresses a wish; it doesn't impose one.
All three files share a weakness: they depend on the crawler's goodwill. The only layer with teeth is the WAF / CDN. Bot categorization plus WAF rules (Cloudflare, DataDome, Akamai) identify crawlers by user-agent and IP range, then do one of three things:
The text files express intent; the WAF imposes it. This is the central thesis of The Closing Web in 2026 — and the reason a declared AI bot from a datacenter range is the easiest thing on the internet to block or bill, while a real visitor on a residential/mobile IP is not.
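The identification step (user-agent plus IP range) can be sketched in a few lines. Everything here is an assumption for illustration: the `classify` function, the category names, and the CIDR blocks, which are reserved documentation ranges and not any vendor's real crawler IPs. Real WAFs use far richer signals, and what happens after classification is a policy decision this sketch leaves out.

```python
import ipaddress

# Hypothetical mapping of declared bot tokens to IP ranges.
# These CIDRs are reserved documentation ranges (RFC 5737), not real crawler IPs.
DECLARED_BOTS = {
    "GPTBot": ["192.0.2.0/24"],
    "CCBot": ["198.51.100.0/24"],
}


def classify(user_agent: str, remote_ip: str) -> str:
    """Hypothetical classifier: match the UA token, then check the source IP."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for token, cidrs in DECLARED_BOTS.items():
        if token.lower() in ua:
            in_range = any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
            # A bot name from an unexpected IP suggests a spoofed user-agent.
            return "verified_bot" if in_range else "spoofed_bot"
    return "unclassified"


print(classify("Mozilla/5.0 (compatible; GPTBot/1.0)", "192.0.2.10"))
print(classify("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.5"))
print(classify("Mozilla/5.0 (Windows NT 10.0; Win64; x64)", "203.0.113.5"))
```

The sketch shows why declared bots are easy targets: they announce a name and arrive from published ranges, so two cheap checks are enough to sort them.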
On the collection side: if you operate on the data-gathering end, the same logic runs in reverse — a declared bot gets filtered, a real visitor on a residential/mobile IP doesn't. Want to see how a given site treats AI bots? Run our free AI Crawler Checker.