What is Cloudflare Pay-Per-Crawl?

Pay-Per-Crawl is a Cloudflare marketplace where publishers can charge AI companies each time a page is crawled. Technically it serves a 402 'Payment Required' response to identified AI crawlers instead of just allowing or blocking them — a third option: charging. Publishers set their own rates; AI companies decide whether to pay. Early testing on Stack Overflow's public dataset reportedly cut unauthorized bot traffic ~32% and raised data-licensing revenue ~27%. See our deep-dive: Cloudflare Pay-Per-Crawl and mobile proxies.

Does robots.txt or llms.txt actually stop AI crawlers?

robots.txt is a signal, not a contract — and for roughly half of AI traffic in 2026 it's a signal that gets ignored. GPTBot and OAI-SearchBot respect robots.txt; some other crawlers do not. llms.txt is a different thing entirely: a Markdown summary that tells AI models what your site IS — it cannot restrict any crawler, and as of Q1 2026 no major AI company (OpenAI, Google, Anthropic, Meta, Mistral) reads it in production. Enforcement increasingly happens at the network/WAF layer, not in a text file.

Why does this make residential and mobile IPs more important, not less?

The blocking and the 402 paywall target identified AI crawlers — GPTBot, CCBot, Meta-ExternalAgent — by user-agent and known IP ranges. Legitimate collection of publicly accessible data through a normal browser on a real residential or mobile carrier IP doesn't present as those bots. When the datacenter-IP, declared-AI-bot path gets blocked or charged, the high-trust real-user path is how compliant public-data collection continues. It's the difference between knocking on the front door labeled 'AI bot' and walking in like a normal visitor.

Is scraping public web data still legal in 2026?

Scraping publicly visible data that doesn't require login or bypassing protections is generally legal in the US. hiQ v. LinkedIn (9th Circuit) held that scraping data accessible without authentication isn't 'unauthorized access' under the CFAA. Meta v. Bright Data (Jan 2024) reinforced this — the court ruled Meta's ToS only prohibit logged-in scraping, not logged-off scraping of public content, and Meta dropped the suit. The new frontier is AI training: Reddit sued Perplexity (late 2025) under DMCA §1201 alleging circumvention of anti-bot measures, and YouTube creators have sued Nvidia, Meta and Snap. The rules around AI training data are still being written.

What's the safe way to collect public web data now?

Document your sources, check for AI-specific ToS clauses, rate-limit respectfully, scrape only publicly accessible (logged-off) data, and avoid circumventing technical access barriers (the DMCA §1201 risk). On the infrastructure side, use real residential/mobile IPs and a real browser stack so you collect public pages as a normal visitor rather than as a declared AI training bot. Build compliance into the workflow — the teams that operate safely are the ones who can show how and what they collected.

Web Scraping & AI · Pillar · May 2026 · 14-min read

The Closing Web in 2026: How AI Crawler Blocking and Pay-Per-Crawl Changed Web Scraping

Q: Is the open web really closing to AI crawlers in 2026?

Partly, yes. Cloudflare — which fronts a large share of the web — became the first major infrastructure provider to block AI crawlers by default and now asks every new domain whether to allow them. Over 2.5 million sites had chosen to fully disallow AI training via Cloudflare's managed robots.txt as of August 2025, and that number has grown. About 18.7% of all websites now block GPTBot specifically. The 'free-for-all' era of AI scraping is over for the sites that opt out.

The open-data era is ending. Cloudflare blocks AI crawlers by default and charges them at a 402 paywall, millions of sites have opted out of AI training, and the courts are redrawing the lines. Here's the researched, no-hype map of the data wars — and why real residential and mobile IPs are how legitimate public-data collection survives.

Coronium Technical Team

Published May 26, 2026

Verified 2026-05-26

2.5M+ sites blocking AI training
402: the Pay-Per-Crawl paywall status
+300% AI bot traffic, Jan 2025 to Mar 2026
18.7% of sites blocking GPTBot

TL;DR

In 2026 the web is closing to declared AI crawlers: Cloudflare blocks them by default and added Pay-Per-Crawl (a 402 paywall), 2.5M+ sites disallow AI training, and robots.txt/llms.txt are weak controls. Meanwhile US courts (hiQ v. LinkedIn, Meta v. Bright Data) have kept public, logged-off scraping generally legal, while AI-training cases (Reddit v. Perplexity) test the limits. The takeaway: the blocks target bots that announce themselves from datacenter IPs. Legitimate public-data collection through a real browser on a residential/mobile IP (as a normal visitor, on public pages, without circumventing access barriers) is how compliant collection survives.

From open data to gated data

For two decades the deal was simple: if a page was public, a crawler could read it. The AI boom broke that deal. As models got hungrier, the volume exploded — GPTBot requests rose 147% in a single year, and Meta-ExternalAgent rose 843%. AI-related bot traffic climbed over 300% between January 2025 and March 2026. Publishers noticed their bandwidth and their content being consumed to train competitors — and they started slamming doors.

The result is a structural shift the scraping world is still adjusting to: the web is sorting itself into open pages, blocked pages, and — new in 2026 — paid pages. Understanding which is which, and how the blocking actually works, is now part of any serious data strategy.

Cloudflare flipped the default — and added a paywall

Cloudflare became the first major infrastructure provider to block AI crawlers by default. Every new domain is now asked, up front, whether AI crawlers may scrape it. Over 2.5 million sites had chosen to fully disallow AI training as of August 2025, and the trend hardened: sites moved from "partially disallowed" to "fully disallowed" for GPTBot, CCBot and Google-Extended.

Block

One-click AI-bot blocking; default-deny on new domains.

Charge (402)

Pay-Per-Crawl serves a 402 "Payment Required"; publishers set rates, and AI companies choose whether to pay.

Allow

Opt-in, with analytics for granular per-crawler control.

Early Pay-Per-Crawl testing on Stack Overflow's public dataset reportedly cut unauthorized bot traffic ~32% and lifted data-licensing revenue ~27%. Cloudflare even shipped a /crawl endpoint (March 10, 2026) in its Browser Rendering service — becoming, ironically, a scraping provider itself. We break the economics down in Cloudflare Pay-Per-Crawl: why mobile proxies are now essential.
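
As a rough illustration of the three outcomes above from the crawler's side, the sketch below maps a response to an action. The 200/402/403 split follows the text; the `crawler-price` header name, its `USD 0.01` value shape, the `Decision` type and the `max_price_usd` budget are assumptions made for illustration and should be checked against Cloudflare's own documentation before relying on them.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Decision:
    action: str                        # "read", "pay", or "skip"
    price_usd: Optional[float] = None  # quoted price, if one was parsed


def parse_price(value: str) -> Optional[float]:
    """Parse a value shaped like 'USD 0.01'; return None for anything else."""
    parts = value.split()
    if len(parts) != 2 or parts[0].upper() != "USD":
        return None
    try:
        return float(parts[1])
    except ValueError:
        return None


def decide(status: int, headers: Mapping[str, str], max_price_usd: float) -> Decision:
    """Map an HTTP response to an action for a declared crawler."""
    if status == 200:
        return Decision("read")
    if status == 402:
        # Assumed header name; keys are lowercased so lookup is case-insensitive.
        lowered = {k.lower(): v for k, v in headers.items()}
        price = parse_price(lowered.get("crawler-price", ""))
        if price is not None and price <= max_price_usd:
            return Decision("pay", price)
        return Decision("skip", price)
    # 403 and anything else: the publisher said no, so do not retry around it.
    return Decision("skip")
```

With a budget of 0.05, a 402 quoting `USD 0.01` yields a pay decision, while a 403 or an over-budget quote yields skip.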

robots.txt vs llms.txt: what actually controls AI crawlers

There's a lot of confusion here, so be precise:

robots.txt — a signal, not a contract

It requests behavior. GPTBot and OAI-SearchBot respect it; some crawlers ignore it. For roughly half of AI traffic in 2026, robots.txt is a signal that gets ignored — which is why enforcement is moving to the WAF/network layer.
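
To see what a robots.txt file actually requests, Python's standard `urllib.robotparser` can evaluate it for a given user-agent. The sketch below uses an inline, made-up robots.txt and a placeholder `example.com` URL; it shows only what the file asks for, not whether a given crawler will honor it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: disallow GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A compliant crawler would call a check like `allowed(...)` before every fetch and stop when it returns False; nothing in the file itself can force that.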

llms.txt — a menu, not a lock

A Markdown file describing what your site IS so AI models can navigate it. It cannot restrict any crawler. As of Q1 2026 no major AI company (OpenAI, Google, Anthropic, Meta, Mistral) reads it in production — adoption has effectively flatlined.

WAF / network blocking — the real control

Bot categorization + WAF rules (Cloudflare, DataDome, Akamai) are what actually stop or charge crawlers, by user-agent and IP range. This is the layer that issues the 402 or the 403.

The practical lesson: a declared AI bot from a known datacenter range is the easiest thing in the world to block or bill. How you present at the network layer matters more than any text file.
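
A toy version of the user-agent and IP-range rule described above, written from the publisher's side, might look like the following. It is a simplified sketch: real WAFs combine many more signals, and the agent tokens and IP ranges here are placeholders (the ranges are documentation-reserved blocks, not real crawler networks).

```python
import ipaddress

# Hypothetical values for illustration only.
DECLARED_AI_AGENTS = ("gptbot", "ccbot", "meta-externalagent")
KNOWN_CRAWLER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def classify(user_agent: str, client_ip: str, charge_enabled: bool) -> int:
    """Return the HTTP status this toy rule would serve: 200, 402, or 403."""
    ua = user_agent.lower()
    declared = any(token in ua for token in DECLARED_AI_AGENTS)
    addr = ipaddress.ip_address(client_ip)
    from_known_range = any(addr in net for net in KNOWN_CRAWLER_RANGES)
    if declared or from_known_range:
        # The publisher chooses between charging (402) and blocking (403).
        return 402 if charge_enabled else 403
    return 200
```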

The legal frontier: public data vs AI training

The blocking story runs parallel to a legal one. Two threads matter:

Public, logged-off scraping — generally legal (US)

hiQ v. LinkedIn (9th Cir.): scraping data accessible without authentication isn't "unauthorized access" under the CFAA. Meta v. Bright Data (Jan 2024): Meta's ToS only bar logged-in scraping, not logged-off scraping of public content — Meta dropped the suit weeks later.

AI training data — unsettled, litigated

Reddit v. Perplexity (late 2025) invokes DMCA §1201, alleging circumvention of rate limits and anti-bot systems — pending. YouTube creators sued Nvidia, then Snap and Meta, on similar §1201 theories. Privacy laws (EU, India's DPDP Act) add another layer.

The dividing line emerging from the case law: collecting publicly accessible data without logging in and without circumventing technical barriers sits on solid ground; bypassing anti-bot measures or scraping behind logins is where the §1201 risk lives. Full breakdown in our dedicated piece: is web scraping legal in 2026.

What still works for legitimate public-data collection

Put the blocking and the law together and a clear, compliant path emerges. The blocks and the 402 paywall target bots that announce themselves — GPTBot, CCBot, Meta-ExternalAgent — by user-agent and known datacenter IP ranges. Collecting public pages through a real browser on a real residential/mobile IP simply isn't that.

Collect only public, logged-off data — stay on the hiQ / Meta-v-Bright-Data side of the line.
Don't circumvent access barriers — that's the DMCA §1201 risk; rate-limit respectfully.
Present as a normal visitor — a real browser stack on a residential or mobile carrier IP, not a declared AI bot from a cloud range.
Match the whole stack — the IP is necessary but not sufficient; see how websites detect proxies in 2026.
Document everything — sources, ToS checks, rate limits. Compliance is now part of the workflow.
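
The checklist above can be partly built into tooling. The sketch below, under assumed names (`PoliteScheduler`, `CollectionRecord`, and a per-host minimum interval you would choose yourself), spaces out requests to each host and emits one JSON log line per page collected, so the how and what of a run can be shown later.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Dict


@dataclass
class CollectionRecord:
    url: str
    fetched_at: float      # Unix timestamp of the fetch
    requires_login: bool   # should always be False for public-only collection
    tos_checked: bool      # whether the site's terms were reviewed


class PoliteScheduler:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_interval_s: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}

    def wait(self, host: str) -> float:
        """Sleep if needed; return how many seconds were waited."""
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval_s - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually released.
        self._last[host] = now + delay
        return delay


def log_line(record: CollectionRecord) -> str:
    """Serialize one provenance record as a JSON line."""
    return json.dumps(asdict(record), sort_keys=True)
```

The clock and sleep functions are injectable so the throttle can be tested without real waiting; in production the defaults apply.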

The Coronium angle: when the AI-bot front door gets a 402 or a 403, real 4G/5G carrier IPs let you collect public data as a genuine mobile user — the highest-trust network identity, across 20+ countries, with the egress under your control.

FAQ

Related resources

Is web scraping legal in 2026?

hiQ, Meta v Bright Data, Reddit v Perplexity & DMCA §1201.

The EU AI Act in 2026

Aug-2026 enforcement, GPAI training-data disclosure & the copyright opt-out.

robots.txt vs llms.txt vs ai.txt

What each file does, what it can't, and why the WAF is the real control.

Scraping in the Agentic Era (MCP)

How AI agents collect web data and why the IP layer decides.

Aggregating public government data at scale

Joining 30+ fragmented public sources into one clean dataset — sources, dedup, lineage, GDPR.

AI Crawler Checker (free tool)

See which AI bots a domain allows or blocks in its robots.txt.

Cloudflare Pay-Per-Crawl deep-dive

The 402 paywall economics and why mobile proxies matter.

How websites detect proxies in 2026

The 7-layer detection stack you must pass to look like a real visitor.

AI browser agents need mobile proxies

Why agentic scrapers get blocked on datacenter IPs.

Web scraping proxies

Commercial landing for scraping workloads.

Collect public data the legitimate way

When the AI-bot door gets a 402, real residential/mobile carrier IPs let you reach public pages as a genuine visitor. Dedicated 4G/5G across 20+ countries.