Coronium Mobile Proxies
Web Scraping & AI · Pillar · May 2026 · 14-min read

The Closing Web in 2026: How AI Crawler Blocking and Pay-Per-Crawl Changed Web Scraping

The open-data era is ending. Cloudflare blocks AI crawlers by default and charges them at a 402 paywall, millions of sites have opted out of AI training, and the courts are redrawing the lines. Here's the researched, no-hype map of the data wars — and why real residential and mobile IPs are how legitimate public-data collection survives.

Coronium Technical Team
Published May 26, 2026
Verified 2026-05-26

2.5M+ · Sites blocking AI training
402 · Pay-Per-Crawl paywall
+300% · AI bot traffic, Jan 2025 to Mar 2026
18.7% · Sites blocking GPTBot

TL;DR

In 2026 the web is closing to declared AI crawlers: Cloudflare blocks them by default and added Pay-Per-Crawl (a 402 paywall), 2.5M+ sites disallow AI training, and robots.txt/llms.txt are weak controls. Meanwhile the courts (hiQ, Meta v Bright Data) keep public, logged-off scraping legal while AI-training cases (Reddit v Perplexity) test the limits. The takeaway: the blocks target bots that announce themselves from datacenter IPs. Legitimate public-data collection through a real browser on a residential/mobile IP — as a normal visitor, on public pages, without circumventing access barriers — is how compliant collection survives.

From open data to gated data

For two decades the deal was simple: if a page was public, a crawler could read it. The AI boom broke that deal. As models got hungrier, the volume exploded — GPTBot requests rose 147% in a single year, and Meta-ExternalAgent rose 843%. AI-related bot traffic climbed over 300% between January 2025 and March 2026. Publishers noticed their bandwidth and their content being consumed to train competitors — and they started slamming doors.

The result is a structural shift the scraping world is still adjusting to: the web is sorting itself into open pages, blocked pages, and — new in 2026 — paid pages. Understanding which is which, and how the blocking actually works, is now part of any serious data strategy.

Cloudflare flipped the default — and added a paywall

Cloudflare became the first major infrastructure provider to block AI crawlers by default. Every new domain is now asked, up front, whether AI crawlers may scrape it. Over 2.5 million sites had chosen to fully disallow AI training as of August 2025, and the trend hardened: sites moved from "partially disallowed" to flat-out "fully disallowed" for GPTBot, CCBot and Google-Extended.

Block

One-click AI-bot blocking; default-deny on new domains.

Charge (402)

Pay-Per-Crawl serves a 402 "Payment Required"; publishers set rates, and AI companies choose whether to pay.

Allow

Opt-in, with analytics for granular per-crawler control.

Early Pay-Per-Crawl testing on Stack Overflow's public dataset reportedly cut unauthorized bot traffic ~32% and lifted data-licensing revenue ~27%. Cloudflare even shipped a /crawl endpoint (March 10, 2026) in its Browser Rendering service — becoming, ironically, a scraping provider itself. We break the economics down in Cloudflare Pay-Per-Crawl: why mobile proxies are now essential.

robots.txt vs llms.txt: what actually controls AI crawlers

There's a lot of confusion here, so be precise:

robots.txt — a signal, not a contract

It requests behavior; it cannot enforce it. GPTBot and OAI-SearchBot respect it; some crawlers ignore it. For roughly half of AI traffic in 2026, robots.txt is a signal that gets ignored, which is why enforcement is moving to the WAF/network layer.
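To make "a signal, not a contract" concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` agent name, and the example.com URLs are assumptions for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, written for this example only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network call is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ExampleBot"):
        print(agent, allowed(agent, "https://example.com/articles/1"))
```

Nothing in this check is enforced by the server: a crawler that never calls it is not stopped by the file, so the check only helps when the crawler chooses to honor the result.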

llms.txt — a menu, not a lock

A Markdown file describing what your site IS so AI models can navigate it. It cannot restrict any crawler. As of Q1 2026 no major AI company (OpenAI, Google, Anthropic, Meta, Mistral) reads it in production — adoption has effectively flatlined.

WAF / network blocking — the real control

Bot categorization + WAF rules (Cloudflare, DataDome, Akamai) are what actually stop or charge crawlers, by user-agent and IP range. This is the layer that issues the 402 or the 403.

The practical lesson: a declared AI bot from a known datacenter range is the easiest thing in the world to block or bill. How you present at the network layer matters more than any text file.
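The open, paid, and blocked outcomes described above surface to a client as HTTP status codes, so a collector can sort responses accordingly. The sketch below is one possible mapping, not a definitive one: the `Access` enum, `classify`, and `should_continue` are hypothetical names, and treating every 403 as a network-layer block is a simplifying assumption, since a 403 can have other causes.

```python
from enum import Enum


class Access(Enum):
    OPEN = "open"        # 2xx: page served normally
    PAID = "paid"        # 402: Pay-Per-Crawl style paywall
    BLOCKED = "blocked"  # 403: assumed here to be a WAF/network refusal
    OTHER = "other"      # anything else (redirects, 404, 5xx, ...)


def classify(status: int) -> Access:
    """Map an HTTP status code to one of the access categories."""
    if 200 <= status < 300:
        return Access.OPEN
    if status == 402:
        return Access.PAID
    if status == 403:
        return Access.BLOCKED
    return Access.OTHER


def should_continue(status: int) -> bool:
    """Hypothetical policy: proceed only on open pages.

    A 402 or 403 is treated as a stop signal, not something to retry around.
    """
    return classify(status) is Access.OPEN


if __name__ == "__main__":
    for code in (200, 402, 403, 404):
        print(code, classify(code).value, should_continue(code))
```

Keeping the stop decision in one place makes it easy to log why a URL was skipped and to review the policy later.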

What still works for legitimate public-data collection

Put the blocking and the law together and a clear, compliant path emerges. The blocks and the 402 paywall target bots that announce themselves — GPTBot, CCBot, Meta-ExternalAgent — by user-agent and known datacenter IP ranges. Collecting public pages through a real browser on a real residential/mobile IP simply isn't that.

  • Collect only public, logged-off data — stay on the hiQ / Meta-v-Bright-Data side of the line.
  • Don't circumvent access barriers — that's the DMCA §1201 risk; rate-limit respectfully.
  • Present as a normal visitor — a real browser stack on a residential or mobile carrier IP, not a declared AI bot from a cloud range.
  • Match the whole stack — the IP is necessary but not sufficient; see how websites detect proxies in 2026.
  • Document everything — sources, ToS checks, rate limits. Compliance is now part of the workflow.
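Two of the points above, respectful rate limiting and documenting everything, lend themselves to a small helper. What follows is a minimal sketch under stated assumptions: the `PoliteFetchLog` class, its field names, and the 2-second default interval are all hypothetical, and the right delay depends on the site. The clock and sleep functions are injectable so the pacing logic can be tested without real waiting.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PoliteFetchLog:
    """Per-host pacing plus an audit trail of what was requested."""

    min_interval_s: float = 2.0  # assumed default; tune per site
    clock: Callable[[], float] = time.monotonic
    sleep: Callable[[float], None] = time.sleep
    _last: Dict[str, float] = field(default_factory=dict)
    records: List[dict] = field(default_factory=list)

    def wait_turn(self, host: str) -> float:
        """Sleep until min_interval_s has passed since the last request
        to host. Returns the number of seconds slept."""
        waited = 0.0
        last = self._last.get(host)
        if last is not None:
            remaining = self.min_interval_s - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self._last[host] = self.clock()
        return waited

    def record(self, url: str, status: int, tos_checked: bool) -> None:
        """Append one audit entry: source URL, result, and ToS check flag."""
        self.records.append(
            {"url": url, "status": status, "tos_checked": tos_checked}
        )


if __name__ == "__main__":
    log = PoliteFetchLog(min_interval_s=1.0)
    log.wait_turn("example.com")
    log.record("https://example.com/page", 200, tos_checked=True)
    print(log.records)
```

Calling `wait_turn` before each request and `record` after it yields a per-host delay and a reviewable list of sources, statuses, and ToS checks.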

The Coronium angle: when the AI-bot front door gets a 402 or a 403, real 4G/5G carrier IPs let you collect public data as a genuine mobile user — the highest-trust network identity, across 20+ countries, with the egress under your control.

Collect public data the legitimate way

When the AI-bot door gets a 402, real residential/mobile carrier IPs let you reach public pages as a genuine visitor. Dedicated 4G/5G across 20+ countries.