The open-data era is ending. Cloudflare blocks AI crawlers by default and charges them at a 402 paywall, millions of sites have opted out of AI training, and the courts are redrawing the lines. Here's the researched, no-hype map of the data wars — and why real residential and mobile IPs are how legitimate public-data collection survives.
In 2026 the web is closing to declared AI crawlers: Cloudflare blocks them by default and added Pay-Per-Crawl (a 402 paywall), 2.5M+ sites disallow AI training, and robots.txt/llms.txt are weak controls. Meanwhile the courts (hiQ v. LinkedIn, Meta v. Bright Data) keep public, logged-off scraping legal while AI-training cases (Reddit v. Perplexity) test the limits. The takeaway: the blocks target bots that announce themselves from datacenter IPs. Legitimate public-data collection through a real browser on a residential/mobile IP (as a normal visitor, on public pages, without circumventing access barriers) is how compliant collection survives.
For two decades the deal was simple: if a page was public, a crawler could read it. The AI boom broke that deal. As models got hungrier, the volume exploded — GPTBot requests rose 147% in a single year, and Meta-ExternalAgent rose 843%. AI-related bot traffic climbed over 300% between January 2025 and March 2026. Publishers noticed their bandwidth and their content being consumed to train competitors — and they started slamming doors.
The result is a structural shift the scraping world is still adjusting to: the web is sorting itself into open pages, blocked pages, and — new in 2026 — paid pages. Understanding which is which, and how the blocking actually works, is now part of any serious data strategy.
Cloudflare became the first major infrastructure provider to block AI crawlers by default. Every new domain is now asked, up front, whether AI crawlers may scrape it. Over 2.5 million sites had chosen to fully disallow AI training as of August 2025, and the trend hardened: sites moved from "partially disallowed" to flat-out "fully disallowed" for GPTBot, CCBot and Google-Extended.
One-click AI-bot blocking; default-deny on new domains.
Pay-Per-Crawl serves a 402 "Payment Required"; publishers set rates, and AI companies choose whether to pay.
Opt-in, with analytics for granular per-crawler control.
Early Pay-Per-Crawl testing on Stack Overflow's public dataset reportedly cut unauthorized bot traffic ~32% and lifted data-licensing revenue ~27%. Cloudflare even shipped a /crawl endpoint (March 10, 2026) in its Browser Rendering service — becoming, ironically, a scraping provider itself. We break the economics down in Cloudflare Pay-Per-Crawl: why mobile proxies are now essential.
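To make the open / paid / blocked split concrete, here is a minimal Python sketch that sorts a response by its HTTP status code. The `PageAccess` enum and `classify_access` function are illustrative names, not part of any Cloudflare API, and the mapping assumes the simple case described above: 402 for a paywalled page and 403 for a blocked one.

```python
from enum import Enum
from http import HTTPStatus


class PageAccess(Enum):
    """Hypothetical labels for the three kinds of page discussed above."""
    OPEN = "open"
    PAID = "paid"
    BLOCKED = "blocked"
    OTHER = "other"


def classify_access(status_code: int) -> PageAccess:
    """Map an HTTP status code onto the open / paid / blocked split."""
    if 200 <= status_code < 300:
        return PageAccess.OPEN
    if status_code == HTTPStatus.PAYMENT_REQUIRED:  # 402
        return PageAccess.PAID
    if status_code == HTTPStatus.FORBIDDEN:  # 403
        return PageAccess.BLOCKED
    # Redirects, rate limits, server errors and so on are out of scope here.
    return PageAccess.OTHER


if __name__ == "__main__":
    for code in (200, 402, 403):
        print(code, classify_access(code).value)
```

A collector that records this label per URL can tell which pages need a licensing decision (402) rather than treating every refusal as the same failure.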
There's a lot of confusion here, so be precise:
robots.txt: It requests behavior; it does not enforce it. GPTBot and OAI-SearchBot respect it; some crawlers ignore it. For roughly half of AI traffic in 2026, robots.txt is a signal that gets ignored, which is why enforcement is moving to the WAF/network layer.
llms.txt: A Markdown file describing what your site is, so AI models can navigate it. It cannot restrict any crawler. As of Q1 2026 no major AI company (OpenAI, Google, Anthropic, Meta, Mistral) reads it in production; adoption has effectively flatlined.
WAF/network layer: Bot categorization plus WAF rules (Cloudflare, DataDome, Akamai) are what actually stop or charge crawlers, by user-agent and IP range. This is the layer that issues the 402 or the 403.
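Because robots.txt only requests behavior, honoring it is entirely up to the crawler. A cooperative client can check it with Python's standard-library `urllib.robotparser`, as in the sketch below. The robots.txt content, the `is_allowed` helper, the `ExampleBot` name and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is disallowed everywhere, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url.

    This only reports what the site requests; nothing here enforces it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("ExampleBot allowed:", is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

The check runs on the crawler's side, so a crawler that skips it sees no difference. That is the gap the network layer closes.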
The practical lesson: a declared AI bot from a known datacenter range is the easiest thing in the world to block or bill. How you present at the network layer matters more than any text file.
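As a rough illustration of why a declared bot is so easy to block or bill, the toy rule below matches on user-agent and source IP range and answers with a 402 or a 403. It is a simplified sketch, not how Cloudflare, DataDome or Akamai actually implement bot management, which uses many more signals. The `decide` function and the agent list are illustrative, and the networks are reserved documentation ranges (RFC 5737), not real crawler ranges.

```python
import ipaddress

# Illustrative values only. The agent tokens come from the crawlers named in
# this article; the networks are RFC 5737 documentation ranges standing in
# for "known datacenter ranges".
DECLARED_AI_AGENTS = ("gptbot", "ccbot", "meta-externalagent")
DATACENTER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def decide(user_agent: str, client_ip: str, paid_crawl_enabled: bool) -> int:
    """Return the HTTP status a toy edge rule would answer with."""
    ua = user_agent.lower()
    declared = any(token in ua for token in DECLARED_AI_AGENTS)
    ip = ipaddress.ip_address(client_ip)
    from_datacenter = any(ip in network for network in DATACENTER_NETWORKS)

    if declared or from_datacenter:
        # Charge if the publisher opted into paid crawling, otherwise block.
        return 402 if paid_crawl_enabled else 403
    return 200


if __name__ == "__main__":
    print(decide("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.7", True))   # 402
    print(decide("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.7", False))  # 403
    print(decide("Mozilla/5.0", "192.0.2.10", False))                            # 403
```

Both checks are single string or range lookups on data the request itself supplies, which is why this layer is cheap for a publisher to operate.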
The blocking story runs parallel to a legal one. Two threads matter:
hiQ v. LinkedIn (9th Cir.): scraping data accessible without authentication isn't "unauthorized access" under the CFAA. Meta v. Bright Data (Jan 2024): Meta's ToS only bar logged-in scraping, not logged-off scraping of public content — Meta dropped the suit weeks later.
Reddit v. Perplexity (late 2025) invokes DMCA §1201, alleging circumvention of rate limits and anti-bot systems; the case is pending. YouTube creators sued Nvidia, then Snap and Meta, on similar §1201 theories. Privacy laws (EU, India's DPDP Act) add another layer.
The dividing line emerging from the case law: collecting data that is publicly accessible without authentication, and doing so without circumventing technical barriers, sits on solid ground; bypassing anti-bot measures or scraping behind logins is where the §1201 risk lives. Full breakdown in our dedicated piece: is web scraping legal in 2026 (coming in this cluster).
Put the blocking and the law together and a clear, compliant path emerges. The blocks and the 402 paywall target bots that announce themselves — GPTBot, CCBot, Meta-ExternalAgent — by user-agent and known datacenter IP ranges. Collecting public pages through a real browser on a real residential/mobile IP simply isn't that.
The Coronium angle: when the AI-bot front door gets a 402 or a 403, real 4G/5G carrier IPs let you collect public data as a genuine mobile user — the highest-trust network identity, across 20+ countries, with the egress under your control.