The short answer: scraping public, logged-off data is generally legal in the US — but the AI-training frontier is being fought in court right now. Here's the researched, plain-English map of the cases, the risk lines, and the compliant way to collect public data.
This is general information, not legal advice. Web scraping law varies by jurisdiction, data type, and facts. Consult a qualified attorney before relying on anything here.
Public, logged-off data: generally legal (hiQ v LinkedIn; Meta v Bright Data). Logged-in / behind a ToS you accepted: contract risk. Circumventing anti-bot measures: DMCA §1201 risk (Reddit v Perplexity). Personal data: GDPR / DPDP applies even if "public." The safe path is collecting public, non-personal pages as a normal visitor on a real residential/mobile IP, without defeating access barriers, with everything documented.
"Is scraping legal" is the wrong question — it's really four separate questions under four different bodies of law. Sort your use case into these buckets and the risk picture gets clear fast.
CFAA (Computer Fraud and Abuse Act): the "hacking" statute. Post-hiQ, accessing public data that sits behind no authentication gate is not "unauthorized access." This is why public scraping survived.
Contract (Terms of Service): breaching a ToS is a contract matter, not a crime. It mostly bites when you logged in and agreed to the terms first.
DMCA §1201: defeating a "technological protection measure" (rate limits, anti-bot systems, login walls) can be a violation regardless of whether the data was public. This is the new battleground.
Privacy law (GDPR / DPDP): governs the data, not the access. Personal data is regulated even when it is publicly visible.
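The four-bucket sort above can be sketched as a small triage helper. This is a minimal illustration, not a legal test: the `UseCase` fields and the `risk_buckets` function are hypothetical names invented here, and the mapping from each flag to a body of law only mirrors the rough rules of thumb in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """Hypothetical description of a scraping job (field names are assumptions)."""
    requires_login: bool          # data sits behind an authentication gate
    accepted_tos: bool            # you made an account or clicked through terms
    defeats_protection: bool      # bypasses rate limits, anti-bot systems, login walls
    contains_personal_data: bool  # names, emails, photos, and similar


def risk_buckets(case: UseCase) -> list[str]:
    """Return the bodies of law a use case most plausibly touches."""
    buckets = []
    if case.requires_login:
        buckets.append("CFAA")
    if case.accepted_tos:
        buckets.append("contract")
    if case.defeats_protection:
        buckets.append("DMCA 1201")
    if case.contains_personal_data:
        buckets.append("privacy")
    return buckets


public_catalog = UseCase(False, False, False, False)
logged_in_profiles = UseCase(True, True, False, True)
print(risk_buckets(public_catalog))      # []
print(risk_buckets(logged_in_profiles))  # ['CFAA', 'contract', 'privacy']
```

An empty list does not mean "safe"; it only means none of the four flags in this sketch were raised. Jurisdiction and facts still matter.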
hiQ scraped public LinkedIn profiles. The Ninth Circuit concluded that the CFAA's "without authorization" language most likely applies to authentication-gated systems, not public pages. On that reading, scraping public, logged-off data is not a CFAA violation. (hiQ separately lost on a ToS/contract theory; the two are different claims.)
The court ruled that Meta's ToS bar scraping by logged-in users but do not govern logged-off scraping of public data. Meta dropped the case weeks later. The ruling reinforced hiQ and gave the data-collection industry a clearer green light for public pages.
The public-data cases predate the AI gold rush. The new wave of litigation reframes the question from "was it public?" to "did you defeat a protection to get it, and what did you do with it?"
Reddit's central claim is DMCA §1201 — alleging defendants circumvented rate limits and anti-bot systems to scrape content for AI. The shift from CFAA to §1201 is the story: it targets the circumvention, not the publicness of the data.
Creators have sued over alleged scraping of YouTube videos to train AI models, on similar circumvention and IP theories. Outcomes are pending, but the volume of suits signals that AI training on scraped data is now the hot legal zone.
Connect this to the infrastructure side in The Closing Web in 2026: the same anti-bot measures now central to §1201 claims are what Cloudflare's default-block and Pay-Per-Crawl enforce at the network layer.
Even when scraping is permitted, privacy law governs the data itself. A name, email, or photo on a public page is still personal data, and regimes such as GDPR and DPDP can apply to it even though it is publicly visible.
Lowest-risk path: scrape aggregate, non-personal, public information. The moment you touch personal data, a second layer of law applies regardless of how legal the access was.
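One way to act on the "non-personal" rule is to screen records before they are stored. The sketch below is an assumption-laden illustration: the two regular expressions are deliberately crude, the helper names `looks_personal` and `split_records` are invented here, and a pattern match is not a legal definition of personal data. It will miss names and photos, and it can over-flag long digit runs such as timestamps.

```python
import re

# Deliberately simple patterns (assumptions, not a complete PII detector).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def looks_personal(text: str) -> bool:
    """Return True if the text contains something shaped like an email or phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def split_records(records):
    """Separate records into (clean, flagged) so flagged ones can be reviewed or dropped."""
    clean, flagged = [], []
    for record in records:
        (flagged if looks_personal(record) else clean).append(record)
    return clean, flagged


clean, flagged = split_records([
    "Price: 19.99 USD, 42 in stock",
    "Contact: jane.doe@example.com",
])
print(len(clean), len(flagged))  # 1 1
```

Flagged records are kept in a separate list rather than silently discarded, so the decision to drop, redact, or seek a legal basis stays with a person.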
The infrastructure angle: collecting public pages through a real browser on a residential or mobile carrier IP keeps you on the public-data side of the line — a normal visitor, not a declared bot defeating barriers. Pair it with the detection realities in how websites detect proxies in 2026.
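"Not defeating barriers" and "everything documented" can be made concrete in the collection loop itself. The following is a minimal sketch under stated assumptions: `fetch` and `log` are caller-supplied callables (hypothetical, so no particular HTTP library is implied), and treating 401, 403, and 429 as "the site is saying no" is a choice made for this example, not a legal standard.

```python
import json
import time
from datetime import datetime, timezone

# Status codes treated as access barriers in this sketch (an assumption).
BARRIER_STATUSES = {401, 403, 429}


def is_access_barrier(status: int) -> bool:
    return status in BARRIER_STATUSES


def audit_record(url: str, status: int, action: str) -> str:
    """Build one JSON audit line recording what was requested and what was done."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "status": status,
        "action": action,
    })


def collect(urls, fetch, log, delay_s=2.0):
    """Fetch pages one at a time; fetch(url) -> (status, body), log(line) stores a line."""
    pages = {}
    for url in urls:
        status, body = fetch(url)
        if is_access_barrier(status):
            log(audit_record(url, status, "stopped"))
            break  # stop instead of retrying, rotating, or working around the block
        if status == 200:
            pages[url] = body
            log(audit_record(url, status, "saved"))
        else:
            log(audit_record(url, status, "skipped"))
        time.sleep(delay_s)  # fixed pause between requests
    return pages
```

The design choice worth noting is the `break`: when the site signals a barrier, the loop ends and the refusal is written to the audit trail, rather than being treated as an obstacle to route around.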