Coronium Mobile Proxies
Web Scraping & AI · Legal · May 2026 · 12-min read

Is Web Scraping Legal in 2026? hiQ, Meta v Bright Data, Reddit v Perplexity and the New Rules

The short answer: scraping public, logged-off data is generally legal in the US — but the AI-training frontier is being fought in court right now. Here's the researched, plain-English map of the cases, the risk lines, and the compliant way to collect public data.

Coronium Technical Team
Published May 27, 2026
Verified 2026-05-27

This is general information, not legal advice. Web scraping law varies by jurisdiction, data type, and facts. Consult a qualified attorney before relying on anything here.

TL;DR

  • Public, logged-off data: generally legal in the US (hiQ v. LinkedIn; Meta v. Bright Data).
  • Logged-in data, or data behind a ToS you accepted: contract risk.
  • Circumventing anti-bot measures: DMCA §1201 risk (the theory in Reddit v. Perplexity, still pending).
  • Personal data: GDPR / DPDP applies even if the data is "public."

The safe path is collecting public, non-personal pages as a normal visitor on a real residential/mobile IP, without defeating access barriers, and documenting everything.

The four laws that actually matter

"Is scraping legal" is the wrong question — it's really four separate questions under four different bodies of law. Sort your use case into these buckets and the risk picture gets clear fast.

CFAA (computer access)

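The federal anti-hacking statute (Computer Fraud and Abuse Act). After hiQ, accessing pages that sit behind no authentication gate is unlikely to count as access "without authorization," at least under the Ninth Circuit's reading. This is why public scraping survived.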

Contract / Terms of Service

Breaching a ToS is a contract matter, not a crime. It mostly bites when you logged in and agreed to the terms first.

DMCA §1201 (anti-circumvention)

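Circumventing a "technological protection measure" that controls access to a copyrighted work can be a violation independent of whether the data was public. Plaintiffs now argue that rate limits, anti-bot systems, and login walls qualify as such measures. This is the new battleground.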

Privacy law (GDPR, DPDP, state laws)

Governs the data, not the access. Personal data is regulated even when it's publicly visible.
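
The four buckets above can be sketched as a small triage function. This is an illustrative sketch only, not a legal test: the `UseCase` fields and the `applicable_law` function are hypothetical names invented for this example, and the mapping simply mirrors the summaries in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """Hypothetical description of a scraping project (all fields are assumptions)."""
    behind_auth_gate: bool    # pages require a login to view
    accepted_tos: bool        # you logged in or otherwise agreed to the site's terms
    defeats_protection: bool  # you bypass rate limits, anti-bot systems, or login walls
    personal_data: bool       # names, emails, photos, or other data about people


def applicable_law(use_case: UseCase) -> list[str]:
    """Return the bodies of law worth reviewing, in the order this section lists them."""
    buckets = []
    if use_case.behind_auth_gate:
        buckets.append("CFAA")
    if use_case.accepted_tos:
        buckets.append("Contract / ToS")
    if use_case.defeats_protection:
        buckets.append("DMCA §1201")
    if use_case.personal_data:
        buckets.append("Privacy law")
    return buckets
```

For example, `applicable_law(UseCase(False, False, False, True))` returns `["Privacy law"]`: the access raises no flag in this sketch, but the data does. An empty list means only that none of the four flags was set, not that a project has been cleared.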

The cases that built the public-data rule

hiQ Labs v. LinkedIn (9th Cir.)

hiQ scraped public LinkedIn profiles. The Ninth Circuit held that the CFAA's "without authorization" concept likely applies to authentication-gated systems, not to pages open to the general public. On that reading, scraping public, logged-off data is not a CFAA violation. (hiQ separately lost on a ToS/contract theory; the two are different claims.)

Meta v. Bright Data (N.D. Cal., Jan 2024)

The court ruled that Meta's ToS bar scraping by logged-in users but do not govern logged-off scraping of public data. Meta dropped the case weeks later. The ruling reinforced hiQ on the contract side and gave the data-collection industry a clearer green light for public pages.

The AI-training frontier: where it gets unsettled

The public-data cases predate the AI gold rush. The new wave of litigation reframes the question from "was it public?" to "did you defeat a protection to get it, and what did you do with it?"

Reddit v. Perplexity (2025, pending)

Reddit's central claim is DMCA §1201 — alleging defendants circumvented rate limits and anti-bot systems to scrape content for AI. The shift from CFAA to §1201 is the story: it targets the circumvention, not the publicness of the data.

YouTube creators v. Nvidia, Snap, Meta

Creators have sued over alleged scraping of YouTube videos to train AI models, on similar circumvention and IP theories. Outcomes pending — but the volume of suits signals the AI-training use of scraped data is the hot legal zone.

Connect this to the infrastructure side in The Closing Web in 2026: the same anti-bot measures now central to §1201 claims are what Cloudflare's default-block and Pay-Per-Crawl enforce at the network layer.

"Public" does not mean "unregulated"

Even when scraping is permitted, privacy law governs the data itself. A name, email, or photo on a public page is still personal data:

  • GDPR (EU): can apply to the personal data of people in the EU even when you operate elsewhere; requires a lawful basis and data minimization.
  • India DPDP Act: similar consent/notice obligations for personal data.
  • US state laws (CCPA/CPRA and others): rights over personal data, including some publicly available data.

Lowest-risk path: scrape aggregate, non-personal, public information. The moment you touch personal data, a second layer of law applies regardless of how legal the access was.
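
One way to lean toward non-personal data in practice is to strip obvious identifiers before anything is stored. The sketch below redacts email addresses from collected text. It is a minimal illustration under stated assumptions: `redact_emails` and the `[EMAIL]` placeholder are hypothetical names, the pattern is a simplified one that will miss some address formats, and redaction alone does not establish a lawful basis.

```python
import re

# Simplified email pattern (an assumption for illustration; not RFC-complete).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> tuple[str, int]:
    """Replace email addresses with a placeholder and report how many were found."""
    redacted, count = EMAIL_RE.subn("[EMAIL]", text)
    return redacted, count
```

A nonzero count is also a useful signal on its own: it indicates the page contains personal data, so the privacy-law layer described above is in play for that source.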

The compliant playbook

Do

  • Collect public, logged-off data only
  • Rate-limit respectfully; don't degrade the site
  • Prefer aggregate, non-personal data
  • Document sources, ToS review & methods
  • Present as a normal visitor on a real IP

Don't

  • Circumvent anti-bot / rate-limit systems (§1201)
  • Scrape behind a login whose ToS you accepted
  • Harvest personal data without a lawful basis
  • Hammer a site so hard it causes harm
  • Assume "public" means "no privacy law"
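
Two of the "Do" items, respectful rate limiting and documentation, can be combined in one small wrapper. The following is a sketch under stated assumptions, not a compliance tool: `DocumentedCollector`, its parameters, and the audit-record fields are hypothetical names chosen for this example, and the `fetch` callable is supplied by the caller (for instance a thin wrapper around `urllib.request.urlopen`).

```python
import time
from datetime import datetime, timezone
from typing import Callable


class DocumentedCollector:
    """Spaces out requests and records what was collected, when, and why."""

    def __init__(
        self,
        fetch: Callable[[str], str],           # caller-supplied function: URL -> page body
        min_interval_s: float,                 # minimum gap between requests, in seconds
        sleep: Callable[[float], None] = time.sleep,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._min_interval_s = min_interval_s
        self._sleep = sleep
        self._clock = clock
        self._last_request_at = None
        self.audit_log: list[dict] = []

    def collect(self, url: str, purpose: str, tos_reviewed_on: str) -> str:
        # Wait until at least min_interval_s has passed since the previous request.
        if self._last_request_at is not None:
            wait = self._min_interval_s - (self._clock() - self._last_request_at)
            if wait > 0:
                self._sleep(wait)
        self._last_request_at = self._clock()

        body = self._fetch(url)

        # Record the source, the stated purpose, and the date the ToS was reviewed.
        self.audit_log.append({
            "url": url,
            "purpose": purpose,
            "tos_reviewed_on": tos_reviewed_on,
            "fetched_at": datetime.now(timezone.utc).isoformat(),
            "bytes": len(body),
        })
        return body
```

The `sleep` and `clock` parameters are injectable so the pacing can be checked without real delays. A fixed interval is the simplest possible policy; what counts as "respectful" depends on the site, so the value is left to the caller.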

The infrastructure angle: collecting public pages through a real browser on a residential or mobile carrier IP keeps you on the public-data side of the line — a normal visitor, not a declared bot defeating barriers. Pair it with the detection realities in how websites detect proxies in 2026.

Collect public data the legitimate way

Real residential/mobile carrier IPs let you reach public pages as a genuine visitor — without defeating access barriers. Dedicated 4G/5G across 20+ countries.