Coronium Mobile Proxies

The EU AI Act in 2026: What August Enforcement Means for AI Training Data and Web Scraping

On 2 August 2026 the EU AI Act stops being a paper deadline and gains real teeth — GPAI enforcement, high-risk obligations, and a mandatory training-data disclosure tied to the EU copyright opt-out. Here's the researched, no-hype guide to the timeline, the penalties, the May-2026 simplification agreement, and what it means for anyone collecting public web data.

Coronium Technical Team
Published May 29, 2026
Verified 2026-05-29

General information, not legal advice. The AI Act is complex and still evolving (see the May-2026 simplification agreement below). Consult qualified EU counsel for your situation.

  • 2 Aug 2026: enforcement teeth
  • €15M or 3% of turnover: GPAI
  • €35M or 7% of turnover: prohibited uses
  • Aug 2027: full application

TL;DR

From 2 Aug 2026, high-risk obligations apply and GPAI enforcement begins (fines up to €15M or 3% of global turnover; prohibited uses up to €35M or 7%). GPAI providers must publish a training-data summary using the Commission template and respect the EU copyright opt-out, which is expressed through machine-readable signals such as robots.txt and ai.txt. A May-2026 agreement may streamline or extend some high-risk timelines, but until it is law, treat August as operative. The Act doesn't ban scraping; it adds transparency and opt-out duties for AI training.

The phased timeline — and why August 2026 matters

The AI Act doesn't switch on all at once. It rolls out in stages, and 2 August 2026 is the stage where the rules that touch AI data the most acquire enforcement power:

2 Feb 2025 — prohibited practices

The banned uses (e.g. social scoring, certain biometric categorization) became prohibited.

2 Aug 2025 — GPAI obligations begin

General-purpose AI model duties (transparency, copyright policy, training-data summary) took effect. The voluntary GPAI Code of Practice was finalized 10 July 2025.

2 Aug 2026 — enforcement + high-risk

High-risk (Annex III) obligations apply, and the AI Office's GPAI enforcement powers (information requests, model evaluation, fines) switch on. This is the date with teeth.

2 Aug 2027 — full application

Remaining obligations, including AI embedded in regulated products, apply. Pre-existing GPAI models must have published their training-data summary by this date.
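The staged rollout above can be expressed as a small date lookup. This is a minimal sketch: the dates and labels are copied from the timeline in this article, while the function names `stages_in_effect` and `next_milestone` are hypothetical helpers, not part of any official tool.

```python
from datetime import date

# Milestones exactly as listed in the timeline above.
MILESTONES = [
    (date(2025, 2, 2), "prohibited practices"),
    (date(2025, 8, 2), "GPAI obligations begin"),
    (date(2026, 8, 2), "enforcement + high-risk"),
    (date(2027, 8, 2), "full application"),
]


def stages_in_effect(on: date) -> list[str]:
    """Return the label of every milestone dated on or before `on`."""
    return [label for start, label in MILESTONES if start <= on]


def next_milestone(on: date):
    """Return the (date, label) of the first milestone after `on`, or None."""
    upcoming = [m for m in MILESTONES if m[0] > on]
    return min(upcoming) if upcoming else None


if __name__ == "__main__":
    today = date(2026, 5, 29)  # publication date of this article
    print(stages_in_effect(today))
    print(next_milestone(today))
```

If the simplification package discussed later changes any date, only the `MILESTONES` table would need updating.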

The training-data disclosure template

The provision most relevant to web data: every GPAI model provider must publish a summary of the content used to train the model, using a mandatory template the European Commission released. The template asks providers to disclose, in a structured way, the types of content, the data sources, and the methods of collection — including large public datasets and scraped web data.

The summary requirement took effect on 2 August 2025 for new models; models already on the market before that date have until 2 August 2027 to publish theirs. The practical effect is that "we scraped the web" is no longer an acceptable non-answer — providers serving the EU now have to describe what they collected and how.
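For collectors, the practical groundwork is a provenance log that can later be rolled up by content type, source, and collection method. The sketch below is illustrative only: `CollectionRecord`, its field names, and the label values are assumptions made for this example and do not reproduce the fields of the Commission's template.

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CollectionRecord:
    """One collected item. All field names here are hypothetical."""
    url: str
    content_type: str  # e.g. "text" or "image" (illustrative labels)
    method: str        # e.g. "crawl", "licensed", "public-dataset"
    collected_on: str  # ISO 8601 date string


def summarize(records):
    """Count records by content type, collection method, and source domain."""
    return {
        "content_types": dict(Counter(r.content_type for r in records)),
        "methods": dict(Counter(r.method for r in records)),
        "domains": dict(Counter(urlsplit(r.url).hostname for r in records)),
    }


if __name__ == "__main__":
    log = [
        CollectionRecord("https://example.com/a", "text", "crawl", "2026-05-01"),
        CollectionRecord("https://example.com/b", "text", "crawl", "2026-05-02"),
        CollectionRecord("https://example.org/c", "image", "public-dataset", "2026-05-03"),
    ]
    print(summarize(log))
```

Recording the method and date at collection time is the design point: a summary reconstructed after the fact is much harder to defend than one aggregated from a log.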

The penalties have real scale

  • €35M or 7% of global annual turnover, whichever is higher: prohibited AI practices
  • €15M or 3%: high-risk and most GPAI non-compliance
  • Market removal: national authorities can withdraw non-compliant systems from the EU
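To see how the "whichever is higher" rule scales, here is a short worked calculation. It is a sketch using only the figures quoted in this article; the tier names and the `fine_ceiling` function are hypothetical, and the result is a statutory maximum, not a prediction of any actual fine.

```python
# (fixed cap in EUR, percent of global annual turnover), as quoted above.
TIERS = {
    "prohibited": (35_000_000, 7),
    "high_risk_or_gpai": (15_000_000, 3),
}


def fine_ceiling(tier: str, global_turnover_eur: int) -> int:
    """Return the maximum fine: the higher of the fixed cap and the turnover share."""
    fixed_cap, percent = TIERS[tier]
    # Integer arithmetic avoids floating-point rounding on large amounts.
    return max(fixed_cap, global_turnover_eur * percent // 100)


if __name__ == "__main__":
    for turnover in (100_000_000, 2_000_000_000):
        for tier in TIERS:
            print(tier, turnover, fine_ceiling(tier, turnover))
```

With these figures, the percentage overtakes the fixed amount once global turnover passes €500 million in both tiers, so for large providers the turnover share is the number that matters.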

Crucially, the obligations bind providers wherever they are based if the system is placed on the EU market, so a US or Asian AI company serving EU users is in scope. This extraterritorial reach is why the Act is shaping data-collection practice globally, not just in Europe.

The May-2026 simplification agreement — don't bank on a delay

The timeline isn't entirely settled. A November 2025 Commission proposal floated pushing some deadlines toward late 2027, and on 7 May 2026 the Council and Parliament reached a political agreement (a "Digital Omnibus" simplification package) to streamline rules and extend certain high-risk timelines.

The honest read: a political agreement is not yet enacted law, and what's on the table is simplification of high-risk timelines — not a repeal of the GPAI transparency and copyright duties. Until the changes are actually in force, treat 2 August 2026 as the operative deadline rather than assuming relief that may not arrive.

What it means for data collectors

If you collect public web data — especially to train or fine-tune models for the EU market — the Act adds a compliance layer on top of ordinary scraping law:

  • Honor machine-readable opt-outs: robots.txt and ai.txt / TDM reservations now carry copyright weight under Article 4(3) of the EU Copyright Directive (2019/790), which the AI Act's copyright-policy duty for GPAI providers refers to.
  • Document sources & methods: the training-data template rewards collectors who already keep a clean provenance trail.
  • Stay on public, logged-out data: don't circumvent access barriers. Anti-circumvention risk (of the kind DMCA §1201 creates in the US) is separate from the Act and still applies.
  • Mind extraterritoriality: "we're not in the EU" is not a defense if your model serves EU users.
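The first item in the list above can be checked mechanically before any fetch. The sketch below uses Python's standard `urllib.robotparser`; the crawler name `ExampleTrainingBot`, the sample rules, and the `may_fetch` helper are assumptions for illustration. It covers robots.txt only, since ai.txt and other TDM reservation formats have no standard-library parser, and whether a given signal counts as a valid legal reservation is a question for counsel.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: blocks one (hypothetical) training crawler entirely
# and keeps /private/ off limits for everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse already-downloaded rules
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"
    print(may_fetch(SAMPLE_ROBOTS_TXT, "ExampleTrainingBot", url))
    print(may_fetch(SAMPLE_ROBOTS_TXT, "OtherBot", url))
```

Parsing rules that were fetched and stored separately keeps the check deterministic, and the stored copy doubles as evidence of what the site signaled at collection time.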

The infrastructure angle: compliant collection means behaving like a normal visitor on public pages — a real browser on a residential/mobile IP, honoring opt-outs, not defeating barriers. The regulation rewards transparency and restraint, which is exactly how high-trust mobile-IP collection already works.

Collect public data the compliant way

Honor opt-outs, document sources, stay on public pages — as a real visitor on real 4G/5G carrier IPs across 20+ countries.