On 2 August 2026 the EU AI Act stops being a paper deadline and gains real teeth — GPAI enforcement, high-risk obligations, and a mandatory training-data disclosure tied to the EU copyright opt-out. Here's the researched, no-hype guide to the timeline, the penalties, the May-2026 simplification agreement, and what it means for anyone collecting public web data.
General information, not legal advice. The AI Act is complex and still evolving (see the May-2026 simplification agreement below). Consult qualified EU counsel for your situation.
2 August 2026 turns on high-risk obligations and GPAI enforcement (fines up to €15M or 3% of global turnover; up to €35M or 7% for prohibited uses). GPAI providers must publish a training-data summary using the Commission template and respect the EU copyright opt-out, which is expressed through machine-readable signals such as robots.txt and ai.txt. A May-2026 agreement may streamline or extend some high-risk timelines, but until it is law, treat August 2026 as operative. The Act doesn't ban scraping; it adds transparency and opt-out duties for AI training.
The AI Act doesn't switch on all at once. It rolls out in stages, and 2 August 2026 is the stage where the rules that touch AI data the most acquire enforcement power:
2 February 2025: The banned uses (e.g. social scoring, certain biometric categorization) became prohibited.
2 August 2025: General-purpose AI model duties (transparency, copyright policy, training-data summary) took effect. The voluntary GPAI Code of Practice was finalized on 10 July 2025.
2 August 2026: High-risk (Annex III) obligations apply, and the AI Office's GPAI enforcement powers (information requests, model evaluation, fines) switch on. This is the date with teeth.
2 August 2027: Remaining obligations, including those for AI embedded in regulated products, apply. Pre-existing GPAI models must have published their training-data summary by this date.
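To make the fine ceilings concrete, here is a minimal arithmetic sketch using the figures quoted in this guide (€15M or 3% for GPAI obligations, €35M or 7% for prohibited uses). It assumes the ceiling is whichever of the two amounts is higher, which is how the Act words it for larger undertakings; smaller companies are treated differently, and the actual fine is set case by case below the ceiling. The names `FINE_CAPS` and `max_fine_eur` are hypothetical.

```python
# Ceilings quoted in this guide: (fixed amount in EUR, percent of global annual turnover).
FINE_CAPS = {
    "gpai": (15_000_000, 3),
    "prohibited_use": (35_000_000, 7),
}


def max_fine_eur(category: str, global_turnover_eur: int) -> int:
    """Upper bound of a fine, ASSUMING the 'whichever is higher' rule applies."""
    fixed_cap, percent = FINE_CAPS[category]
    turnover_cap = global_turnover_eur * percent // 100
    return max(fixed_cap, turnover_cap)


# A provider with EUR 2bn turnover: 3% (EUR 60M) exceeds the EUR 15M fixed amount.
print(max_fine_eur("gpai", 2_000_000_000))  # 60000000
# A provider with EUR 100M turnover: 3% is only EUR 3M, so the fixed amount governs.
print(max_fine_eur("gpai", 100_000_000))    # 15000000
```

The point of the sketch is simply that for large providers the percentage, not the headline euro figure, sets the exposure.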
The provision most relevant to web data: every GPAI model provider must publish a summary of the content used to train the model, using a mandatory template the European Commission released. The template asks providers to disclose, in a structured way, the types of content, the data sources, and the methods of collection — including large public datasets and scraped web data.
The summary requirement took effect on 2 August 2025 for new models; models already on the market before that date have until 2 August 2027 to publish theirs. The practical effect is that "we scraped the web" is no longer an acceptable non-answer — providers serving the EU now have to describe what they collected and how.
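One practical consequence is that collection pipelines need to record provenance as they go, so the summary can be produced later. The sketch below shows one way to log per-source records and roll them up by content type, collection method, and source. Every field and function name here is hypothetical: this is not the Commission template's schema, only an illustration of keeping the three kinds of information the template asks about.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    # Hypothetical fields, NOT the official template's schema.
    source: str             # a domain or dataset name
    content_type: str       # e.g. "text", "image", "code"
    collection_method: str  # e.g. "web_crawl", "public_dataset", "licensed"
    items: int              # number of documents collected from this source


def summarize(records: list[SourceRecord]) -> dict[str, dict[str, int]]:
    """Aggregate item counts by content type, collection method, and source."""
    by_type, by_method, by_source = Counter(), Counter(), Counter()
    for record in records:
        by_type[record.content_type] += record.items
        by_method[record.collection_method] += record.items
        by_source[record.source] += record.items
    return {
        "content_types": dict(by_type),
        "collection_methods": dict(by_method),
        "sources": dict(by_source),
    }


# Illustrative data only; the sources and counts are made up.
log = [
    SourceRecord("example.org", "text", "web_crawl", 1200),
    SourceRecord("example.net", "text", "web_crawl", 300),
    SourceRecord("open-corpus-demo", "code", "public_dataset", 500),
]
print(summarize(log))
```

Capturing these fields at collection time is far cheaper than reconstructing them from raw dumps once a disclosure is due.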
GPAI providers must also have a policy to comply with EU copyright law, and that includes respecting the text-and-data-mining (TDM) opt-out under Article 4 of the 2019 Copyright Directive. Rightsholders can reserve their works from data mining — and the established way to express that reservation at web scale is machine-readable: robots.txt, TDM metadata, and emerging signals like ai.txt.
This is the link between the regulation and the plumbing: the AI Act gives legal consequence to the opt-out signals we break down in our guide to robots.txt vs llms.txt vs ai.txt. An opt-out that was once a polite request becomes evidence of a reservation that a compliant GPAI provider must honor.
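As a rough illustration of reading one of these signals, the sketch below uses Python's standard-library `urllib.robotparser` to check which AI crawler user agents a robots.txt file allows. The `AI_CRAWLERS` list is an illustrative, incomplete selection of publicly documented crawler tokens, and `ai_crawler_access` is a hypothetical helper name. The sample rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Illustrative, incomplete list of AI crawler user-agent tokens (an assumption).
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]


def ai_crawler_access(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Return, per crawler token, whether robots.txt permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


# A site that blocks GPTBot everywhere and allows every other agent.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

for agent, allowed in ai_crawler_access(sample).items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A result of "allowed" only means robots.txt contains no blocking rule for that agent. It says nothing about reservations expressed through TDM metadata or ai.txt, so it is a first check rather than a compliance determination.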
Curious whether a given site already signals an AI opt-out? Our free AI Crawler Checker reads a domain's robots.txt and shows which AI crawlers it allows or blocks.
Crucially, the obligations bind providers wherever they are based, as long as the model or system is placed on the EU market, so a US or Asian AI company serving EU users is in scope. This extraterritorial reach is why the Act is shaping data-collection practice globally, not just in Europe.
The timeline isn't entirely settled. A November 2025 Commission proposal floated pushing some deadlines toward late 2027, and on 7 May 2026 the Council and Parliament reached a political agreement (a "Digital Omnibus" simplification package) to streamline rules and extend certain high-risk timelines.
The honest read: a political agreement is not yet enacted law, and what's on the table is simplification of high-risk timelines — not a repeal of the GPAI transparency and copyright duties. Until the changes are actually in force, treat 2 August 2026 as the operative deadline rather than assuming relief that may not arrive.
If you collect public web data, especially to train or fine-tune models for the EU market, the Act adds a compliance layer on top of ordinary scraping law: documenting what you collected and how, and honoring machine-readable opt-outs.
The infrastructure angle: compliant collection means behaving like a normal visitor on public pages — a real browser on a residential/mobile IP, honoring opt-outs, not defeating barriers. The regulation rewards transparency and restraint, which is exactly how high-trust mobile-IP collection already works.