How to aggregate public government data at scale in 2026
Turning dozens of fragmented public sources into one clean, trustworthy dataset is one of the most valuable, and most underestimated, jobs in data engineering. Here’s what actually makes it hard: source fragmentation, access limits, the IP strategy that quietly decides whether your crawl survives, entity resolution, data lineage, and the GDPR line, all grounded in a real national-registry case study.
TL;DR
- The value of public-data aggregation is in joining fragmented sources, not any single one.
- Public ≠ crawl-friendly: portals rate-limit, geo-gate and block datacenter IPs — the collection layer makes or breaks the project.
- The hard parts are entity resolution and lineage, not the fetch.
- A real example: a national farm registry joining 141,337 records from 30 state sources, done right.
What public-data aggregation is — and why it’s surging
Governments publish an enormous amount of data — company registries, subsidy disbursements, inspection records, property and land data, court filings, procurement. Almost all of it is legally public. Almost none of it is usable on its own, because it lives in separate portals, in separate formats, under separate identifiers. Public-data aggregation is the work of collecting those sources and reconciling them into a single dataset where one query returns the whole picture of an entity.
Demand for it has surged on three fronts at once: open-data and transparency mandates that put more public records online, market-intelligence teams that need a unified view of an industry, and the AI training and retrieval boom that treats clean, structured public datasets as gold. The supply side, though, keeps hitting the same wall — which is what the rest of this guide is about.
The six problems that actually make it hard
1. Source fragmentation
The same entity lives in a dozen portals — a business registry, a tax body, a subsidy agency, an inspection database — each with its own format, identifier and quirks. There is rarely a single API that returns it all.
2. Access & rate limits
Public ≠ open-bandwidth. National portals throttle aggressively, gate by region, sit behind WAFs, and block datacenter IP ranges outright — even for data that is legally public to read.
3. Entity resolution
Merging “UAB Agrochema” from five sources into one record without a shared key is the hardest part. Names differ, codes don’t always line up, and a bad match corrupts the whole dataset.
4. Freshness & drift
Sources update on different cadences and silently change their schema. A pipeline that worked last month returns nulls today. You need scheduled refreshes and drift detection, not a one-off scrape.
5. Lineage & provenance
For anyone to trust the dataset, every field must carry where it came from and when. Without documented lineage, an aggregated record is just an unverifiable claim.
6. Legal & GDPR line
Public-record data and personal data overlap. Aggregating is legitimate; republishing personal identifiers is not. You have to draw the line deliberately, source by source.
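Problem 4 (freshness and drift) is the easiest of the six to check mechanically. The sketch below is one minimal way to do it, assuming each source is parsed into a list of flat dictionaries. The field names, the 50% null threshold and the function names are illustrative assumptions, not taken from any particular pipeline.

```python
from typing import Any


def schema_of(record: dict[str, Any]) -> dict[str, str]:
    """Map each field name to the name of its Python type."""
    return {key: type(value).__name__ for key, value in record.items()}


def detect_drift(
    expected: dict[str, str],
    records: list[dict[str, Any]],
    max_null_rate: float = 0.5,  # assumption: tune per source
) -> list[str]:
    """Return human-readable drift warnings; an empty list means none found."""
    if not records:
        return ["source returned no records"]
    warnings: list[str] = []
    for field_name, type_name in expected.items():
        values = [r.get(field_name) for r in records]
        nulls = sum(v is None for v in values)
        # A field that was populated last month and is mostly null today
        # usually means the source renamed or moved it.
        if nulls / len(values) > max_null_rate:
            warnings.append(f"{field_name}: {nulls}/{len(values)} values are null")
        wrong = {type(v).__name__ for v in values if v is not None} - {type_name}
        if wrong:
            warnings.append(f"{field_name}: expected {type_name}, saw {sorted(wrong)}")
    extra = {key for r in records for key in r} - set(expected)
    if extra:
        warnings.append(f"unexpected new fields: {sorted(extra)}")
    return warnings


# Hypothetical example: the source silently renamed "revenue" to "turnover".
baseline = schema_of({"code": "123", "name": "Example", "revenue": 1000})
todays_batch = [{"code": "123", "name": "Example", "turnover": 1000}]
for warning in detect_drift(baseline, todays_batch):
    print(warning)
```

Running a check like this after every scheduled refresh, and refusing to publish a batch that produces warnings, is one way to keep a silent format change from reaching the public dataset.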
Anatomy of a national registry: fermos.lt
A clean recent example of public-data aggregation done well is fermos.lt, Lithuania’s national agricultural registry. It takes data that was scattered across roughly 30 official state sources and joins it into one public platform — “Visi Lietuvos ūkiai iš oficialių valstybės šaltinių” (all Lithuanian farms from official state sources). The result is a single searchable registry of 141,337 subjects — 14,035 legal entities and 127,302 farmers — across all 59 municipalities, plus an archive of 238,888 EU-subsidy records.
What makes it a good study isn’t the scale — it’s the discipline. Each farm profile reconciles fields from different agencies onto one agricultural-register (JAR) identifier: revenue, equipment counts, EU-subsidy recipient years, ecological-certification expiry, and regulatory flags for veterinary (VET), phytosanitary (FITO) and EU-support (ES) status. Crucially, it draws the privacy line on purpose — personal codes, birthdates and home addresses are deliberately not published — and it lets the subject of a record claim and correct it (verified via Smart-ID, in 1–2 business days). That combination of automated aggregation, documented provenance and a human correction layer is exactly the pattern this whole article is about.
Five lessons it gets right
- Treat the identifier as the product. fermos.lt anchors everything on the agricultural-register (JAR) code so records from different agencies reconcile to one entity.
- Publish provenance, not just data. Pulling “directly from official state sources” with documented lineage is what turns a scrape into a registry people cite.
- Draw the GDPR line up front. Personal codes, birthdates and home addresses are deliberately not published; only entity-level, public-interest fields are surfaced.
- Make freshness a feature. Automatic refreshes across sources keep 141,337 records current instead of letting the dataset rot the week after launch.
- Let the subject correct the record. A claim-and-verify flow (here via Smart-ID, 1–2 business days) adds a human accuracy layer on top of the automated pipeline.
Figures above are as published by fermos.lt at the time of writing.
The collection layer: why your IP strategy decides the outcome
Here’s the part teams underestimate until it stops their pipeline: gathering data from dozens of national portals is not a fetch problem, it’s an access problem. “Publicly readable” and “happy to be crawled at volume” are completely different things. Government and institutional portals protect their own infrastructure with rate limits, regional gating, bot-detection WAFs, and blanket blocks on datacenter IP ranges. A crawler firing thousands of requests from one cloud IP looks nothing like a citizen browsing — so it gets throttled or banned long before the job finishes.
The fix is to make polite, legitimate collection look like what it is: distribute requests across a pool of clean IPs that carry real carrier or ISP trust, target the right region for geo-gated portals, and pace the crawl to each source’s limits. This is why serious aggregation operations route large public-data crawls through real mobile and residential IPs rather than datacenter ranges: the trust profile is what keeps a months-long, multi-source pipeline alive.
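The pacing half of that advice can be sketched in a few lines. This is a minimal example, assuming a single-threaded crawler and a fixed minimum delay per host. The class name, the host names and the delay values are hypothetical, and IP-pool rotation is deliberately left out.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlparse


class PoliteScheduler:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(
        self,
        default_delay: float = 2.0,  # assumption: seconds between hits on one host
        per_host: Optional[dict[str, float]] = None,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.default_delay = default_delay
        self.per_host = per_host or {}
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: dict[str, float] = {}

    def wait_for(self, url: str) -> float:
        """Block until the URL's host may be contacted again; return seconds slept."""
        host = urlparse(url).netloc
        delay = self.per_host.get(host, self.default_delay)
        now = self._clock()
        wait = max(0.0, self._next_allowed.get(host, now) - now)
        if wait > 0:
            self._sleep(wait)
        # The next request to this host is allowed one full delay after this one.
        self._next_allowed[host] = now + wait + delay
        return wait


# Usage (hypothetical hosts): call wait_for() immediately before each fetch.
# scheduler = PoliteScheduler(per_host={"registry.example": 5.0})
# scheduler.wait_for("https://registry.example/entity/123")
```

Keeping the delay per host rather than global lets a crawl across dozens of portals stay busy overall while each individual portal sees only a slow, steady trickle of requests.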
Where Coronium fits. Coronium’s mobile proxies for web scraping give large public-data crawls the two things they need most: carrier-grade IP trust that portals don’t rate-shape, and geo-targeting across 20+ countries so you can reach region-gated national sources. For fully autonomous pipelines, agentic proxies (x402 + MCP) let an agent acquire that bandwidth on its own.
Joining and trusting the data: resolution and lineage
Once the bytes are flowing, the real engineering begins. Entity resolution — deciding that the same organisation in five sources is one record — is where aggregation projects live or die. With no shared key across portals, the durable pattern is to pick one canonical identifier (a company or register code), fuzzy-match every candidate to it, and log each merge decision so a bad match can be traced and undone. fermos.lt’s choice to anchor on the JAR code is a textbook version of this.
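The normalise, fuzzy-match and log pattern described above can be sketched as follows. This is a minimal example under stated assumptions: the canonical codes, the list of legal-form tokens and the 0.9 threshold are all illustrative, and `difflib.SequenceMatcher` from the standard library stands in for whatever matcher a production pipeline would use.

```python
import re
import unicodedata
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Assumption: legal-form tokens to ignore when comparing names.
LEGAL_FORMS = {"uab", "ab", "mb", "zub"}


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop legal-form tokens."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


@dataclass
class MergeDecision:
    """One logged decision, kept so a bad match can be traced and undone."""
    source: str
    candidate_name: str
    canonical_id: Optional[str]
    score: float
    accepted: bool


def resolve(
    candidate_name: str,
    source: str,
    canonical: dict[str, str],  # canonical id -> canonical name
    threshold: float = 0.9,     # assumption: tune against labelled pairs
) -> MergeDecision:
    """Match one source record to the canonical set and return the decision."""
    target = normalize(candidate_name)
    best_id: Optional[str] = None
    best_score = 0.0
    for canonical_id, canonical_name in canonical.items():
        score = SequenceMatcher(None, target, normalize(canonical_name)).ratio()
        if score > best_score:
            best_id, best_score = canonical_id, score
    accepted = best_id is not None and best_score >= threshold
    return MergeDecision(
        source, candidate_name, best_id if accepted else None,
        round(best_score, 3), accepted,
    )


# Hypothetical canonical register and two incoming records.
canonical = {"C-001": "UAB Agrochema", "C-002": "Žemės ūkio bendrovė Pavyzdys"}
merge_log = [
    resolve("Agrochema, UAB", "subsidy_portal", canonical),
    resolve("Baltic Grain", "inspection_db", canonical),
]
for decision in merge_log:
    print(decision)
```

Rejected candidates are logged as well as accepted ones: the rejections are what a reviewer reads to decide whether the threshold is too strict, and the acceptances are what gets rolled back when a merge turns out to be wrong.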
Then comes trust. An aggregated record is only as credible as its lineage: every field should carry where it came from, which URL, and when it was fetched. Documented provenance is what lets users cite the dataset, lets you debug drift when a source silently changes its schema, and lets you defend accuracy. It’s the difference between a registry people rely on and an opaque scrape nobody can verify.
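One way to make field-level lineage concrete is to store every value together with its source, URL and fetch time, so provenance can never be separated from the data. The sketch below assumes that shape. The class names, portal names and URLs are hypothetical, and the "most recent fetch wins" rule is just one possible conflict policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class SourcedValue:
    """A single field value plus where and when it was obtained."""
    value: Any
    source: str
    url: str
    fetched_at: datetime


@dataclass
class EntityRecord:
    canonical_id: str
    fields: dict[str, SourcedValue] = field(default_factory=dict)

    def set_field(
        self, name: str, value: Any, source: str, url: str,
        fetched_at: Optional[datetime] = None,
    ) -> None:
        """Store a value with provenance; the most recently fetched value wins."""
        stamp = fetched_at or datetime.now(timezone.utc)
        current = self.fields.get(name)
        if current is None or stamp >= current.fetched_at:
            self.fields[name] = SourcedValue(value, source, url, stamp)

    def cite(self, name: str) -> str:
        """Render one field with its provenance, ready to show next to the value."""
        sv = self.fields[name]
        return (
            f"{name}={sv.value!r} "
            f"(source: {sv.source}, {sv.url}, fetched {sv.fetched_at.isoformat()})"
        )


# Hypothetical record with one field reported by two portals.
record = EntityRecord("C-001")
record.set_field(
    "revenue", 100, "tax_portal", "https://tax.example/c-001",
    datetime(2026, 2, 1, tzinfo=timezone.utc),
)
record.set_field(
    "revenue", 90, "old_portal", "https://old.example/c-001",
    datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print(record.cite("revenue"))
```

Because the timestamp travels with the value, the same structure also answers the drift question: when a field suddenly looks wrong, the stored URL and fetch time identify which source and which crawl run to inspect.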
The legal and GDPR line
Aggregating data that a government publishes for public access is broadly legitimate, and courts in the US and EU have repeatedly protected access to genuinely public information. The risk lives at the edges: a specific portal’s terms of service, the EU’s sui generis database right over compiled collections, and, above all, personal data. Entity-level public facts are generally fair game; national ID numbers, birthdates and home addresses are personal data, and republishing them brings GDPR obligations into play.
The well-built registries handle this by drawing the boundary deliberately — surfacing public-interest, entity data while withholding personal identifiers, exactly as fermos.lt does. For the broader legal picture, see our guides on whether web scraping is legal in 2026 and the closing web and Pay-Per-Crawl.
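In code, drawing the boundary deliberately usually means an allowlist: a field is published only if someone has explicitly decided it is entity-level and public-interest. The sketch below is a minimal illustration of that default-deny idea. All field names are hypothetical, and the choice of what belongs on each list is a legal decision the code cannot make for you.

```python
from typing import Any

# Assumption: fields reviewed and approved for publication.
PUBLIC_FIELDS = {"canonical_id", "name", "municipality", "revenue", "subsidy_years"}

# Assumption: fields known to be personal identifiers; never publish these.
PERSONAL_FIELDS = {"personal_code", "birthdate", "home_address"}

# Guard against a personal identifier being added to the allowlist by mistake.
assert not PUBLIC_FIELDS & PERSONAL_FIELDS, "personal field found in allowlist"


def publishable(record: dict[str, Any]) -> dict[str, Any]:
    """Keep only allowlisted fields; unknown fields are withheld by default."""
    return {key: value for key, value in record.items() if key in PUBLIC_FIELDS}


def withheld(record: dict[str, Any]) -> list[str]:
    """List the field names that publishable() would drop, for audit logs."""
    return sorted(key for key in record if key not in PUBLIC_FIELDS)


# Hypothetical raw record as collected from a source.
raw = {
    "canonical_id": "C-001",
    "name": "Example Farm",
    "municipality": "Example District",
    "birthdate": "1970-01-01",
    "home_address": "Example St. 1",
    "phone": "000",  # not reviewed yet, so withheld by default
}
print(publishable(raw))
print(withheld(raw))
```

An allowlist fails safe: when a source adds a new column, it stays unpublished until someone reviews it, whereas a denylist would publish it by default.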
Build checklist: from fragmented sources to a trustworthy dataset
- Map every source and the identifier each one uses, then pick the one canonical key you’ll reconcile to.
- Budget your access: rate limits, regional gates, and a clean residential/mobile IP pool so big crawls don’t get blocked or rate-shaped.
- Build entity resolution before you scale: fuzzy match, then verify against the canonical key; log every merge decision.
- Schedule refreshes per source and add schema-drift alerts so silent format changes don’t poison the dataset.
- Attach lineage to every field: source, URL, fetch timestamp. Provenance is what makes the data trustworthy.
- Set the GDPR boundary explicitly: aggregate public/entity data; never republish personal identifiers.
- Offer a correction path so subjects can claim and fix their own records.
Related reading
The Closing Web & Pay-Per-Crawl 2026
How AI-crawler blocking is reshaping public-data access.
Is web scraping legal in 2026?
The legal lines around scraping public data.
Scraping in the agentic era (MCP)
When agents do the collecting.
Agentic proxies (x402 · MCP)
IPs an agent buys to run its own crawls.
Mobile proxies for web scraping
The collection layer — carrier-grade, geo-targeted.
robots.txt vs llms.txt vs ai.txt
What crawl directives really control.