What is public data aggregation?

Public data aggregation is the practice of collecting information that is legally public but scattered across many separate sources — government registries, regulators, subsidy agencies, court and tax bodies, open-data portals — and reconciling it into a single, queryable dataset. The value is not in any one source; it’s in joining them so that one record shows the full picture of an entity. National business registries, property datasets, and farm registries are all examples.

Is it legal to scrape and aggregate public government data?

Reading and aggregating data that a government publishes for public access is generally legitimate, and case law in the US and EU has repeatedly protected access to genuinely public information. The legal risk lives at the edges: terms of service on a specific portal, copyright and database rights in compiled collections (notably the EU sui generis database right, which is separate from copyright), and, most importantly, personal data. Aggregating entity-level public facts is one thing; republishing personal identifiers such as national ID numbers or home addresses can trigger GDPR. See our deeper write-ups on whether web scraping is legal in 2026 and the closing web.

Why do large public-data crawls get blocked even when the data is public?

Because “publicly readable” and “happy to be crawled at volume” are different things. National portals protect their own infrastructure with rate limits, regional gating, bot-detection WAFs, and outright blocks on datacenter IP ranges. A scraper hammering from a single cloud IP looks nothing like a citizen browsing, so it gets throttled or banned long before you finish. Distributing requests across clean residential or mobile IPs at a polite rate is what keeps a large, legitimate aggregation crawl alive.

What is entity resolution and why is it the hard part?

Entity resolution is deciding that “UAB Agrochema” in the tax registry, the subsidy database, and the inspection record are all the same organisation, and merging them into one record. It’s hard because sources use different names, formats and identifiers, and there’s often no shared key. A wrong match silently corrupts the dataset. The winning pattern is to anchor on one canonical identifier (a company or register code), fuzzy-match candidates to it, and log every merge decision for audit.

What is data lineage and why does it matter for aggregated data?

Data lineage (or provenance) is the record of where each value came from and when — which source, which URL, which fetch timestamp. For aggregated public data it’s essential: it’s what lets a user trust a field, lets you debug drift when a source changes, and lets you defend the dataset’s accuracy. A registry that publishes “directly from official state sources” with documented lineage is far more credible than an opaque scrape.

How do you keep an aggregated dataset fresh?

You schedule per-source refreshes on the cadence each source actually changes (some weekly, some monthly), and you add schema-drift detection so a silent format change raises an alert instead of quietly returning nulls. Freshness is a feature, not a one-time event — the value of a registry collapses the moment it’s out of date.
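As a rough illustration of schema-drift detection, the sketch below compares one refresh of a source against the fields and types you expect and returns human-readable alerts. The function name `detect_drift`, the field names and the 20% null-rate threshold are all assumptions made for this example, not part of any particular pipeline.

```python
def detect_drift(expected, rows, max_null_rate=0.2):
    """Compare fetched rows against an expected schema and return alert strings.

    expected: dict mapping field name -> Python type (hypothetical schema)
    rows:     list of dicts parsed from one refresh of a single source
    """
    alerts = []
    seen = set()
    for row in rows:
        seen.update(row)

    # Fields that disappeared or appeared usually mean the source was redesigned.
    for field in sorted(set(expected) - seen):
        alerts.append(f"missing field: {field}")
    for field in sorted(seen - set(expected)):
        alerts.append(f"unexpected field: {field}")

    for field, typ in expected.items():
        if field not in seen:
            continue
        values = [row.get(field) for row in rows]
        nulls = sum(v is None for v in values)
        # A jump in nulls is the classic symptom of a silent format change.
        if nulls / len(values) > max_null_rate:
            alerts.append(f"null rate {nulls}/{len(values)} for field: {field}")
        if any(v is not None and not isinstance(v, typ) for v in values):
            alerts.append(f"type change for field: {field}")
    return alerts


# Made-up rows: the source renamed "name" to "title" and now sends revenue as text.
expected = {"code": str, "name": str, "revenue": int}
rows = [
    {"code": "1", "title": "A", "revenue": "10"},
    {"code": "2", "title": "B", "revenue": None},
]
for alert in detect_drift(expected, rows):
    print(alert)
```

Running a check like this after every scheduled refresh, and refusing to publish a refresh that raises alerts, is one way to make a format change loud instead of silent.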

What role do proxies play in public-data aggregation?

They are the collection layer. To gather data from dozens of national portals without being rate-shaped or IP-banned, you distribute requests across a pool of clean IPs that look like real users and let you target the right region. Mobile and residential IPs carry the carrier/ISP trust that datacenter ranges lack, which is why serious data-collection operations route large public-data crawls through them. Coronium’s mobile proxies are built for exactly this kind of high-trust, geo-targeted collection.

Can AI agents do public-data aggregation autonomously?

Increasingly, yes — the same collection and reconciliation steps can be driven by an autonomous agent that fetches, parses and merges sources on a schedule, and even acquires the proxy bandwidth it needs on its own. We cover that shift in our guides on agentic proxies (x402 + MCP) and scraping in the agentic era.

2026 · 13 min read

How to aggregate public government data at scale in 2026

Turning dozens of fragmented public sources into one clean, trustworthy dataset is one of the most valuable, and most underestimated, jobs in data engineering. Here’s what actually makes it hard: source fragmentation, access limits, the IP strategy that quietly decides whether your crawl survives, entity resolution, data lineage, and the GDPR line. A real national-registry case study grounds each point.

TL;DR

The value of public-data aggregation is in joining fragmented sources, not any single one.
Public ≠ crawl-friendly: portals rate-limit, geo-gate and block datacenter IPs — the collection layer makes or breaks the project.
The hard parts are entity resolution and lineage, not the fetch.
A real example: a national farm registry joining 141,337 records from roughly 30 state sources, done right.

What public-data aggregation is — and why it’s surging

Governments publish an enormous amount of data — company registries, subsidy disbursements, inspection records, property and land data, court filings, procurement. Almost all of it is legally public. Almost none of it is usable on its own, because it lives in separate portals, in separate formats, under separate identifiers. Public-data aggregation is the work of collecting those sources and reconciling them into a single dataset where one query returns the whole picture of an entity.

Demand for it has surged on three fronts at once: open-data and transparency mandates that put more public records online, market-intelligence teams that need a unified view of an industry, and the AI training and retrieval boom that treats clean, structured public datasets as gold. The supply side, though, keeps hitting the same wall — which is what the rest of this guide is about.

The six problems that actually make it hard

1. Source fragmentation

The same entity lives in a dozen portals — a business registry, a tax body, a subsidy agency, an inspection database — each with its own format, identifier and quirks. There is rarely a single API that returns it all.

2. Access & rate limits

Public ≠ open-bandwidth. National portals throttle aggressively, gate by region, sit behind WAFs, and block datacenter IP ranges outright — even for data that is legally public to read.

3. Entity resolution

Merging “UAB Agrochema” from five sources into one record without a shared key is the hardest part. Names differ, codes don’t always line up, and a bad match corrupts the whole dataset.

4. Freshness & drift

Sources update on different cadences and silently change their schema. A pipeline that worked last month returns nulls today. You need scheduled refreshes and drift detection, not a one-off scrape.

5. Lineage & provenance

For anyone to trust the dataset, every field must carry where it came from and when. Without documented lineage, an aggregated record is just an unverifiable claim.

6. Legal & GDPR line

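Public-record data and personal data overlap. Aggregating entity-level public facts is generally legitimate; republishing personal identifiers such as national ID numbers, birthdates or home addresses can trigger GDPR. You have to draw the line deliberately, source by source.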

Case study

Anatomy of a national registry: fermos.lt

A clean recent example of public-data aggregation done well is fermos.lt, Lithuania’s national agricultural registry. It takes data that was scattered across roughly 30 official state sources and joins it into one public platform — “Visi Lietuvos ūkiai iš oficialių valstybės šaltinių” (all Lithuanian farms from official state sources). The result is a single searchable registry of 141,337 subjects — 14,035 legal entities and 127,302 farmers — across all 59 municipalities, plus an archive of 238,888 EU-subsidy records.

What makes it a good study isn’t the scale — it’s the discipline. Each farm profile reconciles fields from different agencies onto one agricultural-register (JAR) identifier: revenue, equipment counts, EU-subsidy recipient years, ecological-certification expiry, and regulatory flags for veterinary (VET), phytosanitary (FITO) and EU-support (ES) status. Crucially, it draws the privacy line on purpose — personal codes, birthdates and home addresses are deliberately not published — and it lets the subject of a record claim and correct it (verified via Smart-ID, in 1–2 business days). That combination of automated aggregation, documented provenance and a human correction layer is exactly the pattern this whole article is about.

Five lessons it gets right

Treat the identifier as the product. fermos.lt anchors everything on the agricultural-register (JAR) code so records from different agencies reconcile to one entity.
Publish provenance, not just data. Pulling “directly from official state sources” with documented lineage is what turns a scrape into a registry people cite.
Draw the GDPR line up front. Personal codes, birthdates and home addresses are deliberately not published; only entity-level, public-interest fields are surfaced.
Make freshness a feature. Automatic refreshes across sources keep 141,337 records current instead of letting the dataset rot the week after launch.
Let the subject correct the record. A claim-and-verify flow (here via Smart-ID, 1–2 business days) adds a human accuracy layer on top of the automated pipeline.

Figures above are as published by fermos.lt at the time of writing.

The collection layer: why your IP strategy decides the outcome

Here’s the part teams underestimate until it stops their pipeline: gathering data from dozens of national portals is not a fetch problem, it’s an access problem. “Publicly readable” and “happy to be crawled at volume” are completely different things. Government and institutional portals protect their own infrastructure with rate limits, regional gating, bot-detection WAFs, and blanket blocks on datacenter IP ranges. A crawler firing thousands of requests from one cloud IP looks nothing like a citizen browsing — so it gets throttled or banned long before the job finishes.

The fix is to make legitimate, polite collection look legitimate: distribute requests across a pool of clean IPs that carry real carrier or ISP trust, target the right region for geo-gated portals, and pace the crawl to each source’s limits. This is precisely why serious aggregation operations route large public-data crawls through real mobile and residential IPs rather than datacenter ranges — the trust profile is what keeps a months-long, multi-source pipeline alive.
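Pacing the crawl to each source’s limits can be sketched as a small per-host scheduler. This is a minimal illustration only: the class name `PoliteScheduler`, the hostnames and the intervals are assumptions, and a real crawl would take its intervals from each portal’s published limits or observed behaviour.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Tracks the earliest time each host may be requested again."""

    def __init__(self, min_interval_s, default_interval_s=5.0, clock=time.monotonic):
        self.min_interval_s = dict(min_interval_s)  # host -> seconds between requests
        self.default_interval_s = default_interval_s
        self.clock = clock
        self._next_allowed = {}                     # host -> clock value of next free slot

    def wait_time(self, url):
        """Seconds to wait before requesting `url` (0.0 if it may be sent now)."""
        host = urlparse(url).hostname
        now = self.clock()
        return max(0.0, self._next_allowed.get(host, now) - now)

    def record_request(self, url):
        """Mark a request to `url` as sent and reserve the next slot for its host."""
        host = urlparse(url).hostname
        interval = self.min_interval_s.get(host, self.default_interval_s)
        self._next_allowed[host] = self.clock() + interval


# Hypothetical hosts and intervals, for illustration only.
scheduler = PoliteScheduler({"registry.example": 2.0, "subsidies.example": 10.0})
url = "https://registry.example/entity/1"
time.sleep(scheduler.wait_time(url))  # 0.0 on the first request to a host
scheduler.record_request(url)         # the actual fetch would happen around here
```

Keeping the pacing state per host rather than global means a slow, strict portal does not hold back requests to the other sources in the same run.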

Where Coronium fits. Coronium’s mobile proxies for web scraping give large public-data crawls the two things they need most: carrier-grade IP trust that portals don’t rate-shape, and geo-targeting across 20+ countries so you can reach region-gated national sources. For fully autonomous pipelines, agentic proxies (x402 + MCP) let an agent acquire that bandwidth on its own.

Joining and trusting the data: resolution and lineage

Once the bytes are flowing, the real engineering begins. Entity resolution — deciding that the same organisation in five sources is one record — is where aggregation projects live or die. With no shared key across portals, the durable pattern is to pick one canonical identifier (a company or register code), fuzzy-match every candidate to it, and log each merge decision so a bad match can be traced and undone. fermos.lt’s choice to anchor on the JAR code is a textbook version of this.
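The anchor, fuzzy-match and log pattern can be sketched with nothing but the Python standard library. Everything specific here is an assumption for illustration: the register codes are invented, the 0.9 threshold is arbitrary, and `difflib.SequenceMatcher` stands in for whatever similarity measure a production pipeline would actually use.

```python
import difflib
import unicodedata


def normalize(name):
    """Strip accents and punctuation, lowercase, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


def resolve(candidate, canonical, threshold=0.9):
    """Return (matched_code_or_None, decision), where decision is an audit-log entry.

    canonical: dict mapping register code -> canonical name
    candidate: dict with "name", "source" and optionally "code"
    """
    # An exact hit on the canonical code always beats name similarity.
    code = candidate.get("code")
    if code in canonical:
        return code, {"source": candidate["source"], "matched": code,
                      "method": "code", "score": 1.0}

    target = normalize(candidate["name"])
    best_code, best_score = None, 0.0
    for reg_code, reg_name in canonical.items():
        score = difflib.SequenceMatcher(None, target, normalize(reg_name)).ratio()
        if score > best_score:
            best_code, best_score = reg_code, score

    # Below the threshold the candidate stays unmatched rather than guessed.
    matched = best_code if best_score >= threshold else None
    return matched, {"source": candidate["source"], "matched": matched,
                     "method": "fuzzy", "score": round(best_score, 3)}


# Invented codes; the names only differ in quoting and punctuation.
canonical = {"100001": "UAB Agrochema", "100002": "UAB Agrolinija"}
merge_log = []
code, decision = resolve({"name": "UAB „Agrochema“", "source": "subsidy_db"}, canonical)
merge_log.append(decision)  # every decision is kept so a bad merge can be traced
```

Leaving low-scoring candidates unmatched and recording the score for every decision is what makes a wrong merge traceable and reversible instead of silent.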

Then comes trust. An aggregated record is only as credible as its lineage: every field should carry where it came from, which URL, and when it was fetched. Documented provenance is what lets users cite the dataset, lets you debug drift when a source silently changes its schema, and lets you defend accuracy. It’s the difference between a registry people rely on and an opaque scrape nobody can verify.
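One way to make lineage concrete is to store provenance next to every value instead of in a separate log. The sketch below is only an illustration: the `FieldValue` type, the "newest fetch wins" merge rule, the URLs and the figures are all assumptions, not a description of how any particular registry stores its data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FieldValue:
    """One value plus the provenance needed to verify it later."""
    value: object
    source: str           # which agency or dataset supplied the value
    url: str              # where it was fetched from
    fetched_at: datetime  # when it was fetched (UTC)


def merge_field(current, incoming):
    """Keep the most recently fetched value; its provenance travels with it."""
    if current is None or incoming.fetched_at > current.fetched_at:
        return incoming
    return current


def build_record(observations):
    """Fold (field_name, FieldValue) pairs into one record with per-field lineage."""
    record = {}
    for name, field_value in observations:
        record[name] = merge_field(record.get(name), field_value)
    return record


# Made-up observations for a single entity.
jan = datetime(2026, 1, 5, tzinfo=timezone.utc)
feb = datetime(2026, 2, 5, tzinfo=timezone.utc)
record = build_record([
    ("revenue", FieldValue(1350, "tax_registry", "https://tax.example.invalid/100001", feb)),
    ("legal_name", FieldValue("UAB Agrochema", "business_registry", "https://reg.example.invalid/100001", jan)),
    ("revenue", FieldValue(1200, "tax_registry", "https://tax.example.invalid/100001", jan)),
])
print(record["revenue"].value, record["revenue"].source, record["revenue"].fetched_at.date())
```

Because each field carries its own source, URL and timestamp, a reader can check any single value, and a drifting source can be traced to exactly the fields it feeds.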

The legal and GDPR line

Aggregating data that a government publishes for public access is broadly legitimate, and courts in the US and EU have repeatedly protected access to genuinely public information. The risk lives at the edges: a specific portal’s terms of service, the EU’s sui generis database right over compiled collections, and — above all — personal data. Entity-level public facts are fair game; national ID numbers, birthdates and home addresses are not, and republishing them can trigger GDPR.

Well-built registries handle this by drawing the boundary deliberately: they surface public-interest, entity-level data while withholding personal identifiers, exactly as fermos.lt does. For the broader legal picture, see our guides on whether web scraping is legal in 2026, and on the closing web and Pay-Per-Crawl.
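In code, drawing the boundary deliberately often comes down to an explicit allowlist applied just before publication. The field names below are hypothetical, and this is a sketch of the mechanism only; deciding which fields belong on the list is a legal judgement made per source, not something code can settle.

```python
# Hypothetical allowlist of entity-level fields cleared for publication.
PUBLIC_FIELDS = {"register_code", "legal_name", "municipality", "subsidy_years"}


def publishable(record, allowed=PUBLIC_FIELDS):
    """Split a record into the publishable part and the names of withheld fields."""
    public = {key: value for key, value in record.items() if key in allowed}
    withheld = sorted(key for key in record if key not in allowed)
    return public, withheld


# Made-up record mixing entity-level and personal fields.
raw = {
    "register_code": "100001",
    "legal_name": "UAB Agrochema",
    "municipality": "Kaunas",
    "personal_code": "REDACTED-IN-EXAMPLE",
    "home_address": "REDACTED-IN-EXAMPLE",
}
public, withheld = publishable(raw)
print(public)
print(withheld)  # ['home_address', 'personal_code']
```

An allowlist fails safe: when a source adds a new field, it stays unpublished until someone reviews it, whereas a blocklist would publish it by default.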

Build checklist: from fragmented sources to a trustworthy dataset

Map every source and the identifier each one uses — pick the one canonical key you’ll reconcile to.

Budget your access: rate limits, regional gates, and a clean residential/mobile IP pool so big crawls don’t get blocked or rate-shaped.

Build entity resolution before you scale — fuzzy match, then verify against the canonical key; log every merge decision.

Schedule refreshes per source and add schema-drift alerts so silent format changes don’t poison the dataset.

Attach lineage to every field: source, URL, fetch timestamp. Provenance is what makes the data trustworthy.

Set the GDPR boundary explicitly: aggregate public/entity data; never republish personal identifiers.

Offer a correction path so subjects can claim and fix their own records.
