B2B intelligence data is dirty. Companies move. Officers come and go. Industry codes get reclassified. Phone numbers go dead. The legacy way of dealing with this — Bisnode, ZoomInfo, Apollo — is to refresh data on a quarterly schedule and let the customer eat the staleness.
Enterprise compliance teams hate this. When they're doing KYB on a German entity, they need to know: did this address come from the 2023 annual filing, or yesterday's web crawl? Was the industry code self-declared or auto-inferred? Is this company still listed in the registry today, or did the data fall behind?
We built per-fact provenance to answer this.
Our company_facts table is the heart of the canonical layer:
CREATE TABLE company_facts (
    canonical_id UUID NOT NULL REFERENCES companies(canonical_id),
    fact_type    TEXT NOT NULL,   -- 'address', 'employees', 'industry', 'website', ...
    value        JSONB NOT NULL,
    source       TEXT NOT NULL,   -- 'brreg', 'companies_house', 'gleif', 'crawled', ...
    confidence   NUMERIC NOT NULL DEFAULT 1.0,
    fetched_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    verified_at  TIMESTAMPTZ,
    PRIMARY KEY (canonical_id, fact_type, source, fetched_at)
);
Every fact has its own source. When a company has both an address from the official register AND an address from their own website, we store both — the customer sees both, scored.
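As a rough illustration of "both, scored", here is a minimal Python sketch that keeps the newest row per (fact_type, source) and orders the candidates by confidence. The function name `rank_facts`, the 0.6 confidence for the crawled source, and the crawled values are assumptions made for the example; the post does not describe how scoring is actually computed.

```python
from collections import defaultdict
from datetime import datetime, timezone


def rank_facts(rows):
    """Keep the newest row per (fact_type, source), then order each
    fact_type's candidates by confidence, highest first."""
    latest = {}
    for row in rows:
        key = (row["fact_type"], row["source"])
        if key not in latest or row["fetched_at"] > latest[key]["fetched_at"]:
            latest[key] = row

    grouped = defaultdict(list)
    for row in latest.values():
        grouped[row["fact_type"]].append(row)
    for candidates in grouped.values():
        candidates.sort(key=lambda r: r["confidence"], reverse=True)
    return dict(grouped)


# Illustrative rows only: the 0.6 confidence and the crawled values are made up.
rows = [
    {"fact_type": "address", "source": "brreg", "confidence": 1.0,
     "fetched_at": datetime(2026, 5, 20, tzinfo=timezone.utc),
     "value": {"city": "STAVANGER", "street": "Forusbeen 50"}},
    {"fact_type": "address", "source": "crawled", "confidence": 0.6,
     "fetched_at": datetime(2026, 5, 18, tzinfo=timezone.utc),
     "value": {"city": "Stavanger"}},
    {"fact_type": "address", "source": "crawled", "confidence": 0.6,
     "fetched_at": datetime(2026, 5, 21, tzinfo=timezone.utc),
     "value": {"city": "Stavanger", "street": "Forusbeen 50"}},
]

ranked = rank_facts(rows)
for candidate in ranked["address"]:
    print(candidate["source"], candidate["confidence"], candidate["value"])
```

Older rows are not deleted here, only skipped when ranking, so the full history stays available for audit.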
GET /v2/companies/8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3
{
  "canonical_id": "8b0a130f-...",
  "name": "EQUINOR ASA",
  "facts": {
    "address":   { "city": "STAVANGER", "street": "Forusbeen 50", ... },
    "employees": { "count": 21327 },
    "industry":  { "nace": "06.100", ... }
  },
  "_provenance": {
    "address":   { "source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z" },
    "employees": { "source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z" },
    "industry":  { "source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z" }
  }
}
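On the consuming side, pairing each fact with its provenance entry takes only a few lines. The sketch below is one possible client-side approach in Python, not an official SDK. It uses a trimmed copy of the response above with the elided fields left out, and the helper name `audit_trail` is hypothetical.

```python
import json

# Trimmed copy of the example response; elided fields are left out.
response_text = """
{
  "canonical_id": "8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3",
  "name": "EQUINOR ASA",
  "facts": {
    "address": {"city": "STAVANGER", "street": "Forusbeen 50"},
    "employees": {"count": 21327},
    "industry": {"nace": "06.100"}
  },
  "_provenance": {
    "address": {"source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z"},
    "employees": {"source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z"},
    "industry": {"source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z"}
  }
}
"""


def audit_trail(company):
    """Return one record per fact, joined with its provenance entry.
    Facts without a provenance entry get None for the provenance fields."""
    facts = company.get("facts", {})
    provenance = company.get("_provenance", {})
    trail = []
    for fact_type in sorted(facts):
        prov = provenance.get(fact_type, {})
        trail.append({
            "fact_type": fact_type,
            "value": facts[fact_type],
            "source": prov.get("source"),
            "confidence": prov.get("confidence"),
            "fetched_at": prov.get("fetched_at"),
        })
    return trail


company = json.loads(response_text)
trail = audit_trail(company)
for entry in trail:
    print(entry["fact_type"], entry["source"], entry["fetched_at"])
```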
Auditors love this. Customers can answer "show me the source for every field you used in the KYB decision" without us building a custom report.
Storage: 83.4M fact rows on 35.8M canonical companies = ~2.3 facts per company on average. Total table size: 23 GB compressed. Acceptable.
Write amplification: Every registry refresh writes new fact rows (we never UPDATE in place — that would lose the audit trail). 24h of brreg sync = ~50K new fact rows. Negligible.
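To make the append-only pattern concrete, here is a small Python sketch that uses an in-memory SQLite database as a stand-in for the real Postgres table. Several things are assumptions for the example: the column types are simplified (TEXT instead of UUID, JSONB, and TIMESTAMPTZ), the confidence and verified_at columns are omitted, and the older employee count of 21000 is invented. Each refresh inserts a new row, and the current value is read by picking the newest `fetched_at` per fact and source.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE company_facts (
        canonical_id TEXT NOT NULL,
        fact_type    TEXT NOT NULL,
        value        TEXT NOT NULL,   -- JSON text here; JSONB in the real table
        source       TEXT NOT NULL,
        fetched_at   TEXT NOT NULL,   -- ISO 8601 UTC string; TIMESTAMPTZ in the real table
        PRIMARY KEY (canonical_id, fact_type, source, fetched_at)
    )
""")


def record_fact(conn, canonical_id, fact_type, value, source, fetched_at):
    """Append a new fact row. Existing rows are never updated."""
    conn.execute(
        "INSERT INTO company_facts VALUES (?, ?, ?, ?, ?)",
        (canonical_id, fact_type, json.dumps(value), source, fetched_at),
    )


# Newest row per (fact_type, source) for one company.
LATEST_SQL = """
    SELECT f.fact_type, f.source, f.value, f.fetched_at
    FROM company_facts f
    WHERE f.canonical_id = ?
      AND f.fetched_at = (
          SELECT MAX(g.fetched_at) FROM company_facts g
          WHERE g.canonical_id = f.canonical_id
            AND g.fact_type = f.fact_type
            AND g.source = f.source
      )
"""

cid = "8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3"
# The first count is invented to show a refresh changing the value.
record_fact(conn, cid, "employees", {"count": 21000}, "brreg", "2026-05-19T16:31:43Z")
record_fact(conn, cid, "employees", {"count": 21327}, "brreg", "2026-05-20T16:31:43Z")

history = conn.execute(
    "SELECT value, fetched_at FROM company_facts WHERE canonical_id = ? ORDER BY fetched_at",
    (cid,),
).fetchall()
latest = conn.execute(LATEST_SQL, (cid,)).fetchall()

print(len(history), "rows in history")
print("current:", json.loads(latest[0][2]), "fetched", latest[0][3])
```

Because `fetched_at` is part of the primary key, a refresh that lands with a new timestamp cannot overwrite an earlier observation.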
Query overhead: The canonical lookup runs two queries, one for the company row and one for the fact set. ~180ms p95 for Equinor (6 facts, all populated). A CTE would collapse this into one query, but the two-query pattern is easier to debug.
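For comparison, the two lookup styles could look roughly like this. This is a Python sketch against an in-memory SQLite stand-in, not the production queries: the `companies` table layout, the simplified column types, and the function names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (
        canonical_id TEXT PRIMARY KEY,
        name         TEXT NOT NULL
    );
    CREATE TABLE company_facts (
        canonical_id TEXT NOT NULL REFERENCES companies(canonical_id),
        fact_type    TEXT NOT NULL,
        value        TEXT NOT NULL,
        source       TEXT NOT NULL,
        fetched_at   TEXT NOT NULL,
        PRIMARY KEY (canonical_id, fact_type, source, fetched_at)
    );
""")

cid = "8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3"
conn.execute("INSERT INTO companies VALUES (?, ?)", (cid, "EQUINOR ASA"))
conn.executemany(
    "INSERT INTO company_facts VALUES (?, ?, ?, ?, ?)",
    [
        (cid, "employees", '{"count": 21327}', "brreg", "2026-05-20T16:31:43Z"),
        (cid, "industry", '{"nace": "06.100"}', "brreg", "2026-05-20T16:31:43Z"),
    ],
)


def lookup_two_queries(conn, canonical_id):
    """Company row first, then the fact set: two round trips, each easy to inspect."""
    company = conn.execute(
        "SELECT name FROM companies WHERE canonical_id = ?", (canonical_id,)
    ).fetchone()
    facts = conn.execute(
        "SELECT fact_type, value, source FROM company_facts "
        "WHERE canonical_id = ? ORDER BY fact_type",
        (canonical_id,),
    ).fetchall()
    return (company[0] if company else None), facts


def lookup_one_query(conn, canonical_id):
    """Same result in one round trip, using a CTE joined to the facts."""
    rows = conn.execute(
        """
        WITH company AS (
            SELECT canonical_id, name FROM companies WHERE canonical_id = ?
        )
        SELECT company.name, f.fact_type, f.value, f.source
        FROM company
        LEFT JOIN company_facts f ON f.canonical_id = company.canonical_id
        ORDER BY f.fact_type
        """,
        (canonical_id,),
    ).fetchall()
    name = rows[0][0] if rows else None
    # A company with no facts yields one row of NULL fact columns; drop it.
    facts = [row[1:] for row in rows if row[1] is not None]
    return name, facts


print(lookup_two_queries(conn, cid) == lookup_one_query(conn, cid))
```

The single-query version repeats the company name on every row and needs the NULL filter for companies without facts, which is part of why the two-query form is simpler to reason about.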
Honest caveat: Not every company has facts populated. Norwegian companies fed from brreg are deep; UK companies fed from Companies House CSV bulk are shallow (no employees, no industry detail). LEI-attested entities (DE, FR, IT) only have name + country + LEI. We're crawling websites to fill the long tail; coverage grows daily.
Three reasons.
We started new. No legacy schema, no sales team to placate. The right architecture by default.
Sandbox key (no signup): sandbox_try_2026
curl -H "X-API-Key: sandbox_try_2026" \
  https://api.nordicdata.cloud/v2/companies/8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3/provenance
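The same call from Python, using only the standard library, might look like the sketch below. The URL, header name, and sandbox key come from the curl command above; the function names are hypothetical, and since the shape of the /provenance response is not documented here, the sketch simply returns whatever JSON comes back.

```python
import json
import urllib.request

API_BASE = "https://api.nordicdata.cloud/v2"
SANDBOX_KEY = "sandbox_try_2026"


def build_provenance_request(canonical_id, api_key=SANDBOX_KEY):
    """Build (but do not send) the GET request for a company's provenance."""
    url = f"{API_BASE}/companies/{canonical_id}/provenance"
    return urllib.request.Request(url, headers={"X-API-Key": api_key})


def fetch_provenance(canonical_id, api_key=SANDBOX_KEY, timeout=10):
    """Send the request and return the parsed JSON body.
    Raises urllib.error.HTTPError on non-2xx responses."""
    request = build_provenance_request(canonical_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_provenance_request("8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3")
print(request.get_method(), request.full_url)
```

Calling `fetch_provenance("8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3")` performs the actual network request.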