Engineering blog · 2026-05-22

Per-fact provenance — and what it cost to actually do this

When customers ask "where does this address come from?", every legacy B2B intelligence provider goes silent. We built a schema that answers it for every field, on every response. Here's how it's wired.

The problem

B2B intelligence data is dirty. Companies move. Officers come and go. Industry codes get reclassified. Phone numbers go dead. The legacy way of dealing with this — Bisnode, ZoomInfo, Apollo — is to refresh data on a quarterly schedule and let the customer eat the staleness.

Enterprise compliance teams hate this. When they're doing KYB on a German entity, they need to know: did this address come from the 2023 annual filing, or yesterday's web crawl? Was the industry code self-declared or auto-inferred? Is this company still listed in the registry today, or did the data fall behind?

We built per-fact provenance to answer this.

The schema

Our company_facts table is the heart of the canonical layer:

CREATE TABLE company_facts (
  canonical_id   UUID NOT NULL REFERENCES companies(canonical_id),
  fact_type      TEXT NOT NULL,            -- 'address', 'employees', 'industry', 'website', ...
  value          JSONB NOT NULL,
  source         TEXT NOT NULL,            -- 'brreg', 'companies_house', 'gleif', 'crawled', ...
  confidence     NUMERIC NOT NULL DEFAULT 1.0,
  fetched_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  verified_at    TIMESTAMPTZ,
  PRIMARY KEY (canonical_id, fact_type, source, fetched_at)
);

Every fact row carries its own source. When a company has an address from the official register and another from its own website, we store both, and the customer sees both, each with its own confidence score.
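
To make the "both, scored" idea concrete, here is a minimal sketch of how a client might order competing facts of the same type. The ordering rule (highest confidence first, most recent fetch as the tie-breaker) is an assumption for illustration, not a description of our ranking, and the 0.7 confidence and the crawled row are made-up sample data.

```python
from datetime import datetime, timezone

# Hypothetical rows shaped like company_facts; values are sample data only.
rows = [
    {"fact_type": "address", "value": {"city": "Stavanger"}, "source": "crawled",
     "confidence": 0.7, "fetched_at": datetime(2026, 5, 21, tzinfo=timezone.utc)},
    {"fact_type": "address", "value": {"city": "STAVANGER"}, "source": "brreg",
     "confidence": 1.0, "fetched_at": datetime(2026, 5, 20, tzinfo=timezone.utc)},
]


def rank_facts(rows):
    """Group facts by fact_type, best first.

    Assumed rule: higher confidence wins, newer fetched_at breaks ties.
    """
    grouped = {}
    for row in rows:
        grouped.setdefault(row["fact_type"], []).append(row)
    for facts in grouped.values():
        facts.sort(key=lambda r: (r["confidence"], r["fetched_at"]), reverse=True)
    return grouped


ranked = rank_facts(rows)
for fact in ranked["address"]:
    print(fact["source"], fact["confidence"], fact["value"])
```

Both rows survive the ranking; nothing is discarded, which is the point of storing one row per source.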

What it looks like in the API

GET /v2/companies/8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3

{
  "canonical_id": "8b0a130f-...",
  "name": "EQUINOR ASA",
  "facts": {
    "address": { "city": "STAVANGER", "street": "Forusbeen 50", ... },
    "employees": { "count": 21327 },
    "industry": { "nace": "06.100", ... }
  },
  "_provenance": {
    "address":  { "source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z" },
    "employees":{ "source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z" },
    "industry": { "source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z" }
  }
}

Auditors love this. Customers can answer "show me the source for every field you used in the KYB decision" without us building a custom report.
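
As a rough illustration of that audit question, the sketch below walks a response shaped like the one above and prints one line per field with its source. The helper name audit_lines is hypothetical, and the sample payload is a trimmed copy of the example response with the elided fields left out.

```python
import json

# Trimmed copy of the example response; elided fields are omitted, not guessed.
company = json.loads("""
{
  "canonical_id": "8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3",
  "name": "EQUINOR ASA",
  "facts": {
    "address": {"city": "STAVANGER", "street": "Forusbeen 50"},
    "employees": {"count": 21327}
  },
  "_provenance": {
    "address":   {"source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z"},
    "employees": {"source": "brreg", "confidence": 1, "fetched_at": "2026-05-20T16:31:43Z"}
  }
}
""")


def audit_lines(company):
    """Return one audit line per fact: field, source, confidence, fetch time."""
    lines = []
    for field in sorted(company["facts"]):
        prov = company["_provenance"].get(field)
        if prov is None:
            # Flag the gap instead of silently skipping the field.
            lines.append(f"{field}: no provenance recorded")
            continue
        lines.append(
            f"{field}: source={prov['source']} "
            f"confidence={prov['confidence']} fetched_at={prov['fetched_at']}"
        )
    return lines


for line in audit_lines(company):
    print(line)
```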

The cost

Storage: 83.4M fact rows on 35.8M canonical companies = ~2.3 facts per company on average. Total table size: 23 GB compressed. Acceptable.
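
The storage figures can be sanity-checked with a few lines of arithmetic. This assumes "23 GB" means decimal gigabytes; the per-row figure is derived here and is not a measured number.

```python
fact_rows = 83.4e6      # fact rows, from the figures above
companies = 35.8e6      # canonical companies, from the figures above
table_bytes = 23e9      # 23 GB compressed, assumed to be decimal gigabytes

facts_per_company = fact_rows / companies
bytes_per_row = table_bytes / fact_rows

print(f"{facts_per_company:.2f} facts per company")
print(f"{bytes_per_row:.0f} compressed bytes per fact row")
```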

Write amplification: Every registry refresh writes new fact rows (we never UPDATE in place — that would lose the audit trail). 24h of brreg sync = ~50K new fact rows. Negligible.
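
A minimal sketch of the append-only pattern, using SQLite as a stand-in for the real database so it runs anywhere. The column types are simplified (TEXT instead of JSONB and TIMESTAMPTZ), the helper record_fact is hypothetical, and the earlier employee count of 21000 is invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE company_facts (
        canonical_id TEXT NOT NULL,
        fact_type    TEXT NOT NULL,
        value        TEXT NOT NULL,   -- JSON text; stand-in for JSONB
        source       TEXT NOT NULL,
        confidence   REAL NOT NULL DEFAULT 1.0,
        fetched_at   TEXT NOT NULL,   -- ISO 8601 UTC; stand-in for TIMESTAMPTZ
        PRIMARY KEY (canonical_id, fact_type, source, fetched_at)
    )
""")


def record_fact(conn, canonical_id, fact_type, value_json, source, fetched_at):
    """Append a fact row. Existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO company_facts (canonical_id, fact_type, value, source, fetched_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (canonical_id, fact_type, value_json, source, fetched_at),
    )


CID = "8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3"
record_fact(conn, CID, "employees", '{"count": 21000}', "brreg", "2026-05-19T16:31:43Z")
record_fact(conn, CID, "employees", '{"count": 21327}', "brreg", "2026-05-20T16:31:43Z")

# Latest row per (fact_type, source); older rows remain as the audit trail.
latest = conn.execute("""
    SELECT f.fact_type, f.source, f.value, f.fetched_at
    FROM company_facts f
    WHERE f.canonical_id = ?
      AND f.fetched_at = (
          SELECT MAX(g.fetched_at) FROM company_facts g
          WHERE g.canonical_id = f.canonical_id
            AND g.fact_type = f.fact_type
            AND g.source = f.source)
""", (CID,)).fetchall()

history_count = conn.execute("SELECT COUNT(*) FROM company_facts").fetchone()[0]
print(latest, history_count)
```

Because fetched_at is part of the primary key, a refresh can never overwrite an earlier observation; it can only add a newer one.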

Query overhead: The canonical lookup runs two queries: one for the company row, one for the fact set. That comes to ~180ms p95 for Equinor (6 facts, all populated). A CTE or a join would collapse it into one query, but the two-query pattern is easier to debug.
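
For comparison, here is a sketch of both lookup shapes against a tiny SQLite stand-in. The table layout is simplified, the id 'c1' is a placeholder, and the function names are hypothetical; the SQL in production may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (canonical_id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE company_facts (
        canonical_id TEXT NOT NULL REFERENCES companies(canonical_id),
        fact_type TEXT NOT NULL, value TEXT NOT NULL, source TEXT NOT NULL,
        fetched_at TEXT NOT NULL,
        PRIMARY KEY (canonical_id, fact_type, source, fetched_at)
    );
    INSERT INTO companies VALUES ('c1', 'EQUINOR ASA');
    INSERT INTO company_facts VALUES
        ('c1', 'address',   '{"city": "STAVANGER"}', 'brreg', '2026-05-20T16:31:43Z'),
        ('c1', 'employees', '{"count": 21327}',      'brreg', '2026-05-20T16:31:43Z');
""")


def lookup_two_queries(conn, cid):
    """Two round trips: each result can be inspected on its own."""
    company = conn.execute(
        "SELECT canonical_id, name FROM companies WHERE canonical_id = ?", (cid,)
    ).fetchone()
    facts = conn.execute(
        "SELECT fact_type, value, source FROM company_facts "
        "WHERE canonical_id = ? ORDER BY fact_type", (cid,)
    ).fetchall()
    return company, facts


def lookup_one_query(conn, cid):
    """One round trip: a CTE selects the company, a LEFT JOIN attaches its facts."""
    rows = conn.execute("""
        WITH company AS (
            SELECT canonical_id, name FROM companies WHERE canonical_id = ?
        )
        SELECT c.canonical_id, c.name, f.fact_type, f.value, f.source
        FROM company c
        LEFT JOIN company_facts f ON f.canonical_id = c.canonical_id
        ORDER BY f.fact_type
    """, (cid,)).fetchall()
    if not rows:
        return None, []
    # Company columns repeat on every row; a NULL fact_type means "no facts".
    return rows[0][:2], [r[2:] for r in rows if r[2] is not None]


print(lookup_two_queries(conn, "c1") == lookup_one_query(conn, "c1"))
```

The single-query version has to unpack repeated company columns and handle the no-facts case, which is part of why the two-query form is simpler to reason about.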

Honest caveat: Not every company has facts populated. Norwegian companies fed from brreg are deep; UK companies fed from Companies House CSV bulk are shallow (no employees, no industry detail). LEI-attested entities (DE, FR, IT) only have name + country + LEI. We're crawling websites to fill the long tail; coverage grows daily.

Why no other provider does this

Three reasons.

  1. Schema rewrite cost. Bisnode/Apollo/ZoomInfo have flat schemas (one row per company, one column per field). Adding provenance means a one-to-many rewrite (one company row becomes many fact rows) plus updating every report-generation pipeline. Decade-old codebases don't survive that surgery.
  2. Sales positioning. The legacy vendors pitch "trust us, we cleaned the data." Surfacing per-fact source breaks that frame — it admits some facts are weaker than others. Sales teams resist.
  3. It exposes data freshness. If your CSV bulk is 18 months old, customers will see it. Vendors prefer to hide that.
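
To show what the schema rewrite in point 1 involves, here is a sketch that unpivots one flat company row into fact rows. The flat column names and the 'legacy_export' source label are hypothetical; no vendor's actual schema is being described.

```python
def unpivot(flat_row, source, fetched_at):
    """Turn one flat company row into one fact row per non-empty field."""
    cid = flat_row["canonical_id"]
    return [
        {"canonical_id": cid, "fact_type": field, "value": value,
         "source": source, "fetched_at": fetched_at}
        for field, value in flat_row.items()
        if field != "canonical_id" and value is not None
    ]


# Hypothetical flat row: one column per field, no per-field source or timestamp.
flat = {"canonical_id": "c1", "address": "Forusbeen 50", "employees": 21327, "website": None}

fact_rows = unpivot(flat, source="legacy_export", fetched_at="2026-05-20T16:31:43Z")
for row in fact_rows:
    print(row)
```

The mechanical part is short. The hard part is that every migrated fact gets the same source and timestamp, because a flat schema never recorded them per field.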

We started from scratch. No legacy schema, no sales team to placate. The right architecture by default.

Try it

Sandbox key (no signup): sandbox_try_2026

curl -H "X-API-Key: sandbox_try_2026" \
  https://api.nordicdata.cloud/v2/companies/8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3/provenance
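
The same call from Python, using only the standard library. The function names are hypothetical, and the shape of the /provenance response body is not documented in this post, so the sketch just returns the parsed JSON as-is.

```python
import json
import urllib.request

BASE_URL = "https://api.nordicdata.cloud/v2"
SANDBOX_KEY = "sandbox_try_2026"


def build_request(canonical_id, api_key=SANDBOX_KEY):
    """Build the GET request for a company's provenance without sending it."""
    url = f"{BASE_URL}/companies/{canonical_id}/provenance"
    return urllib.request.Request(url, headers={"X-API-Key": api_key})


def fetch_provenance(canonical_id, api_key=SANDBOX_KEY, timeout=10):
    """Send the request and return the decoded JSON body."""
    request = build_request(canonical_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a network call):
# print(fetch_provenance("8b0a130f-e6e9-4ed6-97f3-8ebbe06dfac3"))
```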

Full docs: /docs · Sandbox playground: /sandbox