— AI engineering · anti-hallucination

Stop your LLM from inventing the CFO

We give Claude Sonnet 4.5 the web_search_20250305 server tool and ask it to return the current executive team for any of about 1.2 million Norwegian companies. For the first three weeks of production traffic we were inventing about one wrong executive per twenty calls. After the rules below the error rate is under 2% on flagged contacts and the model now produces empty output where it used to produce confident fabrications.

This is not a paper. It is a list of nine concrete failure modes, in the order we found them, with the exact prompt fragment or JS filter that fixed each one. If you are building a real product on top of LLM web search, you will hit some of these. We hope this saves you a few weeks.

The setup

Single call, system prompt plus one user message. Tools are { type: "web_search_20250305", name: "web_search" }. We ask for JSON with a single key, named_contacts. The model is allowed to run as many searches as it needs and returns a final structured answer.
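
Concretely, the request is a single Anthropic Messages API call. A minimal sketch of how we assemble it — the model alias, max_tokens value, and user-message wording here are illustrative, not the production values:

```typescript
// Sketch: build the request body for one enrichment call.
// Passed as-is to client.messages.create(...) in the Anthropic SDK.
function buildEnrichmentRequest(companyName: string, systemPrompt: string) {
  return {
    model: "claude-sonnet-4-5",   // assumed model alias
    max_tokens: 1024,
    system: systemPrompt,
    // The server tool from the setup above; the model may call it repeatedly.
    tools: [{ type: "web_search_20250305" as const, name: "web_search" }],
    messages: [
      {
        role: "user" as const,
        content: `Find the current executive team for: ${companyName}`,
      },
    ],
  };
}
```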

Every failure mode below is something we observed in real responses. The rules are listed in the order they appear in the live system prompt.

1. Current roles only

Failure mode. The model finds a news article from 2019 announcing the appointment of a new CFO, and includes that person — who left two years later. Past-tense indicators in the source ("former", "previously", "until 2023", "ex-") do not get penalised unless we ask.

Fix. An absolute rule with quoted indicator phrases.

CURRENT ROLES ONLY. Reject anyone whose source uses past tense.
Past-tense indicators include: "formerly", "previously", "until 2023",
"ex-CFO", "served as", "was the CEO", "resigned", "stepped down".

2. One person per unique executive role

Failure mode. Two sources list two different CFOs. The model includes both, and the downstream UI displays "CFO: Trond Olaf Christophersen" next to "CFO: Eivind Reiten". One of them is correct, one is six years out of date. The user assumes our data is broken.

Fix. Tell the prompt the role is exclusive, then dedup in JavaScript with a deterministic preference rule (most-recent citation wins) rather than asking the model to "pick one".

interface Contact {
  name: string;
  role: string;
  cited_year?: number;
}

const exclusiveRoles = ["ceo", "cfo", "cto", "coo", "chair"];

const normalise = (role: string) => role.trim().toLowerCase();
const isExclusive = (role: string) => exclusiveRoles.includes(normalise(role));

function dedupExclusive(contacts: Contact[]) {
  const seen = new Set<string>();
  // Copy before sorting: Array.prototype.sort mutates in place.
  return [...contacts]
    .sort((a, b) => (b.cited_year ?? 0) - (a.cited_year ?? 0))
    .filter(c => {
      if (!isExclusive(c.role)) return true;
      const key = normalise(c.role);
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    });
}

Lesson

Our first version of this rule dropped both CFOs when they disagreed, on the theory that ambiguity is worse than absence. It cost us several real, currently-employed CFOs whose predecessors were still cited online. Prefer the first-cited or most-recently-cited candidate over dropping the role entirely. Empty output is not the safest answer when one of the two candidates is correct.

3. Full names only

Failure mode. The model finds "Andre, our CEO, said..." in a blog post and returns "Andre" as the CEO. Two days later a sales rep emails "Andre" at the company asking about a meeting. There are six Andres there.

Fix. A length and token-count rule, enforced both in the prompt and in JS after the call.

FULL NAMES ONLY. Minimum two whitespace-separated parts.
Reject any name shorter than 5 characters total.
Reject any name beginning with a salutation
("Mr.", "Ms.", "Dr.", "Hr.", "Fru").
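
The JS side of the enforcement is small. A sketch of the validator, with the same thresholds as the prompt (the function name is ours):

```typescript
// Hypothetical mirror of the prompt rule: two name parts minimum,
// five characters minimum, no leading salutation.
const salutations = ["mr.", "ms.", "dr.", "hr.", "fru"];

function isValidFullName(name: string): boolean {
  const trimmed = name.trim();
  if (trimmed.length < 5) return false;
  const parts = trimmed.split(/\s+/);
  if (parts.length < 2) return false;
  return !salutations.includes(parts[0].toLowerCase());
}
```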

4. Concise role titles

Failure mode. The model copies the exact title from an annual report: "Chief Executive Officer and President of Operations and Member of the Group Executive Committee". The UI element is 200 pixels wide. The title overflows. The whole row looks broken.

Fix. Constrain in the prompt. We accept that this trades a small amount of information for a much cleaner output.

ROLE TITLES MUST BE CONCISE. Map any of these patterns:
- "Chief Executive Officer" → "CEO"
- "Chief Financial Officer" → "CFO"
- "Konsernsjef" → "CEO"
- "Finansdirektør" → "CFO"
Always strip "and Member of...", "and President of..." suffixes.
Maximum 30 characters in the final role string.
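
We also normalise defensively after the model responds, in case it ignores the rule. A sketch with the same mapping table and 30-character cap as the prompt — the suffix-stripping regex is our illustration:

```typescript
// Hypothetical post-normaliser: map verbose titles to short forms,
// strip "and Member of..." style suffixes, cap at 30 characters.
const roleMap: Record<string, string> = {
  "chief executive officer": "CEO",
  "chief financial officer": "CFO",
  "konsernsjef": "CEO",
  "finansdirektør": "CFO",
};

function normaliseRole(raw: string): string {
  const stripped = raw
    .replace(/\s+and\s+(member|president)\s+of\b.*$/i, "")
    .trim();
  const mapped = roleMap[stripped.toLowerCase()] ?? stripped;
  return mapped.slice(0, 30);
}
```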

5. Never guess; empty is correct

Failure mode. The most insidious one. Web search returns nothing useful for "CTO at small consulting firm AS". The model, helpfully, generates a plausible name based on the company's industry and country. The name is convincing. It is also entirely fictional.

Fix. An explicit instruction that absence is the desired output when uncertain. We pair it with a few-shot example showing empty JSON.

NEVER GUESS. If web search returns nothing for a specific role,
omit that role from the output. Empty named_contacts array is
the correct answer when no verified information is available.

EXAMPLE (NO DATA FOUND):
{
  "named_contacts": []
}

6. Literal verification

Failure mode. The model conflates two sources. Source A says "Lars Petersen is CFO at Acme AS". Source B says "Lars Andersen is CTO at Beta AS". The model returns "Lars Petersen, CTO at Acme AS". The name is half-correct, the role is wrong, the result is unusable and looks correct enough to be dangerous.

Fix. Force the model to internally verify that the exact name plus role string literally appears together in a single source. This is a soft constraint — there is no programmatic check — but the prompt fragment alone meaningfully reduced cross-source conflation in our eval set.

VERIFY EVERY CONTACT LITERALLY APPEARS IN ONE SEARCH RESULT.
For each contact you include, you must be able to quote the
surrounding ~10 words from a single source where the name and
role appear together. If you cannot, do not include the contact.

7. Reject single-name LinkedIn-style results

Failure mode. A surprising number of search results are LinkedIn snippets cut off mid-sentence: "Anna B... CEO at Acme AS, Oslo...". The model treats this as a name.

Fix. JS post-filter for names ending in a single capital plus period or containing internal ellipses.

function looksLikeTruncatedSnippet(name: string) {
  return /\b[A-ZÆØÅ]\.\s*$/.test(name) ||
         /\.\.\./.test(name) ||
         name.split(/\s+/).length < 2;
}

8. Domain trust scoring

Failure mode. Random blog posts from 2014, archived employee directories from defunct subsidiaries, and aggressive directory sites all rank well for "[company] [role]" queries. The model treats them as authoritative.

Fix. Domain weighting at parse time. The company's own domain wins over everything; then official Norwegian sources (regjeringen.no, brreg.no, dn.no, e24.no, kommune sites); then mainstream press; then everything else.

const domainTrust = {
  ownDomain: 10,           // e.g. equinor.com for Equinor
  primaryNor: 8,           // brreg.no, regjeringen.no
  norwegianPress: 6,       // dn.no, e24.no, finansavisen.no
  internationalPress: 4,
  directorySites: 1,
};

// Drop contacts whose only citation has trust <= 1
contacts = contacts.filter(c => bestCitationTrust(c) > 1);
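
Behind bestCitationTrust sits a hostname classifier. A hypothetical implementation — the Norwegian domains come from the list above, while the international-press examples and the exact matching logic are our illustrations:

```typescript
// Hypothetical hostname classifier feeding bestCitationTrust.
// Matches exact hosts and subdomains (e.g. "www.dn.no" under "dn.no").
function hostMatches(host: string, domains: string[]): boolean {
  return domains.some(d => host === d || host.endsWith("." + d));
}

function citationTrust(host: string, ownDomain: string): number {
  if (host === ownDomain || host.endsWith("." + ownDomain)) return 10;
  if (hostMatches(host, ["brreg.no", "regjeringen.no"])) return 8;
  if (hostMatches(host, ["dn.no", "e24.no", "finansavisen.no"])) return 6;
  if (hostMatches(host, ["reuters.com", "ft.com"])) return 4; // illustrative
  return 1; // directories and unknown hosts share the lowest tier
}
```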

9. Norwegian linguistic edge cases

Failure mode. The model returns Latin transliterations of names containing Æ, Ø, Å. "Børre Larsen" becomes "Borre Larsen". The address book does not match the registry. Email permutation fails.

Fix. Pin original orthography in the prompt and validate on the way out.

PRESERVE ORIGINAL ORTHOGRAPHY. Norwegian names containing
Æ, Ø, Å must be returned with those exact characters.
Do not transliterate. Do not anglicise.
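
When a registry spelling is available, the validation can go further than rejecting: if the model's name is a naive ASCII flattening of the registry name, restore the registry form. A sketch — the function name and the transliteration table are illustrative:

```typescript
// Hypothetical guard: restore Æ/Ø/Å from the registry spelling when
// the model returned an ASCII transliteration of the same name.
const translit: Record<string, string> = { "æ": "ae", "ø": "o", "å": "a" };

const flatten = (s: string) =>
  s.toLowerCase().replace(/[æøå]/g, ch => translit[ch]);

function restoreOrthography(modelName: string, registryName: string): string {
  if (/[æøåÆØÅ]/.test(registryName) &&
      flatten(modelName) === flatten(registryName)) {
    return registryName;
  }
  return modelName;
}
```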

What we measure

Every contact returned by Layer 4 carries a confidence field set by the JS post-processor based on how many rules it passed. Below 0.6 we suppress display entirely; the row still goes into cache for analytics but the user sees an empty contact list, not a low-quality guess. We also keep a feedback button on every profile that lets users flag wrong contacts, which feeds into the next training cycle.
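
A minimal sketch of how such a confidence score can be derived — the check shape and equal weighting are our assumptions; only the 0.6 display threshold comes from the text above:

```typescript
// Hypothetical scoring: confidence = fraction of post-checks passed.
type Check = (c: { name: string; role: string }) => boolean;

function confidence(
  contact: { name: string; role: string },
  checks: Check[],
): number {
  if (checks.length === 0) return 0;
  const passed = checks.filter(check => check(contact)).length;
  return passed / checks.length;
}

const DISPLAY_THRESHOLD = 0.6; // below this, the UI shows an empty list
```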

Metric                     Before rules   After rules
Wrong-name rate (flagged)  4.8%           1.9%
Past-role rate (flagged)   3.1%           0.4%
Single-name results        2.7%           0%
Empty (correct empty)      12%            23%
Empty (incorrect empty)    2%             2.4%

The last two rows are the cost of being conservative. We now correctly say "we do not know" for about 11 extra percentage points of companies, at the price of a 0.4-point rise in incorrect empties, but in exchange the error rate on contacts we do return is more than halved. For a sales workflow, this is the right trade.

The 30-line system prompt

Putting it all together, here is roughly what ships to production today. It is short on purpose; longer prompts did not measurably help and made debugging harder.

You are a contact-enrichment agent for a Norwegian B2B data API.

You will receive one company. Use web search to find CURRENT,
NAMED executive contacts and return JSON.

RULES:
1. CURRENT ROLES ONLY. Reject "formerly", "previously", "ex-".
2. ONE PERSON PER UNIQUE EXECUTIVE ROLE.
3. FULL NAMES ONLY. Minimum two parts, 5+ characters.
4. ROLE TITLES CONCISE. CEO not "Chief Executive Officer and...".
5. NEVER GUESS. Empty is correct when no data is found.
6. VERIFY contacts literally appear in one search result.
7. REJECT truncated names ending in "X.".
8. PREFER own-domain and Norwegian press over directories.
9. PRESERVE Æ, Ø, Å. Do not transliterate.

Output exactly: { "named_contacts": [{ "name", "role",
"email?", "phone?", "source_url?" }] }
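
On the consuming side, the same schema as a TypeScript type plus a minimal shape guard — the guard is our sketch, not the production validator:

```typescript
// TypeScript mirror of the output schema in the prompt above.
interface NamedContact {
  name: string;
  role: string;
  email?: string;
  phone?: string;
  source_url?: string;
}

interface EnrichmentResult {
  named_contacts: NamedContact[];
}

// Sketch: reject anything that is not JSON with a named_contacts array.
function parseResult(json: string): EnrichmentResult | null {
  try {
    const parsed = JSON.parse(json);
    if (!parsed || !Array.isArray(parsed.named_contacts)) return null;
    return parsed as EnrichmentResult;
  } catch {
    return null;
  }
}
```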

The full implementation including the dedup logic and the trust scoring lives behind the /companies/:orgnr/contact endpoint. The previous post covers the four-layer architecture this fits inside.

Type any Norwegian company

The enrichment runs live. You will see the four layers fire in sequence.

Open the live lookup →