[nevrai]

Semantic Entity Resolution Without LLM: When Cosine > 0.92 Is Enough

Knowledge graphs built from unstructured text have a persistent problem: the same real-world entity shows up under different names. “Customer churn” and “user attrition.” “Margin” and “gross margin.” “Ivanov I.I.” and “Ivan Ivanov.” Same concept, different text.

Entity resolution — figuring out whether two mentions refer to the same thing — typically means expensive LLM verification. With 1,000 nodes, that’s 499,500 pairs to check. Even with cheap models, a full deduplication cycle costs tens of dollars.

The fix: handle 95% of cases without any LLM calls.

Why Levenshtein Fails

Edit distance handles typos and morphological variation fine. It breaks completely on synonyms and paraphrases: “customer churn” and “user attrition” are semantically identical yet nearly maximally distant in edit space. That matters because in real research data, synonym pairs, not typos, are the primary source of duplicates.
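To make the failure concrete, here is a minimal edit-distance implementation (classic dynamic programming; the function is a self-contained sketch, not taken from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# A one-letter typo is trivially close:
print(levenshtein("customer churn", "custmer churn"))   # 1
# The synonym pair looks almost entirely different:
print(levenshtein("customer churn", "user attrition"))  # large, despite identical meaning
```

The metric rewards shared characters, not shared meaning, which is exactly backwards for synonym-heavy data.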

Embeddings + Cosine Similarity

Text embeddings map phrases into high-dimensional vectors where semantically similar terms cluster nearby. Cosine similarity measures the angle between vectors — exactly the semantic proximity metric we need.

“Customer churn” vs “user attrition”: similarity 0.94. “Customer churn” vs “pricing strategy”: similarity 0.31.
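The metric itself is simple. Below is a sketch with toy three-dimensional vectors standing in for real embeddings (real ones come from a model such as bge-m3 and have hundreds of dimensions; the vectors and resulting scores here are illustrative, not the 0.94/0.31 figures above):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy vectors: the first two point in roughly the same direction,
# the third points elsewhere.
churn     = [0.9, 0.4, 0.1]
attrition = [0.8, 0.5, 0.2]
pricing   = [0.1, 0.2, 0.9]

print(cosine_similarity(churn, attrition))  # high: similar direction
print(cosine_similarity(churn, pricing))    # low: unrelated direction
```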

The Threshold Architecture

Analysis of 2,000 real-world entity pairs established two cutoffs, which split pairs into three zones:

  • Cosine ≥ 0.92 — auto-merge, no confirmation (precision: 0.97, recall: 0.89)
  • 0.80 ≤ cosine < 0.92 — gray zone, send to LLM for review
  • Cosine < 0.80 — distinct entities, skip entirely

At 0.92, only 3 out of 100 merges prove wrong, while capturing nearly 9 in 10 genuine duplicates.
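The two cutoffs reduce to a small routing function (a sketch; the names and return values are illustrative, not production code):

```python
AUTO_MERGE = 0.92  # precision 0.97 / recall 0.89 on the 2,000-pair sample
GRAY_ZONE = 0.80   # below this, pairs are treated as distinct

def route(similarity: float) -> str:
    """Map a cosine score to one of the three resolution actions."""
    if similarity >= AUTO_MERGE:
        return "auto_merge"   # merge without confirmation
    if similarity >= GRAY_ZONE:
        return "llm_review"   # ambiguous: the only place tokens are spent
    return "distinct"         # skip entirely

print(route(0.94))  # auto_merge  ("customer churn" vs "user attrition")
print(route(0.85))  # llm_review
print(route(0.31))  # distinct    ("customer churn" vs "pricing strategy")
```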

Three-Level Pipeline

Sequential filtering, cheapest first:

  1. Exact matching via canonical keys (normalized strings) — catches ~40% of duplicates at zero cost
  2. Cosine similarity via embeddings — resolves ~45% more, no LLM
  3. LLM confirmation only for the 0.80–0.92 range — minimal token spend on genuinely ambiguous cases

A Levenshtein layer (distance ≤ 2) sits between levels one and two to catch typographical variations.
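Putting the levels together, a minimal sketch of the pipeline might look like the following (the `embed` and `ask_llm` callables are hypothetical injected dependencies standing in for a real embedding model and an LLM client; everything else is self-contained):

```python
import re
from math import sqrt

def canonical_key(name: str) -> str:
    """Level 1 key: lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", name.lower())).strip()

def levenshtein(a: str, b: str) -> int:
    """Level 1.5: DP edit distance, used to catch typographical variants."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def resolve(name, existing, embed, ask_llm, auto=0.92, gray=0.80):
    """Return the existing entity `name` duplicates, or None if it is new.

    Cheapest check first; fall through to the next level only on a miss."""
    key = canonical_key(name)
    # Levels 1 and 1.5: exact canonical key, then edit distance <= 2.
    for cand in existing:
        ckey = canonical_key(cand)
        if ckey == key or levenshtein(key, ckey) <= 2:
            return cand
    # Level 2: cosine over embeddings; keep the best-scoring candidate.
    v = embed(name)
    best = max(existing, key=lambda c: cosine(v, embed(c)), default=None)
    if best is None:
        return None
    sim = cosine(v, embed(best))
    if sim >= auto:
        return best                        # auto-merge, no LLM
    if sim >= gray and ask_llm(name, best):
        return best                        # Level 3: LLM-confirmed merge
    return None                            # genuinely new entity
```

In practice `embed` would batch and cache calls to the embedding model, and the Level 2 scan would go through an approximate nearest neighbor index rather than a linear loop.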

Real Production Examples

Multilingual pairs work too: “unit economics” matches “юнит-экономика” at 0.93 similarity, and “MVP launch” aligns with “запуск MVP” — provided the embedding model is multilingual, e.g. BAAI’s bge-m3.

Economic Impact

With 500 existing nodes and 20 new entities arriving daily, naive pairwise verification means 20 × 500 = 10,000 comparisons per day:

  Approach               Daily LLM calls
  Pure LLM               10,000
  Three-level resolver   1,500
  Savings                85%

That’s a conservative estimate. Domains with established terminology push savings toward 92%.

Practical Notes

  • Track embedding versions — regenerate when models update
  • Short strings (< 4 characters) need a higher threshold (0.96 instead of 0.92); they’re unreliable at the standard cutoff
  • Add node type context to embeddings to distinguish homonyms
  • Use approximate nearest neighbor indexing (e.g. HNSW) to avoid O(n) brute-force scans on every lookup
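Two of those notes are easy to show in code. The helper names and the exact prefix format below are illustrative assumptions, not a prescribed convention:

```python
def embedding_text(name: str, node_type: str) -> str:
    """Prefix the node type so homonyms embed apart: 'margin' as a
    financial metric should not collide with 'margin' as page layout."""
    return f"{node_type}: {name}"

def threshold_for(name: str, base: float = 0.92, short: float = 0.96) -> float:
    """Short strings are noisy in embedding space; demand a higher score."""
    return short if len(name.strip()) < 4 else base

print(embedding_text("margin", "financial_metric"))  # financial_metric: margin
print(threshold_for("MVP"))             # 0.96 -- short string, stricter cutoff
print(threshold_for("customer churn"))  # 0.92 -- standard cutoff
```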

Entity resolution doesn’t have to be a choice between “cheap and bad” and “expensive and good.” Three levels deliver LLM-quality results at an 80–90% cost reduction by calling the model only when you genuinely can’t avoid it. The best LLM architecture is one where the LLM is invoked as rarely as possible.