Semantic Entity Resolution Without an LLM: When Cosine > 0.92 Is Enough
Knowledge graphs built from unstructured text have a persistent problem: the same real-world entity shows up under different names. “Customer churn” and “user attrition.” “Margin” and “gross margin.” “Ivanov I.I.” and “Ivan Ivanov.” Same concept, different text.
Entity resolution — figuring out whether two mentions refer to the same thing — typically means expensive LLM verification. With 1,000 nodes, that’s 499,500 pairs to check. Even with cheap models, a full deduplication cycle costs tens of dollars.
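The 499,500 figure is just the number of unordered pairs among 1,000 nodes, and it grows quadratically:

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n nodes: n choose 2."""
    return n * (n - 1) // 2

print(pair_count(1_000))   # 499,500 pairs for 1,000 nodes
print(pair_count(10_000))  # 100x the pairs for 10x the nodes
```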
The fix: handle 95% of cases without any LLM calls.
Why Levenshtein Fails
Edit distance catches typos and morphological variations fine. It breaks down completely on synonyms and paraphrases: “Customer churn” and “user attrition” share almost no characters despite being semantically identical. And in real research data, synonym pairs, not typos, are the primary source of duplicates.
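A quick demonstration of the failure mode, using a plain dynamic-programming edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[-1] + 1,                 # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein("gross margin", "gros margin"))       # 1: a typo, caught
print(levenshtein("customer churn", "user attrition"))  # large: a synonym, missed
```

The typo scores 1 and is trivially caught; the synonym pair scores far above any sane typo threshold even though the two phrases mean the same thing.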
Embeddings + Cosine Similarity
Text embeddings map phrases into high-dimensional vectors where semantically similar terms cluster nearby. Cosine similarity measures the angle between vectors — exactly the semantic proximity metric we need.
“Customer churn” vs “user attrition”: similarity 0.94. “Customer churn” vs “pricing strategy”: similarity 0.31.
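The metric itself is simple. A minimal sketch with toy 3-d vectors standing in for real embedding output (actual models such as bge-m3 produce roughly 1,024-dimensional vectors):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(cosine([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))  # close to 1: similar direction
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0: orthogonal, unrelated
```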
The Threshold Architecture
Analysis of 2,000 real-world entity pairs established two key cutoffs:
- Cosine ≥ 0.92 — auto-merge, no confirmation (precision: 0.97, recall: 0.89)
- 0.80 ≤ cosine < 0.92 — gray zone, send to LLM for review
- Cosine < 0.80 — distinct entities, skip entirely
At 0.92, only 3 out of 100 merges prove wrong, while capturing nearly 9 in 10 genuine duplicates.
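The routing logic reduces to a three-way decision on the cosine score:

```python
AUTO_MERGE = 0.92
GRAY_ZONE = 0.80

def route(similarity: float) -> str:
    """Map a cosine score to one of the three resolution actions."""
    if similarity >= AUTO_MERGE:
        return "auto_merge"   # precision 0.97 at this cutoff
    if similarity >= GRAY_ZONE:
        return "llm_review"   # ambiguous: the only place tokens are spent
    return "distinct"         # skip entirely

print(route(0.94))  # auto_merge
print(route(0.85))  # llm_review
print(route(0.31))  # distinct
```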
Three-Level Pipeline
Sequential filtering, cheapest first:
- Exact matching via canonical keys (normalized strings) — catches ~40% of duplicates at zero cost
- Cosine similarity via embeddings — resolves ~45% more, no LLM
- LLM confirmation only for the 0.80–0.92 range — minimal token spend on genuinely ambiguous cases
A Levenshtein layer (distance ≤ 2) sits between levels one and two to catch typographical variations.
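The whole cascade can be sketched in one self-contained function. `embed` and `llm_confirm` are caller-supplied stand-ins for a real embedding model and a real LLM call; the toy dictionary at the bottom is invented for illustration:

```python
import math
import re

def canonical_key(text: str) -> str:
    """Level 1: normalized key -- lowercase, strip punctuation, collapse spaces."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()

def levenshtein(a: str, b: str) -> int:
    """Typo layer between levels 1 and 2."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def resolve(a: str, b: str, embed, llm_confirm) -> bool:
    """Run the levels cheapest-first; return True if a and b are the same entity."""
    if canonical_key(a) == canonical_key(b):      # level 1: zero cost
        return True
    if levenshtein(a.lower(), b.lower()) <= 2:    # typo layer
        return True
    sim = cosine(embed(a), embed(b))              # level 2: embeddings, no LLM
    if sim >= 0.92:
        return True
    if sim < 0.80:
        return False
    return llm_confirm(a, b)                      # level 3: gray zone only

# Toy embeddings for demonstration; real vectors come from an embedding model.
toy = {"customer churn": [0.9, 0.1], "user attrition": [0.88, 0.15],
       "pricing strategy": [0.1, 0.9]}
print(resolve("Customer Churn", "customer churn", toy.get, lambda a, b: False))
```

Note that the LLM callback is never reached for the clear-cut cases: exact matches, typos, high-similarity synonyms, and low-similarity non-matches all short-circuit earlier.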
Real Production Examples
Multilingual pairs work too, provided you use a multilingual embedding model such as BAAI’s bge-m3: “unit economics” matches “юнит-экономика” (the Russian rendering of the same term) at 0.93 similarity, and “MVP launch” aligns with “запуск MVP”.
Economic Impact
500 nodes, 20 new entities daily. Each new entity must be compared against every existing node, so a pure-LLM approach faces 20 × 500 = 10,000 pairwise checks per day:
| Approach | Daily calls |
|---|---|
| Pure LLM | 10,000 |
| Three-level resolver | 1,500 |
| Savings | 85% |
This is a conservative estimate; with established, stable terminology the savings approach 92%.
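The arithmetic behind the table, under the source’s assumptions:

```python
nodes, new_daily = 500, 20

pure_llm_calls = new_daily * nodes         # every candidate pair goes to the LLM
resolver_calls = 1_500                     # residual gray-zone calls from the table
savings = 1 - resolver_calls / pure_llm_calls

print(pure_llm_calls, f"{savings:.0%}")    # 10000 85%
```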
Practical Notes
- Track embedding versions — regenerate when models update
- Short strings (< 4 characters) need a higher threshold (0.96 instead of 0.92); they’re unreliable at the standard cutoff
- Add node type context to embeddings to distinguish homonyms
- Use approximate nearest neighbor indexing for sublinear candidate lookup instead of a brute-force O(n) scan per new entity
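The first and third notes can be made concrete with two small helpers; the function names and the type-prefix format are illustrative, not a prescribed API:

```python
def merge_threshold(a: str, b: str) -> float:
    """Short strings are noisy in embedding space; demand a stricter cutoff."""
    if min(len(a), len(b)) < 4:
        return 0.96
    return 0.92

def embed_input(text: str, node_type: str) -> str:
    """Prepend the node type so homonyms ('margin' the financial metric vs.
    'margin' the layout property) land in different regions of embedding space."""
    return f"{node_type}: {text}"

print(merge_threshold("KPI", "OKR"))              # 0.96 -- short-string regime
print(embed_input("margin", "financial_metric"))  # financial_metric: margin
```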
Entity resolution doesn’t have to be a choice between “cheap and bad” and “expensive and good.” The three-level approach delivers LLM-quality results at an 80–90% cost reduction by calling the model only when you genuinely can’t avoid it. The best LLM architecture is the one where the LLM is invoked as rarely as possible.