[nevrai]

PostgreSQL as a Graph Database: CTE + pgvector + pg_trgm = No Neo4j Needed

“Graphs require a graph database” — you hear this constantly. It’s mostly wrong.

I built a production knowledge graph on PostgreSQL. No Neo4j. Graph databases solve one specific problem: traversing connections at arbitrary depth. For knowledge graphs supporting product analytics at typical query depths of 2-3 levels, PostgreSQL performs comparably.

PostgreSQL handles approximately 95% of graph use cases for systems up to 100K nodes with traversal depths of 4 or less.

The Stack: Four Tools

| Task | Tool | What It Gets You |
|------|------|------------------|
| Semantic search | pgvector | Cosine similarity, HNSW indexes |
| Fuzzy matching | pg_trgm | Trigram similarity (Levenshtein lives in the separate fuzzystrmatch extension) |
| Graph traversal | Recursive CTE | BFS/DFS at arbitrary depth |
| Full-text search | Native tsvector | Keyword search with morphology |

One server. One pg_dump. One monitoring setup. Infrastructure simplicity is a real advantage.
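Setup is correspondingly small: recursive CTEs and tsvector are core PostgreSQL, so only two extensions need enabling (a sketch, assuming a role with the CREATE privilege on the database):

```sql
-- pgvector ships separately; pg_trgm is in contrib.
-- Recursive CTEs and tsvector require nothing extra.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;
```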

Hybrid Search: Three Signals via RRF Fusion

Combine three search mechanisms, then fuse their results:

Signal 1: Vector Search (pgvector)

SELECT id FROM kg2_nodes
WHERE embedding IS NOT NULL
ORDER BY embedding <=> $query_embedding
LIMIT 40;
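For intuition: with `vector_cosine_ops`, the `<=>` operator returns cosine distance, i.e. 1 minus cosine similarity, so `ORDER BY ... ASC` surfaces the most similar vectors first. A plain-Ruby sketch of the same math (pgvector does this in C against an HNSW index; this is for illustration only):

```ruby
# Cosine distance between two equal-length vectors:
# 1 - (u . v) / (|u| * |v|). Smaller means more similar.
def cosine_distance(u, v)
  dot   = u.zip(v).sum { |a, b| a * b }
  norms = Math.sqrt(u.sum { |a| a * a }) * Math.sqrt(v.sum { |b| b * b })
  1.0 - dot / norms
end

cosine_distance([1.0, 0.0], [1.0, 0.0])  # => 0.0 (same direction)
cosine_distance([1.0, 0.0], [0.0, 1.0])  # => 1.0 (orthogonal)
```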

Signal 2: Keyword Search (tsvector)

SELECT id FROM kg2_nodes
WHERE to_tsvector('simple', label || ' ' || COALESCE(summary, ''))
      @@ plainto_tsquery('simple', $query);
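As written, this expression forces a sequential scan. A GIN expression index makes it indexable, provided the indexed expression matches the query's `to_tsvector` call exactly (index name here is illustrative):

```sql
-- The planner only uses this index when the query repeats
-- the identical to_tsvector expression.
CREATE INDEX idx_kg2_nodes_fts ON kg2_nodes
  USING gin (to_tsvector('simple', label || ' ' || COALESCE(summary, '')));
```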

Signal 3: Graph Traversal (Recursive CTE)

WITH RECURSIVE graph AS (
  SELECT target_node_id AS node_id, 1 AS depth
  FROM kg2_edges WHERE source_node_id IN ($seed_ids) AND invalid_at IS NULL
  UNION
  SELECT e.target_node_id, g.depth + 1
  FROM kg2_edges e JOIN graph g ON e.source_node_id = g.node_id
  WHERE g.depth < $max_depth AND e.invalid_at IS NULL
)
SELECT DISTINCT node_id FROM graph;
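The CTE's semantics can be mirrored in plain Ruby, which is handy for unit-testing traversal logic without a database round-trip. A sketch over a hypothetical in-memory edge list (valid edges only, matching the `invalid_at IS NULL` filter):

```ruby
# Breadth-first expansion from seed nodes up to max_depth,
# mirroring the recursive CTE: returns every node reachable
# via 1..max_depth hops, visiting each node once.
def reachable(edges, seeds, max_depth)
  seen = {}
  frontier = seeds
  (1..max_depth).each do |depth|
    frontier = frontier.flat_map { |n| edges.fetch(n, []) }
                       .reject { |n| seen.key?(n) }
                       .uniq
    frontier.each { |n| seen[n] = depth }
    break if frontier.empty?
  end
  seen.keys
end

edges = { "a" => ["b", "c"], "b" => ["d"], "d" => ["a"] }
reachable(edges, ["a"], 2)  # => ["b", "c", "d"]
```

Note the cycle `d -> a` terminates naturally: already-seen nodes are rejected, just as the depth guard bounds the CTE.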

RRF Fusion in Ruby:

RRF_K = 60  # standard reciprocal-rank-fusion damping constant

def fuse(vector, keyword, graph, limit)
  scores = Hash.new(0.0)
  [vector, keyword, graph].each do |results|
    results.each_with_index do |node, rank|
      # rank is 0-based, so the top hit in each list contributes 1/(K + 1)
      scores[node.id] += 1.0 / (RRF_K + rank + 1)
    end
  end
  # Highest fused score first; returns [id, score] pairs
  scores.sort_by { |_, s| -s }.first(limit)
end
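To sanity-check the fusion, here is a self-contained run with stub data (the `Node` struct and ids are hypothetical, and the method is repeated so the snippet runs on its own):

```ruby
Node = Struct.new(:id)  # stand-in for the real node records

RRF_K = 60
def fuse(vector, keyword, graph, limit)
  scores = Hash.new(0.0)
  [vector, keyword, graph].each do |results|
    results.each_with_index do |node, rank|
      scores[node.id] += 1.0 / (RRF_K + rank + 1)
    end
  end
  scores.sort_by { |_, s| -s }.first(limit)
end

a, b, c = Node.new("a"), Node.new("b"), Node.new("c")
# "a" tops two signals and appears in a third, so it wins;
# "b" appears in two signals and beats "c", which appears in one.
top = fuse([a, b], [a, c], [b, a], 2)
top.map(&:first)  # => ["a", "b"]
```

A node that ranks mid-list in all three signals can outscore one that tops a single signal, which is exactly the behavior you want from fusion.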

Performance at 1,000 Nodes

| Operation | Time |
|-----------|------|
| Depth-2 CTE (1,000 nodes, 650 edges) | < 5ms |
| pgvector nearest neighbor | < 2ms |
| Keyword search | < 1ms |
| Full hybrid search | < 15ms |
| Community detection (BFS) | < 50ms |

Scaling Roadmap

| Stage | Threshold | Action |
|-------|-----------|--------|
| S0 | baseline | CTE + pgvector |
| S1 | > 5K nodes | Real embeddings, HNSW tuning |
| S2 | > 20K nodes | Materialized views, partitioning |
| S3 | > 50K nodes | Read replica |
| S4 | > 100K nodes + variable-length paths | Apache AGE extension |

Even at S4, you don’t need to migrate to Neo4j. Apache AGE adds Cypher support as a PostgreSQL extension, keeping your data in the same database.
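The S1 step ("HNSW tuning") amounts to one index plus a session knob. A sketch assuming the `kg2_nodes` table from earlier (the parameter values shown are pgvector's defaults):

```sql
-- m and ef_construction trade index build time and size for recall.
CREATE INDEX idx_kg2_nodes_embedding ON kg2_nodes
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Per-session query knob: higher means better recall, slower search.
SET hnsw.ef_search = 40;
```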

When Neo4j Actually Makes Sense

  • More than 100K nodes with variable-length paths — MATCH (a)-[*1..10]->(b) in Cypher genuinely beats CTEs here
  • Built-in graph algorithms — PageRank, centrality via GDS
  • Team already fluent in Cypher

Ask yourself: how many of these apply today?


PostgreSQL is a complete solution, not a compromise. CTE + pgvector + pg_trgm + tsvector covers the full graph stack. With 1,000 nodes, 650 edges, hybrid search, and bi-temporal modeling, there’s no justification for adding Neo4j.

Don’t expand infrastructure before the problems demand it.