
DocSignal: Hybrid RAG over Dutch Immigration Policy

2026.07.04 · RAG · hybrid retrieval · semantic search · Claude API · Next.js

DocSignal is a retrieval-augmented Q&A system over a real, messy corpus: ~40 public pages of Dutch work-visa and immigration policy. I built it while researching my own relocation, and used it as a chance to build retrieval properly instead of wiring up a tutorial.

This is a technical demo, not immigration advice. The corpus is a dated snapshot; salary thresholds and policy change constantly. For current rules, go to ind.nl.

At a glance

  • Problem: Immigration policy is dense, cross-referenced, and full of exact tokens (salary thresholds, permit names) that pure semantic search mishandles.
  • Approach: Structure-aware chunking, hybrid dense + BM25 retrieval fused with RRF, Claude Haiku re-ranking, grounded generation with citations.
  • Stack: Next.js 14 · TypeScript · bge-small-en-v1.5 embeddings · BM25 · Claude API.
  • Headline result: Hybrid retrieval hits 96.7% Recall@5 vs 93.3% dense-only and 86.7% BM25-only on 30 hand-labelled questions.

Architecture & logic

Government policy pages are deep heading trees with the highest-value facts in tables. DocSignal chunks along document structure, keeps table rows attached to their headers, and prepends every chunk's breadcrumb to its text before embedding.
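A minimal sketch of the breadcrumb and table handling described above. The `Section` and `Chunk` shapes and the `toChunks` helper are hypothetical names for illustration, and emitting one chunk per table row is one possible reading of "keeps table rows attached to their headers"; the real chunker.ts may differ.

```typescript
// Hypothetical types for illustration; not taken from chunker.ts.
interface Section {
  heading: string;
  body: string;             // prose directly under this heading
  tableHeader?: string[];   // column names, if the section holds a table
  tableRows?: string[][];
  children: Section[];
}

interface Chunk {
  breadcrumb: string;
  text: string;             // breadcrumb + content: the string that gets embedded
}

// Walk the heading tree depth-first, carrying the path of headings down.
function toChunks(section: Section, path: string[] = []): Chunk[] {
  const trail = [...path, section.heading];
  const breadcrumb = trail.join(" > ");
  const chunks: Chunk[] = [];

  const body = section.body.trim();
  if (body.length > 0) {
    chunks.push({ breadcrumb, text: `${breadcrumb}\n${body}` });
  }

  // Assumption: each table row becomes its own chunk, with the column
  // names repeated so a row never loses its header.
  const header = section.tableHeader;
  if (header && section.tableRows) {
    for (const row of section.tableRows) {
      const cells = row.map((cell, i) => `${header[i] ?? ""}: ${cell}`).join("; ");
      chunks.push({ breadcrumb, text: `${breadcrumb}\n${cells}` });
    }
  }

  for (const child of section.children) {
    chunks.push(...toChunks(child, trail));
  }
  return chunks;
}
```

Because the breadcrumb is part of the embedded text, a bare table row still carries the name of the page and section it came from.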

Immigration terminology is a worst case for pure semantic search. Embeddings map "EU Blue Card" and "highly skilled migrant" close together, which is usually helpful but catastrophic when the user asks about one specifically. Exact tokens like "TWV", "€ 4,357.00" and "150 kilometres" are where keyword search earns its keep.

collect_corpus.ts  →  chunker.ts  →  build_index.ts
                          ↓
              retrieval.ts (dense + BM25 → RRF fusion)
                          ↓
                    rerank.ts (Claude Haiku → top 5)
                          ↓
                    generate.ts (grounded answer + [n] citations)
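For readers unfamiliar with the fusion step in retrieval.ts, here is a minimal sketch of standard Reciprocal Rank Fusion. The function name `rrfFuse` is hypothetical, and `k = 60` is the commonly cited default rather than the value DocSignal uses. Any per-retriever weighting is left out.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// where rank starts at 1. Documents missing from a ranking add nothing.
// k = 60 is an assumed default, not DocSignal's tuned value.
function rrfFuse(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Usage: chunk ids as returned by each retriever, best first.
const denseRanking = ["a", "b", "c"];
const bm25Ranking = ["c", "a", "d"];
const fused = rrfFuse([denseRanking, bm25Ranking]);
// fused is ["a", "c", "b", "d"]: chunks found by both retrievers rise to the top.
```

RRF only looks at ranks, so the dense cosine scores and BM25 scores never need to be put on a common scale.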

Benchmarks & results

From npm run eval — 30 hand-labelled questions, 310 chunks, bge-small-en-v1.5 embeddings:

| Retrieval mode | Recall@5 | MRR@10 |
| --- | ---: | ---: |
| Dense-only | 93.3% | 0.837 |
| BM25-only | 86.7% | 0.751 |
| Hybrid (RRF fusion) | 96.7% | 0.851 |

Recall@5 asks, for each question, whether at least one of the hand-identified relevant chunks appears in the top 5 results. MRR@10 is the mean reciprocal rank of the first relevant chunk within the top 10; a question with no relevant chunk in the top 10 contributes 0.
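The two metrics as defined above can be sketched in a few lines. The `EvalCase` shape and the function names are assumptions for illustration, not the actual eval script.

```typescript
// Hypothetical shape: one entry per labelled question.
interface EvalCase {
  retrieved: string[]; // chunk ids returned by the retriever, best first
  relevant: string[];  // hand-labelled relevant chunk ids
}

// Fraction of questions with at least one relevant chunk in the top k.
function recallAtK(cases: EvalCase[], k: number): number {
  const hits = cases.filter((c) =>
    c.retrieved.slice(0, k).some((id) => c.relevant.indexOf(id) !== -1)
  ).length;
  return hits / cases.length;
}

// Mean of 1 / rank of the first relevant chunk within the top k (0 if none).
function mrrAtK(cases: EvalCase[], k: number): number {
  let total = 0;
  for (const c of cases) {
    const idx = c.retrieved
      .slice(0, k)
      .findIndex((id) => c.relevant.indexOf(id) !== -1);
    if (idx >= 0) total += 1 / (idx + 1);
  }
  return total / cases.length;
}
```

With 30 questions, each hit moves Recall@5 by about 3.3 points, so 26, 28 and 29 hits correspond to 86.7%, 93.3% and 96.7%.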

Honest caveats: fusion parameters were swept against this same 30-question set, so the hybrid figures are likely optimistic. Dense vs hybrid is 28 vs 29 hits out of 30, a one-question difference and not a statistically robust margin. The per-mode failure analysis on /eval is more trustworthy than the headline percentages.

What I'd do differently

Hold out a query set. Thirty labelled questions is enough to expose failure modes, not enough to tune confidently. I would split into dev/test before sweeping fusion weights.
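A sketch of what that split could look like: a seeded shuffle so the dev and test sets stay fixed across runs. The helper names, the 50/50 ratio and the seed are all assumptions; `mulberry32` is a small public-domain PRNG used here only to make the shuffle reproducible.

```typescript
// Small deterministic PRNG (mulberry32); returns values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy, then cut into test and dev portions.
function splitDevTest<T>(
  items: T[],
  testFraction: number,
  seed: number
): { dev: T[]; test: T[] } {
  const shuffled = items.slice();
  const rand = mulberry32(seed);
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = shuffled[i];
    shuffled[i] = shuffled[j];
    shuffled[j] = tmp;
  }
  const testSize = Math.round(shuffled.length * testFraction);
  return { test: shuffled.slice(0, testSize), dev: shuffled.slice(testSize) };
}
```

Fusion parameters would then be swept on `dev` only, with `test` scored once at the end.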

Cross-encoder re-ranking. Haiku re-ranking helps, but a dedicated cross-encoder (or a smaller bi-encoder fine-tuned on immigration terminology) would be the next step before scaling corpus size.

Freshness pipeline. Policy pages change. Production would need scheduled re-fetch, diff detection, and chunk invalidation — not a one-time corpus snapshot.
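The diff-detection step could be sketched as a content-hash comparison between the stored snapshot and a fresh fetch. Everything here is hypothetical: the `PageSnapshot` and `PageDiff` shapes, the `diffSnapshots` helper, and the FNV-1a style hash, which is a dependency-free stand-in for something like SHA-256.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units. A stand-in only:
// a production pipeline would more likely use a cryptographic hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

interface PageSnapshot {
  url: string;
  text: string; // extracted page text from the latest fetch
}

interface PageDiff {
  added: string[];   // new pages: chunk and index them
  changed: string[]; // content differs: invalidate old chunks, re-chunk
  removed: string[]; // no longer fetched: invalidate their chunks
}

// previous maps url -> content hash stored at the last indexing run.
function diffSnapshots(
  previous: Map<string, string>,
  fetched: PageSnapshot[]
): PageDiff {
  const diff: PageDiff = { added: [], changed: [], removed: [] };
  const seen = new Set<string>();
  for (const page of fetched) {
    seen.add(page.url);
    const oldHash = previous.get(page.url);
    if (oldHash === undefined) diff.added.push(page.url);
    else if (oldHash !== fnv1a(page.text)) diff.changed.push(page.url);
  }
  previous.forEach((_hash, url) => {
    if (!seen.has(url)) diff.removed.push(url);
  });
  return diff;
}
```

Only pages in `changed` and `removed` need their chunks invalidated, so an unchanged corpus costs a fetch and a hash per page rather than a full re-embed.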

Stack & links

  • Next.js 14 (App Router) · TypeScript · Tailwind CSS
  • bge-small-en-v1.5 (local ONNX) · BM25 · Claude API
  • GitHub: DocSignal source

Related portfolio work: LeadSignal (rules + Claude enrichment for RevOps), SignalVision (ONNX int8 quantization for browser inference).