A developer's guide to the entity resolution engine that matches messy vendor records against clean reference data — using embeddings, BM25, fuzzy matching, and a few clever tricks.
The entity resolution problem and how Melder solves it
A bank has a clean reference master of counterparties. Every day, vendor files arrive with the same entities named differently. A human can see they're the same. A computer can't — unless you teach it.
| Reference Master (Side A) | Vendor File (Side B) |
|---|---|
| Goldman Sachs International | GS Intl Ltd |
| JPMorgan Chase & Co. | JP Morgan Chase |
| Deutsche Bank AG | DB AG Frankfurt |
| Barclays Bank PLC | Barclays Bk |
Melder takes both datasets, compares every record using multiple scoring methods, and produces three outputs: auto-matched pairs (high confidence), review pairs (borderline), and unmatched records.
Everything is driven by a single YAML file. You define which fields to compare, how to compare them, and how much each comparison matters.
match_fields:
  - field_a: legal_name
    field_b: counterparty_name
    method: embedding
    weight: 0.55
  - field_a: short_name
    field_b: counterparty_name
    method: fuzzy
    weight: 0.20
  - field_a: country_code
    field_b: domicile
    method: exact
    weight: 0.20
Compare legal_name vs counterparty_name using neural embeddings (understands synonyms). This matters most — 55% of the score.
Also compare short_name vs counterparty_name using fuzzy string matching (edit distance). Worth 20%.
Check if country_code exactly equals domicile. Worth 20%. Simple but effective — eliminates cross-country false positives.
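Put together, the three entries describe a weighted sum of per-field scores. A minimal sketch of that arithmetic, assuming per-field scores in 0.0–1.0 (the `composite` helper is illustrative, not Melder's actual API):

```rust
/// Hypothetical helper: combine (score, weight) pairs into the composite.
fn composite(parts: &[(f64, f64)]) -> f64 {
    parts.iter().map(|(score, weight)| score * weight).sum()
}

fn main() {
    // embedding 0.92 x 0.55, fuzzy 0.80 x 0.20, exact 1.0 x 0.20
    let score = composite(&[(0.92, 0.55), (0.80, 0.20), (1.0, 0.20)]);
    assert!((score - 0.866).abs() < 1e-9);
}
```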
At the heart of everything is one remarkably simple type:
pub type Record = HashMap<String, String>;
A Record is just a flat key-value map. Every field is a string. No schemas, no types — just "legal_name" → "Goldman Sachs International". Every module in the codebase works with this one type.
Using HashMap<String, String> instead of typed structs means Melder works with any dataset shape — no code changes needed when field names differ between clients.
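As a quick illustration (field values taken from the table above), building and reading a Record is plain HashMap usage:

```rust
use std::collections::HashMap;

pub type Record = HashMap<String, String>;

fn main() {
    let mut rec: Record = Record::new();
    rec.insert("legal_name".into(), "Goldman Sachs International".into());
    rec.insert("country_code".into(), "GB".into());

    // Modules read fields by name; a missing field is simply None
    assert_eq!(rec.get("legal_name").map(String::as_str),
               Some("Goldman Sachs International"));
    assert_eq!(rec.get("lei"), None);
}
```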
meld run — load both CSVs, match everything in parallel via Rayon, write results.csv, review.csv, unmatched.csv. Use for initial loads and periodic reconciliation.
meld serve — start an HTTP server with both datasets preloaded. New records via API are matched instantly and results returned in the response. Use for real-time integration.
Embeddings, semantic search, and why they matter for record matching
Computers are great at comparing exact strings. But record matching needs something deeper. Consider:
"Goldman Sachs International" and "GS Intl Ltd" share almost no characters. But a human instantly knows they're the same entity. How?
You understand the meaning behind the words — "GS" is an abbreviation for "Goldman Sachs", "Intl" means "International", "Ltd" is a legal suffix. You're doing semantic reasoning, not character comparison.
This is the core insight behind embeddings: teach a machine to convert text into a mathematical representation of its meaning, so that similar meanings end up close together — even when the words look completely different.
An embedding is a list of numbers — typically 384 of them — that represent the meaning of a piece of text. Think of it as a coordinate in a 384-dimensional space.
// A neural network reads text and outputs a vector of numbers
"Goldman Sachs International" → [0.12, -0.34, 0.56, ... 384 numbers]
"GS Intl Ltd" → [0.11, -0.33, 0.55, ... 384 numbers]
"Apple Inc" → [0.87, 0.42, -0.19, ... 384 numbers]
// Goldman Sachs vectors are very close together
// Apple's vector is far away in a different direction
A neural network (the "model") reads text and produces a fixed-size list of numbers. This is the embedding.
Texts with similar meanings produce similar numbers — their vectors point in roughly the same direction.
Texts with different meanings produce different numbers — their vectors point in different directions.
The magic is that this works even when the words are completely different. The model has been trained on millions of text examples and learned that "GS" and "Goldman Sachs" appear in similar contexts — so it maps them to similar vectors.
Once you have embeddings, finding similar records becomes a geometry problem: which vectors in my database point in the same direction as my query vector?
The similarity between two embeddings is measured by the angle between them. If they point in the same direction, cosine similarity = 1.0 (identical meaning). If perpendicular, cosine similarity = 0.0 (unrelated). This is how Melder decides "how similar are these two entity names?"
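Cosine similarity itself is a few lines of arithmetic: the dot product divided by the product of the vector magnitudes. A minimal sketch, using the toy 3-dimensional vectors from the example above (the real engine works with normalized 384-dim vectors, where this reduces to a plain dot product):

```rust
/// Cosine similarity between two vectors of equal length.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    let gs_a = [0.12, -0.34, 0.56]; // "Goldman Sachs International"
    let gs_b = [0.11, -0.33, 0.55]; // "GS Intl Ltd"
    let apple = [0.87, 0.42, -0.19]; // "Apple Inc"
    assert!(cosine(&gs_a, &gs_b) > 0.99); // same entity: nearly parallel
    assert!(cosine(&gs_a, &apple) < 0.1); // unrelated: points elsewhere
}
```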
But searching through millions of vectors by comparing each one would be far too slow. That's where HNSW comes in — a graph-based index that finds the most similar vectors in O(log N) time, without checking every single one.
Bridging abbreviations, synonyms, and rephrasings. "Goldman Sachs International" ≈ "GS Intl Ltd". No shared tokens needed.
Understanding context. "Apple Inc" (technology) vs "Apple Farms" (agriculture) — the model knows these are different kinds of entities.
Exact identifiers. "LEI: 5493001KJTIIGC8Y1R12" vs "LEI: 5493001KJTIIGC8Y1R13" — one digit difference, but embeddings may see these as nearly identical. Use exact matching for IDs.
Very short text. A 2-character abbreviation like "GS" has limited context for the model to work with. BM25 (token frequency scoring) often helps here.
This is why Melder doesn't rely on embeddings alone — it combines them with BM25 (token overlap), fuzzy matching (edit distance), exact matching (for IDs), and synonym detection. Each method catches what the others miss.
# The embedding model — a small, fast neural network
embeddings:
  model: themelder/arctic-embed-xs-entity-resolution

# How embeddings are used in scoring
match_fields:
  - field_a: legal_name
    field_b: counterparty_name
    method: embedding   # ← semantic similarity
    weight: 0.50
  - method: bm25        # ← token frequency overlap
    weight: 0.50
Melder loads a small neural network (22M parameters, ~86MB) that specialises in entity name understanding.
For each record, it encodes the name field into a 384-dimensional vector. This takes ~5ms per text on CPU.
At match time, it compares vectors using cosine similarity. The embedding score is combined with BM25 using configurable weights.
Encoding is the most expensive step in the pipeline — it accounts for ~80% of per-request latency. Everything else (BM25, scoring, crossmap) takes under 2ms combined.
The modules, their responsibilities, and how they connect
Parses YAML, validates weights sum to 1.0, derives required fields. Everything starts here.
N ONNX sessions behind Mutex locks. Converts text to 384-dim vectors. Round-robin slot acquisition.
Stores embedding vectors. Flat (brute-force) or usearch (HNSW graph). Handles combined vector construction and caching.
The ONE scoring entry point: score_pool(). Blocking → candidates → full scoring → classification. All modes use this.
The confirmed match registry. Enforces strict 1:1 bijection — each A can match at most one B, and vice versa.
DashMap-backed or SQLite. Holds all records with concurrent access.
SimpleBm25 IDF-weighted text scoring with WAND early termination. Suppresses common tokens like "Holdings".
Edit-distance scorers. wratio, partial_ratio, token_sort. Catches typos and abbreviations.
Generates acronyms (Goldman Sachs → GS). Additive bonus weight.
Write-ahead log for crash recovery. Append-only NDJSON.
Non-blocking event dispatch to external subprocess via stdin.
How score_pool() finds and ranks matches — the heart of Melder
Every matching flow — batch, live, enroll — calls the same function: score_pool() in src/matching/pipeline.rs. This is a constitutional principle of the codebase.
Hash-based filter eliminates impossible pairs. A record from France is never compared against one from Japan. Eliminates 95%+ of candidates.
Vector similarity search via HNSW index. Returns top-k embedding-similar records.
Full-text token overlap query via SimpleBm25. Returns records with the most informative shared tokens.
Acronym index lookup. Finds "GS" when searching for "Goldman Sachs".
Merge all candidates, deduplicate, then score every match field (exact, fuzzy, embedding, BM25, synonym) and compute the weighted composite.
Score ≥ 0.85 → auto_match. Score ≥ 0.60 → review. Below → no_match.
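The final classification step maps the composite score to one of three buckets. A sketch using the thresholds quoted above (both cutoffs are configurable in Melder; the enum and function names here are invented for illustration):

```rust
#[derive(Debug, PartialEq)]
enum Outcome { AutoMatch, Review, NoMatch }

/// Map a composite score to a classification bucket.
fn classify(score: f64) -> Outcome {
    if score >= 0.85 {
        Outcome::AutoMatch
    } else if score >= 0.60 {
        Outcome::Review
    } else {
        Outcome::NoMatch
    }
}

fn main() {
    assert_eq!(classify(0.91), Outcome::AutoMatch);
    assert_eq!(classify(0.72), Outcome::Review);
    assert_eq!(classify(0.40), Outcome::NoMatch);
}
```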
Binary 1.0 or 0.0. Case-insensitive string equality. Best for: identifiers, country codes, LEI.
Edit-distance similarity (0.0–1.0). Catches typos and abbreviations. Uses rapidfuzz for Levenshtein.
Neural cosine similarity. Understands meaning — "GS Intl" ≈ "Goldman Sachs International". Powered by ONNX Runtime.
IDF-weighted token overlap. Common words like "Holdings" count less. Rare words like "Sachs" count more.
Acronym matching. "Goldman Sachs" → "GS". Additive bonus — doesn't dilute other scores when absent.
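The IDF weighting behind the BM25 entry can be made concrete. This sketch uses the standard BM25 IDF formula as an assumption; SimpleBm25's exact formula may differ, but the shape is the same: the fewer records a token appears in, the more it contributes.

```rust
/// Standard BM25-style inverse document frequency.
fn idf(total_docs: f64, docs_with_token: f64) -> f64 {
    ((total_docs - docs_with_token + 0.5) / (docs_with_token + 0.5) + 1.0).ln()
}

fn main() {
    let n = 10_000.0;
    let rare = idf(n, 3.0);       // "Sachs": appears in 3 records
    let common = idf(n, 4_000.0); // "Holdings": appears in 4,000 records
    assert!(rare > common);       // rare tokens dominate the score
}
```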
What happens when a record arrives via HTTP — from request to response
Multiple ONNX sessions run in parallel. Each request tries to grab any free slot:
// Try round-robin
for encoder in &self.encoders {
    if let Ok(mut guard) = encoder.try_lock() {
        return guard
            .embed(text_vec, Some(Self::ENCODE_BATCH_SIZE))
            .map_err(|e| EncoderError::Inference(e.to_string()));
    }
}
// All busy — block on slot 0
let mut guard = self.encoders[0]
    .lock()
    .map_err(|e| EncoderError::Inference(e.to_string()))?;
Try each ONNX session in order using a non-blocking lock attempt.
If a session is free, use it immediately to encode text into a vector.
If all sessions are busy with other requests, fall back to waiting for session #0 to become free. This guarantees progress — no request waits forever.
When a record is re-upserted with the same text in its embedding fields, Melder skips the expensive ONNX encoding entirely using a cheap hash check:
pub fn compute_text_hash(record: &Record, emb_specs: &[(String, String, f32)]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV-1a offset basis
    for (field_a, _field_b, _weight) in emb_specs {
        // A-side record shown; B-side records hash field_b instead
        let text = record.get(field_a).map(String::as_str).unwrap_or("");
        for byte in text.bytes() {
            h ^= byte as u64;
            h = h.wrapping_mul(0x00000100000001b3); // FNV-1a prime
        }
    }
    h
}
Compute a fast FNV-1a hash of all embedding field texts.
If this hash matches what's stored from the last encoding, skip the ~4ms ONNX call entirely.
This gives ~20% throughput gain for updates that change non-embedding fields (e.g., address, phone).
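The skip behavior can be shown end to end. This standalone sketch re-implements the FNV-1a fold over a chosen set of fields (the field names and the `text_hash` helper are illustrative, not Melder's exact signature):

```rust
use std::collections::HashMap;

type Record = HashMap<String, String>;

/// FNV-1a over the texts of the embedding-bearing fields only.
fn text_hash(record: &Record, emb_fields: &[&str]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for f in emb_fields {
        if let Some(text) = record.get(*f) {
            for byte in text.bytes() {
                h ^= byte as u64;
                h = h.wrapping_mul(0x100000001b3);
            }
        }
    }
    h
}

fn main() {
    let mut rec: Record = Record::new();
    rec.insert("counterparty_name".into(), "Goldman Sachs Intl".into());
    rec.insert("phone".into(), "+44 20 7000 0000".into());
    let before = text_hash(&rec, &["counterparty_name"]);

    // Changing a non-embedding field leaves the hash untouched: no re-encode
    rec.insert("phone".into(), "+44 20 7111 1111".into());
    assert_eq!(before, text_hash(&rec, &["counterparty_name"]));

    // Changing the name changes the hash: re-encode required
    rec.insert("counterparty_name".into(), "GS Intl Ltd".into());
    assert_ne!(before, text_hash(&rec, &["counterparty_name"]));
}
```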
The original Tantivy-backed BM25 index serialized concurrent workers through a write lock on every commit. Replacing it with a custom DashMap-based scorer (SimpleBm25) eliminated all locking — writes are instantly visible with no commit step. Throughput jumped from 461 → 1,460 req/s while removing ~40 transitive dependencies.
if blocked_ids.len() <= self.exhaustive_threshold {
    self.score_exhaustive(&term_idfs, blocked_ids, avg_dl, top_k)
} else {
    self.score_wand(&term_idfs, blocked_ids, avg_dl, top_k)
}
For small candidate pools (≤5,000 records), score every candidate exhaustively — it's fast enough and simple.
For large pools, switch to Block-Max WAND: walk posting lists and skip any document whose upper-bound score can't beat the current Kth-best. Guaranteed to return the same top-K results, but processes only 1-10% of documents.
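The skip rule can be shown in miniature: a document is only fully scored when the sum of its terms' score upper bounds can beat the current k-th best score. Toy numbers below, not Melder's posting-list layout:

```rust
/// Toy WAND pruning: keep only candidates whose upper bound can beat
/// the current k-th best; everything else is skipped unscored.
fn survivors(docs: &[(u32, f64)], kth_best: f64) -> Vec<u32> {
    docs.iter()
        .filter(|(_, upper_bound)| *upper_bound > kth_best)
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    // (doc_id, sum of per-term score upper bounds); current k-th best = 5.0
    let docs = [(1, 7.5), (2, 3.0), (3, 6.1)];
    // Doc 2 is skipped without ever touching its term frequencies
    assert_eq!(survivors(&docs, 5.0), vec![1, 3]);
}
```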
Engineering patterns that make Melder fast, correct, and resilient
The CrossMap enforces that each A record matches at most one B record, and vice versa. The implementation is deceptively simple:
fn claim(&self, a_id: &str, b_id: &str) -> bool {
    let mut g = self.inner.write()
        .unwrap_or_else(|e| e.into_inner());
    if g.a_to_b.contains_key(a_id)
        || g.b_to_a.contains_key(b_id)
    {
        return false;
    }
    g.a_to_b.insert(a_id.to_string(), b_id.to_string());
    g.b_to_a.insert(b_id.to_string(), a_id.to_string());
    true
}
Take a write lock on the inner state (two plain HashMaps: a→b and b→a).
Check both directions — if either ID is already taken, reject the claim.
Insert into both maps atomically. The single lock guarantees no race condition between the check and the insert.
DashMap uses per-shard locks. To check both directions atomically, you'd need locks on two different shards simultaneously — risking deadlock if two threads acquire them in opposite order. A single RwLock over two plain HashMaps avoids this entirely.
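The claim semantics are easy to exercise. A self-contained sketch of the same pattern (a single RwLock over two plain maps; struct and method names mirror the excerpt but this is not Melder's exact code):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct CrossMap {
    inner: RwLock<(HashMap<String, String>, HashMap<String, String>)>, // (a→b, b→a)
}

impl CrossMap {
    fn new() -> Self {
        CrossMap { inner: RwLock::new((HashMap::new(), HashMap::new())) }
    }

    /// Atomically claim the pair; false if either side is already taken.
    fn claim(&self, a: &str, b: &str) -> bool {
        let mut g = self.inner.write().unwrap();
        if g.0.contains_key(a) || g.1.contains_key(b) {
            return false;
        }
        g.0.insert(a.to_string(), b.to_string());
        g.1.insert(b.to_string(), a.to_string());
        true
    }
}

fn main() {
    let m = CrossMap::new();
    assert!(m.claim("A1", "B7"));  // first claim wins
    assert!(!m.claim("A1", "B9")); // A1 already matched
    assert!(!m.claim("A2", "B7")); // B7 already matched
    assert!(m.claim("A2", "B9"));  // both sides free
}
```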
Instead of N separate vector indices (one per embedding field), Melder stores ONE combined vector per record. Each per-field vector is scaled by the square root of its weight before concatenation.
dot(combined_A, combined_B) = Σ weight_i × cosine(field_i_A, field_i_B)
A single ANN query returns the correct weighted similarity. No second ONNX call needed — the per-field cosines are decomposed algebraically from the combined vectors.
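The identity above can be checked numerically: scale each field vector by the square root of its weight, concatenate, and a single dot product over the combined vectors equals the weighted sum of per-field dot products (which are cosines when the field vectors are unit length). A sketch with made-up 2-dim "field embeddings":

```rust
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Concatenate field vectors, each scaled by sqrt(weight).
fn combine(fields: &[(&[f64], f64)]) -> Vec<f64> {
    let mut out = Vec::new();
    for (v, w) in fields {
        out.extend(v.iter().map(|x| x * w.sqrt()));
    }
    out
}

fn main() {
    // Two unit-length field vectors per record, weights 0.7 and 0.3
    let (a1, a2) = ([0.6, 0.8], [1.0, 0.0]);
    let (b1, b2) = ([0.8, 0.6], [0.0, 1.0]);
    let (w1, w2) = (0.7, 0.3);

    let ca = combine(&[(&a1, w1), (&a2, w2)]);
    let cb = combine(&[(&b1, w1), (&b2, w2)]);

    let lhs = dot(&ca, &cb);                               // one ANN-style dot
    let rhs = w1 * dot(&a1, &b1) + w2 * dot(&a2, &b2);     // weighted cosines
    assert!((lhs - rhs).abs() < 1e-12);
}
```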
pub fn append(&self, event: &WalEvent) -> io::Result<()> {
    let wrapped = Timestamped {
        ts: iso8601_now(),
        event,
    };
    let line = serde_json::to_string(&wrapped)?;
    let mut w = self.writer.lock()?;
    w.write_all(line.as_bytes())?;
    w.write_all(b"\n")?;
    Ok(())
}
Wrap the event with an ISO timestamp.
Serialize to JSON, then append one line to the log file under a mutex.
On restart, the WAL is replayed to recover all upserts and crossmap changes since the last clean shutdown.
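Append-only recovery is simple by construction: each event is one line, and replay is just re-reading the lines in order. A miniature sketch (real Melder serializes JSON with a timestamp wrapper; here each line is a plain "op,payload" pair for illustration):

```rust
/// Replay an append-only log: one event per non-empty line.
fn replay(log: &str) -> Vec<(&str, &str)> {
    log.lines()
        .filter(|l| !l.is_empty())
        .filter_map(|l| l.split_once(','))
        .collect()
}

fn main() {
    let mut wal = String::new();
    for ev in ["upsert,CP-001", "upsert,CP-002", "claim,A1:B7"] {
        wal.push_str(ev);
        wal.push('\n'); // appended and flushed, never rewritten
    }
    let events = replay(&wal);
    assert_eq!(events.len(), 3);
    assert_eq!(events[0], ("upsert", "CP-001"));
}
```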
Field names + weights + quantization are hashed into the cache filename. Any config change makes the old path unreachable.
JSON file recording model name, blocking hash, record count, source fingerprint. Catches model swaps.
O(N) comparison with no ONNX calls. Only changed records get re-encoded. Unchanged records keep their cached vectors.
Together, these give near-instant warm starts with automatic stale-cache detection. No manual cache busting needed.
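The config-keyed cache naming can be sketched in a few lines: hash the settings that determine vector contents, and bake the digest into the path. The hashed inputs and the path layout below are assumptions for illustration, not Melder's exact scheme:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a cache path from the config values that affect vectors.
fn cache_path(fields: &[(&str, f64)], quantized: bool) -> String {
    let mut h = DefaultHasher::new();
    for (name, weight) in fields {
        name.hash(&mut h);
        weight.to_bits().hash(&mut h); // hash the exact bit pattern
    }
    quantized.hash(&mut h);
    format!("cache/embeddings-{:016x}.bin", h.finish())
}

fn main() {
    let p1 = cache_path(&[("legal_name", 0.55)], false);
    let p2 = cache_path(&[("legal_name", 0.60)], false); // weight tweaked
    assert_ne!(p1, p2); // the old cache file is simply never looked up again
}
```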
Tech stack, feature flags, and where to start contributing
| Crate | Role |
|---|---|
| axum | HTTP framework. Type-safe routing, JSON extraction, state via Arc<Session>. |
| tokio | Async runtime. CPU-bound work via spawn_blocking. Graceful shutdown via select!. |
| fastembed + ort | ONNX Runtime wrapper. Downloads models, tokenizes text, runs neural network inference. |
| usearch | HNSW approximate nearest neighbor index. O(log N) vector search. Feature-gated. |
| dashmap | Concurrent HashMap. Lock-free reads, per-shard writes. Used for record storage and the BM25 index. |
| rayon | Data parallelism. Batch mode scores B records across all CPU cores simultaneously. |
| rusqlite | SQLite bindings. Alternative to DashMap for durable storage at million-record scale. |
| rapidfuzz | Edit distance library. Provides the core Levenshtein ratio for fuzzy matching. |

| Feature flag | Effect |
|---|---|
| usearch | Enables the HNSW vector index backend. Without it, only the O(N) flat scan is available. Required for production workloads. |
| parquet-format | Enables loading datasets from Apache Parquet files. Adds the parquet and arrow crate dependencies. |
| simd | Hardware-accelerated dot product via SimSIMD. Uses NEON (ARM) or AVX2/AVX-512 (x86) for faster vector comparisons. |
Standard production build: cargo build --release --features usearch
The single source of truth. Architecture, module map, current work, backlog.
cargo test --all-features — all 396 tests should pass in under 1 second.
python3 benchmarks/live/10kx10k_inject3k_usearch/warm/run_test.py — see the engine in action.
Start at api/handlers.rs::upsert_handler and follow the calls through Session → Pipeline → CrossMap.
vault/todo.md has the task list. Pick something that interests you.
How to build, run benchmarks, and watch Melder match records in your terminal
You need Rust installed (rustup.rs). Then from the project root:
# Standard production build
cargo build --release --features usearch
# With all features
cargo build --release --features usearch,parquet-format
# Run the tests
cargo test --all-features
Build the optimized binary with the HNSW vector index. This is the build you want for any real workload. Output: ./target/release/meld
Also enable Parquet dataset loading (adds ~30s compile time for the arrow crate).
Run all 396 tests. Should complete in under 1 second. Always run this before committing.
Batch benchmarks live in benchmarks/batch/. Each has a cold/ run (builds indices from scratch) and a warm/ run (loads cached indices).
python3 benchmarks/batch/10kx10k_usearch/warm/run_test.py
Loads 10k records on each side, matches them in parallel, and prints throughput + timing. First run builds the embedding cache (~17s); second run loads from cache (~0.3s).
Look at benchmarks/batch/10kx10k_usearch/warm/output/ for results.csv (confirmed matches), review.csv (borderline pairs), and unmatched.csv (no match found).
python3 benchmarks/batch/run_all_tests.py
Runs every batch benchmark (cold then warm) and prints a summary table. Budget ~45-60 minutes on Apple Silicon.
Live benchmarks start an HTTP server, inject records via API, and measure throughput and latency. They live in benchmarks/live/.
python3 benchmarks/scripts/smoke_test.py --binary ./target/release/meld --config benchmarks/live/10kx10k_inject3k_usearch/warm/config.yaml
Starts the server, sends 10 upserts, prints each response with latency, then stops. Use this to verify everything works before running longer tests.
python3 benchmarks/live/10kx10k_inject3k_usearch/warm/run_test.py
Injects 3,000 records with 10 concurrent workers. Prints per-operation latencies (p50/p95/p99) and overall throughput. Expect ~1,500 req/s on Apple Silicon.
python3 benchmarks/live/10kx10k_inject50k_usearch/warm/run_test.py
50,000 injections with CPU/GPU monitoring. Uses the production scoring config (Arctic-embed-xs + BM25). Prints a resource utilization summary at the end.
You can also start the server yourself and send requests with curl:
# Start the server
./target/release/meld serve \
  --config benchmarks/live/10kx10k_inject3k_usearch/warm/config.yaml \
  --port 8090
Loads 10k records per side, builds (or loads cached) embedding indices, starts listening on port 8090. First cold start takes ~18s; subsequent warm starts ~1-2s.
# Send a B-side record
curl -s -X POST http://localhost:8090/upsert \
  -H 'Content-Type: application/json' \
  -d '{
    "side": "B",
    "record": {
      "counterparty_id": "CP-TEST-001",
      "counterparty_name": "Goldman Sachs Intl",
      "domicile": "GB"
    }
  }' | python3 -m json.tool
The response JSON includes the match status, the best matches with scores and per-field breakdowns, and the classification (auto/review/no_match). Try different names and watch how the scores change!
Requests per second. The headline number. Production target: 1,000+ req/s at 10k records per side.
How long each request takes. p50 = median, p95 = 95th percentile (tail), p99 = worst 1%. For live mode, p50 should be under 10ms.
new_a/new_b (inserts), upd_a_emb/upd_b_emb (embedding field changes), upd_a_field/upd_b_field (non-embedding changes). Encoding ops are slower because they need ONNX inference.
The 50k benchmark monitors CPU (per-process and per-core), GPU, and memory. Look for whether the bottleneck is CPU (encoding), GPU, or lock contention.
Each benchmark is self-contained: config.yaml (scoring config), run_test.py (test runner), cache/ (embedding indices), output/ (results CSVs), wal/ (write-ahead log). The helper scripts in benchmarks/scripts/ can also connect to a server you started manually with --no-serve.