Inside Melder

A developer's guide to the entity resolution engine that matches messy vendor records against clean reference data — using embeddings, BM25, fuzzy matching, and a few clever tricks.

~27,000 lines of Rust · Single crate · Two modes
Scroll to begin ↓
01

What Melder Does and Why

The entity resolution problem and how Melder solves it

The Problem: Names Don't Match

A bank has a clean reference master of counterparties. Every day, vendor files arrive with the same entities named differently. A human can see they're the same. A computer can't — unless you teach it.

Reference Master (Side A)        Vendor File (Side B)
Goldman Sachs International      GS Intl Ltd
JPMorgan Chase & Co.             JP Morgan Chase
Deutsche Bank AG                 DB AG Frankfurt
Barclays Bank PLC                Barclays Bk

Melder takes both datasets, compares every record using multiple scoring methods, and produces three outputs: auto-matched pairs (high confidence), review pairs (borderline), and unmatched records.

The Config: Your Scoring Recipe

Everything is driven by a single YAML file. You define which fields to compare, how to compare them, and how much each comparison matters.

YAML
match_fields:
  - field_a: legal_name
    field_b: counterparty_name
    method: embedding
    weight: 0.55
  - field_a: short_name
    field_b: counterparty_name
    method: fuzzy
    weight: 0.20
  - field_a: country_code
    field_b: domicile
    method: exact
    weight: 0.20
PLAIN ENGLISH

Compare legal_name vs counterparty_name using neural embeddings (understands synonyms). This matters most — 55% of the score.

Also compare short_name vs counterparty_name using fuzzy string matching (edit distance). Worth 20%.

Check if country_code exactly equals domicile. Worth 20%. Simple but effective — eliminates cross-country false positives.
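The composite is a weighted sum of the per-field scores. A minimal sketch with illustrative numbers (the 0.92/0.40/1.00 scores are hypothetical, not output from a real Melder run):

```rust
// Worked example of the weighted composite for one candidate pair.
fn weighted_composite(scores_and_weights: &[(f64, f64)]) -> f64 {
    scores_and_weights.iter().map(|(s, w)| s * w).sum()
}

fn main() {
    // Hypothetical per-field scores for "Goldman Sachs International" vs "GS Intl Ltd":
    // embedding 0.92 (weight 0.55), fuzzy 0.40 (weight 0.20), exact 1.00 (weight 0.20).
    let composite = weighted_composite(&[(0.92, 0.55), (0.40, 0.20), (1.00, 0.20)]);
    println!("{composite:.3}");
}
```

With these numbers the composite works out to 0.506 + 0.080 + 0.200 = 0.786, which would land in the review band under the thresholds described later.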

The Core Data Type

At the heart of everything is one remarkably simple type:

src/models.rs
pub type Record = HashMap<String, String>;
PLAIN ENGLISH

A Record is just a flat key-value map. Every field is a string. No schemas, no types — just "legal_name" → "Goldman Sachs International". Every module in the codebase works with this one type.

💡
Design Choice

Using HashMap<String, String> instead of typed structs means Melder works with any dataset shape — no code changes needed when field names differ between clients.
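A minimal sketch of what working with this shape looks like (`sample_record` is a hypothetical helper, not part of the codebase):

```rust
use std::collections::HashMap;

// Melder's core type, as defined in src/models.rs.
pub type Record = HashMap<String, String>;

// Hypothetical helper; field names mirror the examples above.
pub fn sample_record() -> Record {
    let mut r = Record::new();
    r.insert("legal_name".to_string(), "Goldman Sachs International".to_string());
    r.insert("country_code".to_string(), "GB".to_string());
    r
}

fn main() {
    let r = sample_record();
    // Fields are looked up by name; an absent field is simply None.
    assert_eq!(r.get("country_code").map(String::as_str), Some("GB"));
    assert!(r.get("lei").is_none());
}
```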

Two Modes of Operation

📦

Batch Mode

meld run — load both CSVs, match everything in parallel via Rayon, write results.csv, review.csv, unmatched.csv. Use for initial loads and periodic reconciliation.

🔄

Live Mode

meld serve — start an HTTP server with both datasets preloaded. New records via API are matched instantly and results returned in the response. Use for real-time integration.

Why can't you use simple string equality to match records across datasets?

02

How Machines Understand Meaning

Embeddings, semantic search, and why they matter for record matching

The Problem with Words

Computers are great at comparing exact strings. But record matching needs something deeper. Consider:

🤔

"Goldman Sachs International" and "GS Intl Ltd" share almost no characters. But a human instantly knows they're the same entity. How?

You understand the meaning behind the words — "GS" is an abbreviation for "Goldman Sachs", "Intl" means "International", "Ltd" is a legal suffix. You're doing semantic reasoning, not character comparison.

This is the core insight behind embeddings: teach a machine to convert text into a mathematical representation of its meaning, so that similar meanings end up close together — even when the words look completely different.

What is an Embedding?

An embedding is a list of numbers — typically 384 of them — that represent the meaning of a piece of text. Think of it as a coordinate in a 384-dimensional space.

CONCEPTUAL
// A neural network reads text and outputs a vector of numbers
"Goldman Sachs International"  →  [0.12, -0.34, 0.56, ... 384 numbers]
"GS Intl Ltd"                  →  [0.11, -0.33, 0.55, ... 384 numbers]
"Apple Inc"                    →  [0.87, 0.42, -0.19, ... 384 numbers]

// Goldman Sachs vectors are very close together
// Apple's vector is far away in a different direction
PLAIN ENGLISH

A neural network (the "model") reads text and produces a fixed-size list of numbers. This is the embedding.

Texts with similar meanings produce similar numbers — their vectors point in roughly the same direction.

Texts with different meanings produce different numbers — their vectors point in different directions.

The magic is that this works even when the words are completely different. The model has been trained on millions of text examples and learned that "GS" and "Goldman Sachs" appear in similar contexts — so it maps them to similar vectors.

Semantic Search: Finding Similar Meanings

Once you have embeddings, finding similar records becomes a geometry problem: which vectors in my database point in the same direction as my query vector?

📐
Cosine Similarity

The similarity between two embeddings is measured by the angle between them. If they point in the same direction, cosine similarity = 1.0 (identical meaning). If perpendicular, cosine similarity = 0.0 (unrelated). This is how Melder decides "how similar are these two entity names?"
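Cosine similarity itself is only a few lines. A self-contained sketch (not Melder's actual implementation, which operates on stored index vectors):

```rust
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1.0 means identical direction, 0.0 means orthogonal (unrelated).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // Same direction, different magnitude: similarity is still 1.0.
    assert!((cosine_similarity(&[1.0, 0.0], &[2.0, 0.0]) - 1.0).abs() < 1e-6);
    // Perpendicular vectors: similarity is 0.0.
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
}
```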

But searching through millions of vectors by comparing each one would be far too slow. That's where HNSW comes in — a graph-based index that finds the most similar vectors in O(log N) time, without checking every single one.

What Embeddings Are Good (and Bad) At

Great at

Bridging abbreviations, synonyms, and rephrasings. "Goldman Sachs International" ≈ "GS Intl Ltd". No shared tokens needed.

Great at

Understanding context. "Apple Inc" (technology) vs "Apple Farms" (agriculture) — the model knows these are different kinds of entities.

Weak at

Exact identifiers. "LEI: 5493001KJTIIGC8Y1R12" vs "LEI: 5493001KJTIIGC8Y1R13" — one digit difference, but embeddings may see these as nearly identical. Use exact matching for IDs.

Weak at

Very short text. A 2-character abbreviation like "GS" has limited context for the model to work with. BM25 (token frequency scoring) often helps here.

This is why Melder doesn't rely on embeddings alone — it combines them with BM25 (token overlap), fuzzy matching (edit distance), exact matching (for IDs), and synonym detection. Each method catches what the others miss.

Embeddings in Melder

config.yaml
# The embedding model — a small, fast neural network
embeddings:
  model: themelder/arctic-embed-xs-entity-resolution

# How embeddings are used in scoring
match_fields:
  - field_a: legal_name
    field_b: counterparty_name
    method: embedding     # ← semantic similarity
    weight: 0.50
  - method: bm25           # ← token frequency overlap
    weight: 0.50
PLAIN ENGLISH

Melder loads a small neural network (22M parameters, ~86MB) that specialises in entity name understanding.

For each record, it encodes the name field into a 384-dimensional vector. This takes ~5ms per text on CPU.

At match time, it compares vectors using cosine similarity. The embedding score is combined with BM25 using configurable weights.

Encoding is the most expensive step in the pipeline — it accounts for ~80% of per-request latency. Everything else (BM25, scoring, crossmap) takes under 2ms combined.

Two entity names share zero words in common: "JPM Chase" and "J.P. Morgan Chase & Co." — which scoring method would still recognise them as the same entity?

03

Meet the Cast

The modules, their responsibilities, and how they connect

The Source Tree

src/Single crate, ~27k lines
main.rsCLI entry — dispatches to cmd_run, cmd_serve, etc.
models.rsRecord, Side, MatchResult, Classification
error.rsTyped error hierarchy (thiserror)
config/YAML parsing, validation, derived fields
encoder/ONNX embedding inference pool
vectordb/Vector index: flat (O(N)) or usearch HNSW (O(log N))
matching/The scoring pipeline: blocking, candidates, score_pool
scoring/Per-field scorers: exact, fuzzy, embedding, numeric
bm25/Lock-free BM25 scoring with WAND early termination
fuzzy/Edit-distance scorers: wratio, partial_ratio, token_sort
synonym/Acronym generation + dictionary matching
crossmap/Bidirectional 1:1 match registry (A↔B)
store/Record storage: DashMap (memory) or SQLite
state/Runtime state, WAL, startup orchestration
session/Live session: upsert, match, claim, respond
api/Axum HTTP server and handlers
hooks/Pipeline event hooks (subprocess NDJSON)
cli/One file per subcommand (run, serve, tune, etc.)
batch/Rayon-parallel batch matching engine
data/Dataset loaders: CSV, JSONL, Parquet

The Main Actors

⚙️
Config

Parses YAML, validates weights sum to 1.0, derives required fields. Everything starts here.

🧠
Encoder Pool

N ONNX sessions behind Mutex locks. Converts text to 384-dim vectors. Round-robin slot acquisition.

📐
VectorDB

Stores embedding vectors. Flat (brute-force) or usearch (HNSW graph). Handles combined vector construction and caching.

🎯
Pipeline

The ONE scoring entry point: score_pool(). Blocking → candidates → full scoring → classification. All modes use this.

🔗
CrossMap

The confirmed match registry. Enforces strict 1:1 bijection — each A can match at most one B, and vice versa.

The Supporting Cast

💾

Store

DashMap-backed or SQLite. Holds all records with concurrent access.

📝

BM25

SimpleBm25 IDF-weighted text scoring with WAND early termination. Suppresses common tokens like "Holdings".

🔤

Fuzzy

Edit-distance scorers. wratio, partial_ratio, token_sort. Catches typos and abbreviations.

🔁

Synonym

Generates acronyms (Goldman Sachs → GS). Additive bonus weight.

📋

WAL

Write-ahead log for crash recovery. Append-only NDJSON.

🪝

Hooks

Non-blocking event dispatch to external subprocess via stdin.

04

The Scoring Pipeline

How score_pool() finds and ranks matches — the heart of Melder

One Pipeline to Rule Them All

Every matching flow — batch, live, enroll — calls the same function: score_pool() in src/matching/pipeline.rs. This is a constitutional principle of the codebase.

1
Blocking

A hash-based filter removes impossible pairs: a record from France is never compared against one from Japan. This typically eliminates 95%+ of candidates.

2
ANN Candidates

Vector similarity search via HNSW index. Returns top-k embedding-similar records.

3
BM25 Candidates

Full-text token overlap query via SimpleBm25. Returns records with the most informative shared tokens.

4
Synonym Candidates

Acronym index lookup. Finds "GS" when searching for "Goldman Sachs".

5
Union & Full Scoring

Merge all candidates, deduplicate, then score every match field (exact, fuzzy, embedding, BM25, synonym) and compute the weighted composite.

6
Classification

Score ≥ 0.85 → auto_match. Score ≥ 0.60 → review. Below → no_match.
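The classification step maps directly onto these thresholds. A sketch (the enum and function names here are illustrative, not necessarily Melder's):

```rust
#[derive(Debug, PartialEq)]
pub enum Classification {
    AutoMatch,
    Review,
    NoMatch,
}

// Thresholds as described in step 6: >= 0.85 auto, >= 0.60 review, else no match.
pub fn classify(score: f64) -> Classification {
    if score >= 0.85 {
        Classification::AutoMatch
    } else if score >= 0.60 {
        Classification::Review
    } else {
        Classification::NoMatch
    }
}

fn main() {
    assert_eq!(classify(0.91), Classification::AutoMatch);
    assert_eq!(classify(0.72), Classification::Review);
    assert_eq!(classify(0.30), Classification::NoMatch);
}
```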

The Scoring Methods

Exact

Binary 1.0 or 0.0. Case-insensitive string equality. Best for: identifiers, country codes, LEI.

🔤

Fuzzy

Edit-distance similarity (0.0–1.0). Catches typos and abbreviations. Uses rapidfuzz for Levenshtein.

🧠

Embedding

Neural cosine similarity. Understands meaning — "GS Intl" ≈ "Goldman Sachs International". Powered by ONNX Runtime.

📊

BM25

IDF-weighted token overlap. Common words like "Holdings" count less. Rare words like "Sachs" count more.

🔁

Synonym

Acronym matching. "Goldman Sachs" → "GS". Additive bonus — doesn't dilute other scores when absent.
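The "rare words count more" behavior of BM25 comes from IDF weighting. A sketch using the textbook formula (SimpleBm25's exact variant may differ):

```rust
// Textbook BM25 IDF: rare terms get large weights, ubiquitous terms near zero.
fn idf(total_docs: f64, docs_with_term: f64) -> f64 {
    ((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0).ln()
}

fn main() {
    // Suppose "Sachs" appears in 3 of 10,000 records, "Holdings" in 4,000.
    let rare = idf(10_000.0, 3.0);
    let common = idf(10_000.0, 4_000.0);
    // The rare token dominates any overlap score it participates in.
    assert!(rare > common);
}
```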

How Candidates Are Found

During a match for "GS Intl Ltd", the candidate generators work in concert: the ANN index surfaces embedding-similar records, BM25 surfaces records sharing informative tokens, and the synonym index surfaces acronym hits; the union of all three is then fully scored.

A record has an unusual company name that no other record shares any tokens with. Which scoring method is most likely to find its match?

05

The Live Upsert Journey

What happens when a record arrives via HTTP — from request to response

The Full Data Flow

A B-side upsert travels through five stages:

🌐 HTTP → 🧠 Encoder → 💾 Store → 🎯 Pipeline → 🔗 CrossMap

The Encoder Pool Pattern

Multiple ONNX sessions run in parallel. Each request tries to grab any free slot:

src/encoder/mod.rs
// Try round-robin
for encoder in &self.encoders {
    if let Ok(mut guard) = encoder.try_lock() {
        return guard
            .embed(text_vec, Some(Self::ENCODE_BATCH_SIZE))
            .map_err(|e| EncoderError::Inference(e.to_string()));
    }
}
// All busy — block on slot 0
let mut guard = self.encoders[0]
    .lock()
    .map_err(|e| ...)?;
PLAIN ENGLISH

Try each ONNX session in order using a non-blocking lock attempt.

If a session is free, use it immediately to encode text into a vector.

If all sessions are busy with other requests, fall back to waiting for session #0 to become free. This guarantees progress — no request waits forever.

The Text-Hash Skip

When a record is re-upserted with the same text in its embedding fields, Melder skips the expensive ONNX encoding entirely using a cheap hash check:

src/vectordb/texthash.rs
pub fn compute_text_hash(record: &Record, ...) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV-1a offset basis
    for (field_a, field_b, _weight) in emb_specs {
        // field_a for A-side records, field_b for B-side (side selection elided)
        let text = record.get(field_a)...;
        for byte in text.bytes() {
            h ^= byte as u64;
            h = h.wrapping_mul(0x00000100000001b3); // FNV-1a prime
        }
    }
    h
}
PLAIN ENGLISH

Compute a fast FNV-1a hash of all embedding field texts.

If this hash matches what's stored from the last encoding, skip the ~4ms ONNX call entirely.

This gives ~20% throughput gain for updates that change non-embedding fields (e.g., address, phone).
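The hash core is plain FNV-1a with the standard 64-bit offset basis and prime, so it can be exercised on its own:

```rust
// FNV-1a over a byte string, using the standard 64-bit constants.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x00000100000001b3);
    }
    h
}

fn main() {
    // Same text always hashes the same; any byte change flips the hash,
    // which is exactly what the skip check relies on.
    assert_eq!(fnv1a(b"Goldman Sachs"), fnv1a(b"Goldman Sachs"));
    assert_ne!(fnv1a(b"Goldman Sachs"), fnv1a(b"Goldman Sach"));
}
```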

SimpleBm25: Lock-Free with WAND

🚀
Performance Win: 3.2x Throughput

The original Tantivy-backed BM25 index serialized concurrent workers through a write lock on every commit. Replacing it with a custom DashMap-based scorer (SimpleBm25) removed the global lock: writes are instantly visible with no commit step. Throughput jumped from 461 → 1,460 req/s while removing ~40 transitive dependencies.

src/bm25/simple.rs
if blocked_ids.len() <= self.exhaustive_threshold {
    self.score_exhaustive(&term_idfs, blocked_ids, avg_dl, top_k)
} else {
    self.score_wand(&term_idfs, blocked_ids, avg_dl, top_k)
}
PLAIN ENGLISH

For small candidate pools (≤5,000 records), score every candidate exhaustively — it's fast enough and simple.

For large pools, switch to Block-Max WAND: walk posting lists and skip any document whose upper-bound score can't beat the current Kth-best. Guaranteed to return the same top-K results, but processes only 1-10% of documents.
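The skip condition at the heart of WAND can be sketched in a few lines (a simplification; Block-Max WAND also tracks per-block maxima and a moving pivot):

```rust
// Simplified WAND-style pruning: a document is fully scored only when the
// sum of its query terms' precomputed score upper bounds can beat the
// current Kth-best score.
fn can_beat_threshold(term_upper_bounds: &[f64], threshold: f64) -> bool {
    term_upper_bounds.iter().sum::<f64>() > threshold
}

fn main() {
    let bounds = [1.2, 0.4]; // max possible contribution of each query term
    assert!(can_beat_threshold(&bounds, 1.5));  // 1.6 > 1.5: score it fully
    assert!(!can_beat_threshold(&bounds, 2.0)); // 1.6 <= 2.0: safe to skip
}
```

Because the bound is an overestimate, skipping is always safe: no document that could reach the top-K is ever pruned, which is why the results match exhaustive scoring exactly.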

06

The Clever Tricks

Engineering patterns that make Melder fast, correct, and resilient

CrossMap: Bijection Under One Lock

The CrossMap enforces that each A record matches at most one B record, and vice versa. The implementation is deceptively simple:

src/crossmap/memory.rs
fn claim(&self, a_id: &str, b_id: &str) -> bool {
    let mut g = self.inner.write()
        .unwrap_or_else(|e| e.into_inner());
    if g.a_to_b.contains_key(a_id)
        || g.b_to_a.contains_key(b_id)
    {
        return false;
    }
    g.a_to_b.insert(a_id.to_string(), b_id.to_string());
    g.b_to_a.insert(b_id.to_string(), a_id.to_string());
    true
}
PLAIN ENGLISH

Take a write lock on the inner state (two plain HashMaps: a→b and b→a).

Check both directions — if either ID is already taken, reject the claim.

Insert into both maps atomically. The single lock guarantees no race condition between the check and the insert.

🔒
Why Not DashMap?

DashMap uses per-shard locks. To check both directions atomically, you'd need locks on two different shards simultaneously — risking deadlock if two threads acquire them in opposite order. A single RwLock over two plain HashMaps avoids this entirely.

Combined Vector with sqrt(w) Scaling

Instead of N separate vector indices (one per embedding field), Melder stores ONE combined vector per record. Each per-field vector is scaled by the square root of its weight before concatenation.

The Mathematical Identity

dot(combined_A, combined_B) = Σ weight_i × cosine(field_i_A, field_i_B)

A single ANN query returns the correct weighted similarity. No second ONNX call needed — the per-field cosines are decomposed algebraically from the combined vectors.
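Assuming each per-field embedding is unit-normalized (so dot product equals cosine), the identity can be checked numerically; `normalize` and `combined` below are illustrative helpers, not Melder's code:

```rust
// Demonstrates: dot(combined_A, combined_B) = sum_i w_i * cosine(field_i_A, field_i_B)
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn normalize(v: &[f64]) -> Vec<f64> {
    let n = dot(v, v).sqrt();
    v.iter().map(|x| x / n).collect()
}

// Scale each unit vector by sqrt(weight), then concatenate.
fn combined(fields: &[(Vec<f64>, f64)]) -> Vec<f64> {
    fields
        .iter()
        .flat_map(|(v, w)| v.iter().map(move |x| x * w.sqrt()).collect::<Vec<_>>())
        .collect()
}

fn main() {
    let (a1, a2) = (normalize(&[1.0, 2.0]), normalize(&[0.5, -1.0]));
    let (b1, b2) = (normalize(&[2.0, 1.0]), normalize(&[1.0, 1.0]));
    let (w1, w2) = (0.7, 0.3);

    let lhs = dot(
        &combined(&[(a1.clone(), w1), (a2.clone(), w2)]),
        &combined(&[(b1.clone(), w1), (b2.clone(), w2)]),
    );
    // For unit vectors, per-field cosine is just the dot product.
    let rhs = w1 * dot(&a1, &b1) + w2 * dot(&a2, &b2);
    assert!((lhs - rhs).abs() < 1e-12);
}
```

The sqrt matters because each weight appears once on the A side and once on the B side of the dot product; their product recovers the plain weight.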

WAL: Crash Recovery

src/state/upsert_log.rs
pub fn append(&self, event: &WalEvent) -> io::Result<()> {
    let wrapped = Timestamped {
        ts: iso8601_now(),
        event,
    };
    let line = serde_json::to_string(&wrapped)?;
    let mut w = self.writer.lock()?;
    w.write_all(line.as_bytes())?;
    w.write_all(b"\n")?;
    Ok(())
}
PLAIN ENGLISH

Wrap the event with an ISO timestamp.

Serialize to JSON, then append one line to the log file under a mutex.

On restart, the WAL is replayed to recover all upserts and crossmap changes since the last clean shutdown.
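The replay side is a line-oriented read of the NDJSON file. A minimal sketch (a real replay would deserialize each line into a WalEvent and re-apply it; this only shows the read loop shape):

```rust
use std::io::{BufRead, BufReader, Cursor};

// Count recoverable events in a WAL stream: one JSON object per line,
// blank lines ignored. Real replay would parse and apply each event.
fn count_wal_lines<R: BufRead>(reader: R) -> usize {
    reader
        .lines()
        .filter_map(Result::ok)
        .filter(|l| !l.trim().is_empty())
        .count()
}

fn main() {
    let wal = "{\"ts\":\"2024-01-01T00:00:00Z\",\"event\":\"upsert\"}\n\
               {\"ts\":\"2024-01-01T00:00:01Z\",\"event\":\"claim\"}\n";
    assert_eq!(count_wal_lines(BufReader::new(Cursor::new(wal))), 2);
}
```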

Three-Layer Cache Invalidation

1
Spec-hash in filename

Field names + weights + quantization are hashed into the cache filename. Any config change makes the old path unreachable.

2
Manifest sidecar

JSON file recording model name, blocking hash, record count, source fingerprint. Catches model swaps.

3
Per-record text-hash diff

O(N) comparison with no ONNX calls. Only changed records get re-encoded. Unchanged records keep their cached vectors.

Together, these give near-instant warm starts with automatic stale-cache detection. No manual cache busting needed.

Why does the CrossMap use a single RwLock over two HashMaps instead of two DashMaps?

07

The Big Picture

Tech stack, feature flags, and where to start contributing

The Tech Stack

axumHTTP framework. Type-safe routing, JSON extraction, state via Arc<Session>.
tokioAsync runtime. CPU-bound work via spawn_blocking. Graceful shutdown via select!.
fastembed + ortONNX Runtime wrapper. Downloads models, tokenizes text, runs neural network inference.
usearchHNSW approximate nearest neighbor index. O(log N) vector search. Feature-gated.
dashmapConcurrent HashMap. Lock-free reads, per-shard writes. Used for record storage and the BM25 index.
rayonData parallelism. Batch mode scores B records across all CPU cores simultaneously.
rusqliteSQLite bindings. Alternative to DashMap for durable storage at million-record scale.
rapidfuzzEdit distance library. Provides the core Levenshtein ratio for fuzzy matching.

Feature Flags

usearch

Enables the HNSW vector index backend. Without this, only the O(N) flat scan is available. Required for production workloads.

parquet-format

Enables loading datasets from Apache Parquet files. Adds parquet and arrow crate dependencies.

simd

Hardware-accelerated dot product via SimSIMD. Uses NEON (ARM), AVX2/AVX-512 (x86) for faster vector comparisons.

Standard production build: cargo build --release --features usearch

Where to Start Contributing

1
Read vault/project_overview.md

The single source of truth. Architecture, module map, current work, backlog.

2
Run the tests

cargo test --all-features — all 396 tests should pass in under 1 second.

3
Run a benchmark

python3 benchmarks/live/10kx10k_inject3k_usearch/warm/run_test.py — see the engine in action.

4
Trace an upsert

Start at api/handlers.rs::upsert_handler and follow the calls through Session → Pipeline → CrossMap.

5
Check the backlog

vault/todo.md has the task list. Pick something that interests you.

A user disables blocking on a 100k-record dataset and notices BM25 candidate generation slows down significantly. Based on what you've learned, why does the system handle this gracefully despite the large candidate pool?

08

See It In Action

How to build, run benchmarks, and watch Melder match records in your terminal

Building Melder

You need Rust installed (rustup.rs). Then from the project root:

TERMINAL
# Standard production build
cargo build --release --features usearch

# With all features
cargo build --release --features usearch,parquet-format

# Run the tests
cargo test --all-features
WHAT EACH DOES

Build the optimized binary with the HNSW vector index. This is the build you want for any real workload. Output: ./target/release/meld

Also enable Parquet dataset loading (adds ~30s compile time for the arrow crate).

Run all 396 tests. Should complete in under 1 second. Always run this before committing.

Batch Mode: Process Everything At Once

Batch benchmarks live in benchmarks/batch/. Each has a cold/ run (builds indices from scratch) and a warm/ run (loads cached indices).

1
Run a single batch benchmark

python3 benchmarks/batch/10kx10k_usearch/warm/run_test.py

Loads 10k records on each side, matches them in parallel, and prints throughput + timing. First run builds the embedding cache (~17s); second run loads from cache (~0.3s).

2
Check the output files

Look at benchmarks/batch/10kx10k_usearch/warm/output/ for results.csv (confirmed matches), review.csv (borderline pairs), and unmatched.csv (no match found).

3
Run all batch benchmarks

python3 benchmarks/batch/run_all_tests.py

Runs every batch benchmark (cold then warm) and prints a summary table. Budget ~45-60 minutes on Apple Silicon.

Live Mode: Real-Time HTTP Matching

Live benchmarks start an HTTP server, inject records via API, and measure throughput and latency. They live in benchmarks/live/.

1
Quick smoke test

python3 benchmarks/scripts/smoke_test.py --binary ./target/release/meld --config benchmarks/live/10kx10k_inject3k_usearch/warm/config.yaml

Starts the server, sends 10 upserts, prints each response with latency, then stops. Use this to verify everything works before running longer tests.

2
Run a concurrent injection test

python3 benchmarks/live/10kx10k_inject3k_usearch/warm/run_test.py

Injects 3,000 records with 10 concurrent workers. Prints per-operation latencies (p50/p95/p99) and overall throughput. Expect ~1,500 req/s on Apple Silicon.

3
Try the high-volume benchmark

python3 benchmarks/live/10kx10k_inject50k_usearch/warm/run_test.py

50,000 injections with CPU/GPU monitoring. Uses the production scoring config (Arctic-embed-xs + BM25). Prints a resource utilization summary at the end.

Try It Manually

You can also start the server yourself and send requests with curl:

TERMINAL 1
# Start the server
./target/release/meld serve \
  --config benchmarks/live/10kx10k_inject3k_usearch/warm/config.yaml \
  --port 8090
WHAT HAPPENS

Loads 10k records per side, builds (or loads cached) embedding indices, starts listening on port 8090. First cold start takes ~18s; subsequent warm starts ~1-2s.

TERMINAL 2
# Send a B-side record
curl -s -X POST http://localhost:8090/upsert \
  -H 'Content-Type: application/json' \
  -d '{
    "side": "B",
    "record": {
      "counterparty_id": "CP-TEST-001",
      "counterparty_name": "Goldman Sachs Intl",
      "domicile": "GB"
    }
  }' | python3 -m json.tool
WHAT YOU GET BACK

The response JSON includes the match status, the best matches with scores and per-field breakdowns, and the classification (auto/review/no_match). Try different names and watch how the scores change!

What the Benchmarks Measure

Throughput

Requests per second. The headline number. Production target: 1,000+ req/s at 10k records per side.

Latency (p50/p95/p99)

How long each request takes. p50 = median, p95 = 95th percentile (tail), p99 = worst 1%. For live mode, p50 should be under 10ms.

Operation Mix

new_a/new_b (inserts), upd_a_emb/upd_b_emb (embedding field changes), upd_a_field/upd_b_field (non-embedding changes). Encoding ops are slower because they need ONNX inference.

Resource Usage

The 50k benchmark monitors CPU (per-process and per-core), GPU, and memory. Look for whether the bottleneck is CPU (encoding), GPU, or lock contention.

📁
Benchmark Directory Layout

Each benchmark is self-contained: config.yaml (scoring config), run_test.py (test runner), cache/ (embedding indices), output/ (results CSVs), wal/ (write-ahead log). The helper scripts in benchmarks/scripts/ can also connect to a server you started manually with --no-serve.