Enroll Mode
Enroll mode runs a single-pool entity resolution server. Instead of matching records between two distinct datasets (A vs B), records are enrolled into one growing pool and scored against everything already there. The result is a list of scored edges — ready for external graph clustering or deduplication.
When to use enroll mode
- You have one dataset and want to find duplicates within it
- You’re building an entity resolution graph where the output is edges (not 1:1 pairs)
- There is no canonical reference master — all records are peers
- Records arrive one at a time and need to be resolved against everything seen so far
For two-sided matching (reference master A vs query stream B), use live mode instead.
Starting the server
meld enroll --config enroll_config.yaml --port 8090
| Flag | Short | Description |
|---|---|---|
| --config | -c | Path to YAML config file (required) |
| --port | -p | TCP port to listen on (default: 8080) |
Configuration
Enroll mode uses a simplified config format — single field names
instead of field_a/field_b pairs, a single dataset: instead of
datasets.a/datasets.b, and no cross_map section.
job:
  name: entity_resolution
# Optional: pre-load a reference pool at startup. No edges are
# generated for these records. Subsequent /enroll calls score against
# this pool and add to it.
dataset:
  path: reference_entities.csv
  id_field: entity_id
embeddings:
  model: melder/arctic-embed-xs-entity-resolution
  cache_dir: cache/embeddings
blocking:
  enabled: true
  operator: and
  fields:
    - { field: country_code }
match_fields:
  - { field: legal_name, method: embedding, weight: 0.55 }
  - { field: short_name, method: fuzzy, scorer: partial_ratio, weight: 0.20 }
  - { field: country_code, method: exact, weight: 0.20 }
  - { field: lei, method: exact, weight: 0.05 }
thresholds:
  auto_match: 0.85
  review_floor: 0.60
performance:
  encoder_pool_size: 4
vector_backend: flat
top_n: 20
Key differences from live mode config
| Aspect | Live mode | Enroll mode |
|---|---|---|
| Fields | field_a + field_b | field |
| Datasets | datasets.a + datasets.b | dataset (singular, optional) |
| Embedding cache | a_cache_dir + b_cache_dir | cache_dir |
| Crossmap | Required (cross_map section) | Not used |
| Output paths | Required (output section) | Not used |
Endpoints
All endpoints are under /api/v1/. Only enroll-specific endpoints are
mounted — the A/B, crossmap, and review endpoints from live mode are
not available.
| Method | Path | Description |
|---|---|---|
| POST | /enroll | Score a record against the pool, add it, return edges |
| POST | /enroll-batch | Score and enroll multiple records sequentially |
| POST | /enroll/remove | Remove a record from the pool |
| GET | /enroll/query?id=X | Look up a record by ID |
| GET | /enroll/count | Number of records in the pool |
| POST | /exclude | Exclude a pair (known non-match) |
| DELETE | /exclude | Remove an exclusion |
| GET | /health | Health check |
| GET | /status | Server status (uptime, enrollment count) |
Enrolling a record
When you enroll a record, Melder:
- Encodes it (embedding vector)
- Searches the pool for candidates (ANN, BM25, blocking)
- Scores candidates on all match fields
- Adds the record to the pool (store, vector index, BM25, blocking)
- Returns scored edges at or above review_floor, capped at top_n
The record is added after scoring, so it never matches itself.
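The ordering described above (score first, add second) can be illustrated with a toy in-memory pool. This is only a sketch of the semantics, not Melder's implementation: `ToyPool` and `same_country` are hypothetical names, the scorer is a stand-in for the real weighted field scoring, and candidate search, blocking and upsert handling are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


@dataclass
class ToyPool:
    """In-memory stand-in for the enroll pool (illustration only)."""

    score: Callable[[Record, Record], float]  # hypothetical pairwise scorer
    review_floor: float = 0.60
    top_n: int = 20
    records: Dict[str, Record] = field(default_factory=dict)

    def enroll(self, record_id: str, record: Record) -> List[Tuple[str, float]]:
        # Score the new record against everything already in the pool.
        scored = [
            (other_id, self.score(record, other))
            for other_id, other in self.records.items()
        ]
        # Keep edges at or above review_floor, best first, capped at top_n.
        kept = [edge for edge in scored if edge[1] >= self.review_floor]
        edges = sorted(kept, key=lambda edge: edge[1], reverse=True)[: self.top_n]
        # Add the record only after scoring, so it cannot match itself.
        self.records[record_id] = record
        return edges


def same_country(a: Record, b: Record) -> float:
    """Toy scorer: 1.0 when the country codes agree, else 0.0."""
    return 1.0 if a.get("country_code") == b.get("country_code") else 0.0
```

With `pool = ToyPool(score=same_country)`, the first `pool.enroll(...)` call returns an empty list because the pool is still empty, and a second record with the same country code returns one edge pointing at the first.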
Request
curl -X POST http://localhost:8090/api/v1/enroll \
  -H 'Content-Type: application/json' \
  -d '{
    "record": {
      "entity_id": "ent_003",
      "legal_name": "Goldman Sachs International",
      "country_code": "GB",
      "lei": "W22LROWP2IHZNBB6K528"
    }
  }'
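For callers that prefer Python to curl, the same request could be built with the standard library alone. This is a minimal sketch: `build_enroll_request` and `enroll` are hypothetical helper names, and the base URL assumes the server was started with `--port 8090` as in the example above.

```python
import json
import urllib.request

# Assumption: server running locally, started with --port 8090.
BASE_URL = "http://localhost:8090/api/v1"


def build_enroll_request(record: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /enroll request for one record."""
    body = json.dumps({"record": record}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/enroll",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def enroll(record: dict, base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_enroll_request(record, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running server.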
Response
{
  "id": "ent_003",
  "enrolled": true,
  "edges": [
    {
      "id": "ent_001",
      "score": 0.94,
      "field_scores": [
        { "field": "legal_name", "method": "embedding", "score": 0.97, "weight": 0.55 },
        { "field": "country_code", "method": "exact", "score": 1.0, "weight": 0.20 },
        { "field": "lei", "method": "exact", "score": 1.0, "weight": 0.05 }
      ]
    },
    {
      "id": "ent_002",
      "score": 0.67,
      "field_scores": [
        { "field": "legal_name", "method": "embedding", "score": 0.72, "weight": 0.55 },
        { "field": "country_code", "method": "exact", "score": 1.0, "weight": 0.20 },
        { "field": "lei", "method": "exact", "score": 0.0, "weight": 0.05 }
      ]
    }
  ]
}
The edges array contains all matches scoring at or above review_floor,
up to top_n results, sorted by score descending. There is no
classification field — the caller decides what to do with the scores.
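Since the response carries no classification, any bucketing happens on the caller's side. One possible policy, sketched below, reuses the 0.85 value from the sample config as the caller's own cut-off; `bucket_edges` is a hypothetical helper, and the cut-off is entirely the caller's choice.

```python
from typing import Dict, List


def bucket_edges(response: dict, auto_match: float = 0.85) -> Dict[str, List[str]]:
    """Split returned edge IDs into 'auto' and 'review' buckets by score."""
    buckets: Dict[str, List[str]] = {"auto": [], "review": []}
    for edge in response.get("edges", []):
        # Everything returned is already at or above review_floor,
        # so anything below the caller's cut-off goes to review.
        key = "auto" if edge["score"] >= auto_match else "review"
        buckets[key].append(edge["id"])
    return buckets


# Scores taken from the sample response above (field_scores omitted).
sample = {
    "id": "ent_003",
    "enrolled": True,
    "edges": [
        {"id": "ent_001", "score": 0.94},
        {"id": "ent_002", "score": 0.67},
    ],
}
print(bucket_edges(sample))  # {'auto': ['ent_001'], 'review': ['ent_002']}
```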
Batch enrollment
POST /enroll-batch enrolls multiple records in a single request.
Records are processed sequentially — record N is scored against the
pool including records 1..N-1 from the same batch. This means
intra-batch edges are discovered.
Request
curl -X POST http://localhost:8090/api/v1/enroll-batch \
  -H 'Content-Type: application/json' \
  -d '{
    "records": [
      { "entity_id": "ent_004", "legal_name": "Morgan Stanley & Co", "country_code": "US" },
      { "entity_id": "ent_005", "legal_name": "MS International", "country_code": "US" }
    ]
  }'
Response
{
  "results": [
    {
      "id": "ent_004",
      "enrolled": true,
      "edges": []
    },
    {
      "id": "ent_005",
      "enrolled": true,
      "edges": [
        { "id": "ent_004", "score": 0.78, "field_scores": [...] }
      ]
    }
  ]
}
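A batch response of this shape can be flattened into a plain edge list for downstream processing. The sketch below assumes only the fields shown in the sample response; `flatten_edges` is a hypothetical helper name.

```python
from typing import List, Tuple

Edge = Tuple[str, str, float]  # (enrolled ID, matched pool ID, score)


def flatten_edges(batch_response: dict) -> List[Edge]:
    """Turn a /enroll-batch response into (source, target, score) tuples."""
    edges: List[Edge] = []
    for result in batch_response.get("results", []):
        for edge in result.get("edges", []):
            edges.append((result["id"], edge["id"], edge["score"]))
    return edges


# Sample response from above, with field_scores omitted.
batch_response = {
    "results": [
        {"id": "ent_004", "enrolled": True, "edges": []},
        {"id": "ent_005", "enrolled": True, "edges": [{"id": "ent_004", "score": 0.78}]},
    ]
}
print(flatten_edges(batch_response))  # [('ent_005', 'ent_004', 0.78)]
```

The single tuple is the intra-batch edge: ent_005 was scored against a pool that already contained ent_004.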
Initial dataset
The optional dataset: section in the config pre-loads records into the
pool at startup. No edges are generated for these records — they are
loaded directly into the store and indices. This is the “known entities”
reference set.
Edges only come from subsequent /enroll or /enroll-batch calls. If
you need all-pairs edges within the initial dataset, use meld run
with the same dataset on both sides, or call /enroll-batch after
startup.
No crossmap
Enroll mode has no crossmap. A record can have edges to multiple pool members. There is no 1:1 constraint and no auto-match/review/no-match classification. The caller is responsible for deciding which connections are real matches — typically via graph clustering.
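As one example of what the caller might do with the edges, the sketch below groups records into connected components with a small union-find. This is a deliberately simple clustering choice and not something Melder prescribes: `cluster` is a hypothetical function, and the 0.85 threshold is an assumption borrowed from the sample config.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, float]  # (source ID, target ID, score)


def cluster(edges: Iterable[Edge], threshold: float = 0.85) -> List[List[str]]:
    """Group IDs into connected components over edges scoring >= threshold."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target, score in edges:
        # find() registers both IDs, so weakly linked records
        # still show up as singleton clusters.
        root_s, root_t = find(source), find(target)
        if score >= threshold and root_s != root_t:
            parent[root_t] = root_s

    groups: Dict[str, List[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


edges = [
    ("ent_003", "ent_001", 0.94),
    ("ent_003", "ent_002", 0.67),
    ("ent_005", "ent_004", 0.78),
]
print(cluster(edges))
# [['ent_001', 'ent_003'], ['ent_002'], ['ent_004'], ['ent_005']]
```

Connected components are transitive, so a chain of individually strong edges can merge records that never scored well against each other directly. Stricter clustering methods avoid this at the cost of more complexity.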
Concurrency
Multiple clients can call /enroll simultaneously. Records are added
to the pool as they are processed, so concurrent enrollments may or may
not see each other depending on timing. This is non-deterministic by
design — the same behaviour as live mode’s concurrent add calls.
If deterministic edge discovery is required, serialize calls or use
/enroll-batch.
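Within a single client process, serializing calls can be as simple as wrapping the send function in a lock. The sketch below assumes a caller-supplied `send` callable that performs the actual HTTP request; `SerializedEnroller` is a hypothetical name. Note that this does not coordinate separate processes or machines.

```python
import threading
from typing import Callable


class SerializedEnroller:
    """Ensure at most one enroll call from this process is in flight."""

    def __init__(self, send: Callable[[dict], dict]) -> None:
        self._send = send  # e.g. a function that POSTs to /enroll
        self._lock = threading.Lock()

    def enroll(self, record: dict) -> dict:
        # Hold the lock for the full round trip, so the next record is
        # only sent once the previous one has been scored and added.
        with self._lock:
            return self._send(record)
```

Threads sharing one `SerializedEnroller` instance then enroll one record at a time, in the order they acquire the lock.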
Persistence
The pool persists across restarts using the same WAL mechanism as live mode. On shutdown, the embedding index cache is saved to disk. On restart, the WAL is replayed and the cache is reloaded.
Hooks
Hooks work in enroll mode. Since there is no crossmap or classification,
only on_nomatch fires, which happens when an enrolled record has no edges
at or above review_floor. See Hooks for configuration and examples.
Upsert semantics
Enrolling a record with an ID that already exists replaces the existing record, re-scores against the pool, and returns new edges.
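A caller that keeps its own copy of the edges may want to mirror this: when an ID is re-enrolled, replace its stored edges rather than appending to them. The sketch below is caller-side bookkeeping only. `EdgeStore` is a hypothetical class, and it assumes the edges from the latest response for an ID supersede the earlier ones.

```python
from typing import Dict, List, Tuple


class EdgeStore:
    """Caller-side record of the latest edges returned per enrolled ID."""

    def __init__(self) -> None:
        self._by_source: Dict[str, List[Tuple[str, float]]] = {}

    def record_response(self, response: dict) -> None:
        # Overwrite, not append: a re-enrolled ID comes back with new edges.
        self._by_source[response["id"]] = [
            (edge["id"], edge["score"]) for edge in response.get("edges", [])
        ]

    def edges(self) -> List[Tuple[str, str, float]]:
        """Return all stored edges as (source, target, score) tuples."""
        return [
            (source, target, score)
            for source, targets in self._by_source.items()
            for target, score in targets
        ]
```

Edges recorded earlier for other IDs that point at the replaced record are left untouched here. Whether they still hold after the replacement is a question this sketch does not answer.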