
Salesforce dedupe for teams that want to inspect the mess before they merge it

The buyer problem is not subtle. Duplicate leads keep reappearing, reps merge the wrong contacts, two account records fight over ownership, and the team does not trust the duplicate warnings Salesforce shows at save time. Native duplicate rules help with some entry-time alerts. They do not give you a reviewable cluster queue, a supervised merge plan, or receipts you can verify later.

The Gremlin pattern is audit-first dedupe: preflight the org, cluster probable duplicates with blocking-first matching, route the clusters to human review, then apply only approved operations with a receipt file and a follow-up verify step.

The three things buyers actually need

Most duplicate-cleanup projects fail because the team jumps from "we have duplicates" straight to "run the merge." The safer sequence is: investigate how the duplicates cluster, decide what belongs together, then execute under guardrails.

Audit before you merge

Preflight checks org identity, converted LeadStatus values, duplicate rules, and whether lead conversion can execute cleanly.

Cluster before you decide

The planner groups duplicate candidates into clusters with anchors, confidence, a recommended action, and a suggested survivor.

Operators approve the plan

Review happens in CSV or Sheets with approval_status, reviewer_notes, and override_master_id, not behind a hidden score threshold.

Every apply leaves evidence

Dry-runs and live applies can emit receipt files with plan digest, operation counts, skips, failures, and successes.

Why Salesforce duplicate rules stop early

Salesforce duplicate management is built around matching rules and duplicate rules. That is useful for spotting possible duplicates when a record is saved. It is not the same thing as clustering a messy CRM offline and preparing a supervised merge run.

The native model is record-save oriented: a user creates or updates a record, Salesforce checks matching rules, then a duplicate rule decides whether to alert or block. That catches some obvious cases, but it does not hand you a durable review queue with suggested survivors, cluster-level notes, and apply receipts.

Salesforce also makes you think in rules, not in clusters. You can have multiple rules on an object, but the engine still evaluates candidate duplicates through matching-rule logic and duplicate-record-set output. That is a different operating surface from "here are the eight records that collapse into one real customer, now approve or hold the cluster."

Gremlin takes the opposite route. It plans first. Blocking anchors shrink the search space, pair scoring creates duplicate edges, union-find turns those edges into clusters, and the human review queue becomes the real decision point.
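The edges-to-clusters step in that pipeline is standard union-find. A minimal stdlib sketch of the idea (the record IDs and edge list are illustrative, not Gremlin internals):

```python
def find(parent, x):
    # Path-halving find: walk to the root, shortening the chain as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(edges):
    """Turn scored duplicate edges into connected-component clusters."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra  # union the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), []).append(node)
    return sorted(sorted(g) for g in groups.values())

# Three pairwise edges collapse five records into two clusters.
edges = [("003A", "003B"), ("003B", "003C"), ("00QX", "00QY")]
print(cluster(edges))  # [['003A', '003B', '003C'], ['00QX', '00QY']]
```

The point of doing this offline is that the cluster, not the pair, becomes the unit a human approves.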

Audit-first difference

Alert-or-block is not cluster review

Native duplicate rules surface possible duplicates at save time.
Gremlin surfaces clusters, survivors, anchors, review notes, and approvals.
Native rules do not give you dry-run apply receipts and verify steps.
If lead dedupe touches conversion or routing behavior, inspect Lead Status before you run merges.

The four-phase loop in shipped code

This is the actual flow in the Salesforce dedupe path: preflight, cluster, review, then merge and verify.

Audit and preflight

Check org identity, converted LeadStatus values, duplicate rules, and anonymous Apex support before planning or applying anything.

Cluster

Run g-gremlin dedup enterprise-plan to build a MergePlan/v2 with blocking-first candidate generation, cluster types, recommended action, survivor, anchors, and review notes.

Review

Use --cluster-review-output to hand operators a cluster queue with approval_status, override_master_id, reviewer_notes, member IDs, confidence, and golden-record suggestions.

Merge and verify

Dry-run g-gremlin sfdc merge-apply-plan by default, then execute only approved operations with --apply, receipts, resumable state, and follow-up verify checks.

Plan the clusters

Real command names, not pseudocode.

g-gremlin dedup enterprise-plan \
  --source Account=accounts.csv \
  --source Contact=contacts.csv \
  --source Lead=leads.csv \
  --profile b2b_person \
  --output plan.json \
  --review-output review_rows.csv \
  --cluster-review-output review_clusters.csv \
  --overwrite

Dry-run, then apply

The CLI defaults to preview mode until --apply is present.

g-gremlin sfdc merge-apply-plan \
  --plan plan.json \
  --approval-file review_clusters.csv \
  --receipt-file receipt.json
g-gremlin sfdc merge-apply-plan \
  --plan plan.json \
  --approval-file review_clusters.csv \
  --receipt-file receipt.json \
  --state-file state.json \
  --apply

What blocking-first means in practice

Blocking-first means Gremlin does not start with an all-pairs comparison across the whole CRM export. It first groups records by high-signal anchors, then scores pairs within those candidate components. That keeps the review set smaller and makes the evidence easier to explain.

email

Exact email anchor used to seed high-confidence candidate components.

phone

Exact phone anchor after digit normalization, typically requiring at least seven digits.

domain_name

Normalized company domain plus normalized name, excluding consumer domains, to catch business-email duplicates without a full scan.

company_name

Normalized company plus normalized name, excluding noisy company values, to catch rows that share employer identity but not a clean email anchor.
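Under stated assumptions (illustrative record dicts, simplified normalization, a toy consumer-domain list; not Gremlin's actual blocking code), anchor-first grouping looks like this:

```python
import re
from collections import defaultdict

CONSUMER_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}  # illustrative list

def anchor_keys(record):
    """Yield (anchor_name, key) pairs for one record dict."""
    email = (record.get("email") or "").strip().lower()
    if email:
        yield ("email", email)
    digits = re.sub(r"\D", "", record.get("phone") or "")
    if len(digits) >= 7:  # mirror the seven-digit minimum described above
        yield ("phone", digits[-10:])
    name = (record.get("name") or "").strip().lower()
    domain = email.split("@")[-1] if "@" in email else ""
    if name and domain and domain not in CONSUMER_DOMAINS:
        yield ("domain_name", f"{domain}|{name}")

def block(records):
    """Group record indexes by anchor so pair scoring stays local."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        for anchor, key in anchor_keys(rec):
            blocks[(anchor, key)].append(i)
    # Only blocks with 2+ members produce candidate pairs.
    return {k: v for k, v in blocks.items() if len(v) > 1}

records = [
    {"email": "pat@acme.com", "phone": "(555) 010-7788", "name": "pat lee"},
    {"email": "pat@acme.com", "phone": "555.010.7788", "name": "patricia lee"},
    {"email": "sam@gmail.com", "phone": "12345", "name": "sam"},
]
print(block(records))  # email and phone anchors each group records 0 and 1
```

The gmail record falls out of the domain_name anchor entirely, which is the design: a consumer domain plus a name is weak evidence, so it never seeds a candidate block.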

Algorithm families available in code

The enterprise planner leans on ensembles by field, while the core engine exposes a broader list of algorithm families. This is why the right framing is exact anchors plus multiple scoring algorithms, not one magic match rule.

Email ensemble

Exact, JaroWinkler, WRatio

Phone ensemble

Exact, JaroWinkler

Domain ensemble

TokenSetRatio, WRatio, JaroWinkler

Name and company ensemble

WRatio, TokenSetRatio, PartialRatio

Core engine families

WRatio, QRatio, Ratio, PartialRatio, TokenSetRatio, TokenSortRatio, Jaro, JaroWinkler, Soundex, Exact, Domain, ExactDomain
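The ensemble idea can be sketched with stdlib stand-ins (difflib for Ratio-style scoring and a toy TokenSetRatio; these are illustrative substitutes, not the planner's real scorers):

```python
import difflib

def ratio(a, b):
    """Ratio-style scorer on a 0-100 scale (difflib stand-in)."""
    return difflib.SequenceMatcher(None, a, b).ratio() * 100

def token_set_ratio(a, b):
    """Toy TokenSetRatio: compare intersection and remainder token strings."""
    ta, tb = set(a.split()), set(b.split())
    inter = " ".join(sorted(ta & tb))
    sa = (inter + " " + " ".join(sorted(ta - tb))).strip()
    sb = (inter + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(inter, sa), ratio(inter, sb), ratio(sa, sb))

def exact(a, b):
    return 100.0 if a == b else 0.0

# One ensemble per field, echoing the families listed above.
ENSEMBLES = {
    "email": [exact, ratio],
    "name": [ratio, token_set_ratio],
}

def score_field(field, a, b):
    """Max over the field's ensemble; a real planner might weight instead."""
    a, b = a.strip().lower(), b.strip().lower()
    return max(fn(a, b) for fn in ENSEMBLES[field])

print(score_field("email", "Pat.Lee@acme.com", "pat.lee@acme.com"))  # 100.0
print(score_field("name", "patricia lee", "lee patricia"))           # 100.0
```

Note what the ensemble buys you: the name pair above fails a plain character-level Ratio because the tokens are reordered, but the token-set scorer still reports a perfect match.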

What the operator actually reviews

The review surface is cluster-first. That matters because people do not approve merges one row at a time in a vacuum. They approve a cluster, inspect the anchor evidence, accept or reject the recommended action, and optionally override the survivor.

Review queue fields

cluster_id

Stable cluster key used in review queues, apply operations, skips, and receipts.

cluster_type

same_object, lead_to_contact, or cross_object.

recommended_action

merge or lead_to_contact_resolution based on object mix inside the cluster.

anchors

The blocking evidence that pulled the records together in the first place.

merge_confidence

Heuristic confidence derived from anchors such as email, phone, domain_name, or company_name.

review_notes

Human-readable cautions such as shared inbox risk, missing exact anchors, large clusters, or mixed Lead/Contact review.

approval_status

approved, hold, rejected, or pending once the operator fills out the queue.

override_master_id

Optional operator override for the survivor record.
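The cluster_type and recommended_action fields compose into a small decision. A sketch of that mapping based on the field descriptions above (the exact rules and the cross_object default in the shipped planner are my assumptions):

```python
def classify_cluster(object_types):
    """Map the Salesforce objects in a cluster to (cluster_type, action).

    Mirrors the cluster_type / recommended_action descriptions above;
    the shipped planner's exact rules are not verified here.
    """
    kinds = set(object_types)
    if len(kinds) == 1:
        return ("same_object", "merge")
    if kinds == {"Lead", "Contact"}:
        return ("lead_to_contact", "lead_to_contact_resolution")
    return ("cross_object", "merge")  # assumption: default to supervised merge

print(classify_cluster(["Contact", "Contact"]))  # ('same_object', 'merge')
print(classify_cluster(["Lead", "Contact"]))
```

Either way, the output is advisory: the operator's approval_status and override_master_id have the final say.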

Example cluster queue row

Gremlin exports CSV for review, then the same file can gate apply.

cluster_id,cluster_type,recommended_action,merge_confidence,golden_record_id,anchors,review_notes,approval_status,override_master_id
cluster_0042,same_object,merge,0.99,003xx00001AAA,email,"Exact email but names diverge; verify shared inbox",approved,
cluster_0049,lead_to_contact,lead_to_contact_resolution,0.95,003xx00001BBB,phone|company_name,"Mixed Lead and Contact cluster; convert only after human review",hold,
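Gating apply on that file is simple to reason about. A hedged sketch of the consuming side (column names come from the example above; the filtering and survivor-precedence logic is illustrative, not Gremlin's implementation):

```python
import csv
import io

APPROVAL_CSV = """\
cluster_id,cluster_type,recommended_action,merge_confidence,golden_record_id,anchors,review_notes,approval_status,override_master_id
cluster_0042,same_object,merge,0.99,003xx00001AAA,email,"Exact email but names diverge",approved,
cluster_0049,lead_to_contact,lead_to_contact_resolution,0.95,003xx00001BBB,phone|company_name,"Convert only after review",hold,
"""

def approved_clusters(text):
    """Return {cluster_id: survivor_id} for approved rows only.

    override_master_id beats golden_record_id when the operator set one.
    """
    out = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["approval_status"].strip().lower() != "approved":
            continue  # hold / rejected / pending clusters never reach apply
        survivor = row["override_master_id"].strip() or row["golden_record_id"]
        out[row["cluster_id"]] = survivor
    return out

print(approved_clusters(APPROVAL_CSV))  # {'cluster_0042': '003xx00001AAA'}
```

The hold row silently drops out, which is the contract: anything less than an explicit approval stays untouched.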

Reason surfaces you can talk about

anchors_applied

The matched anchor used during blocking or pair scoring, surfaced on pair outputs and cluster payloads.

Email->Email

Pair-level evidence for exact email alignment when that signal is present.

Name->Name

Pair-level name similarity evidence when name similarity clears threshold.

Company->Company

Pair-level evidence for exact company alignment on Lead comparisons.

evidence_strength and evidence_flags

Lower-level pair outputs can carry a rollup score plus flags such as email_exact, name_strong, company_match, or ensemble_signals.

Receipt fields that matter

command and plan digest

The receipt stores the command name, plan path, plan digest, and approval-file digest when approvals are used.

execution posture

dry_run, resume, org alias, and optional state-file path.

operation evidence

All skipped, succeeded, and failed operations are serialized into the receipt payload.

verification path

Verification is separate: merge verify checks master/victim state, and convert-lead verify checks converted status and resulting records.

{
  "version": 1,
  "command": "g-gremlin sfdc merge-apply-plan",
  "plan": "plan.json",
  "plan_digest": "8dd3...",
  "approval_file": "review_clusters.csv",
  "dry_run": true,
  "operations_planned": 6,
  "operations_skipped": 2,
  "operations_succeeded": 0,
  "operations_failed": 0
}
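One thing the receipt enables is after-the-fact verification that the plan on disk is the plan that was applied. A sketch, assuming the digest is a hex SHA-256 of the plan file bytes (the hashing scheme and truncation handling are my assumptions, not a documented Gremlin contract):

```python
import hashlib
import json

def plan_digest(plan_path):
    """Hex SHA-256 of the plan file bytes (assumed digest scheme)."""
    with open(plan_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def receipt_matches_plan(receipt_path):
    """True when the receipt's plan_digest matches the plan on disk."""
    with open(receipt_path) as f:
        receipt = json.load(f)
    digest = plan_digest(receipt["plan"])
    stored = receipt["plan_digest"]
    # Receipts may store a truncated digest; compare on the shorter length.
    return digest.startswith(stored) or stored.startswith(digest)

# Hypothetical usage:
#   receipt_matches_plan("receipt.json")
```

If someone edits plan.json between dry-run and apply, this check fails, which is exactly the kind of drift a receipt exists to catch.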

When vendors are the right answer

Dedupely, Plauti, and Cloudingo are not the enemy here. They are often the better pick when you want broad in-org cleanup, larger admin-facing data quality programs, or native object coverage beyond a supervised plan-review-apply loop. The Gremlin advantage is not that other tools cannot dedupe. It is that the architecture starts with audit and human review.

Salesforce duplicate rules

Best for

Entry-time alerts and blocking on fields you can encode into matching rules.

Good at

Stopping some obvious duplicates before save and creating duplicate record sets for review.

Where the difference shows up

They are rule-driven and record-save oriented, not offline cluster planning with explicit survivor review, receipts, and staged execution.

Dedupely

Best for

Continuous duplicate control with customizable merge rules and filters across native and custom Salesforce objects.

Good at

Admin-friendly cleanup, merge controls, and ongoing duplicate maintenance inside the Salesforce data stack.

Where the difference shows up

The core mental model is in-app duplicate management, not an offline audit-first loop with plan artifacts, approval CSVs, receipts, and verify steps.

Plauti

Best for

Salesforce-native dedupe with review queues, auto-merge options, merge rules, and broader object coverage.

Good at

Accounts, Contacts, Leads, custom objects, large data volumes, and org-native operational workflows.

Where the difference shows up

Plauti is built to resolve duplicates in-platform. Gremlin wins when the buyer wants an export-plan-review-apply path with explicit human gating and CLI artifacts.

Cloudingo

Best for

Admin-led Salesforce cleanup, import dedupe, and broader data hygiene tasks with no-code filters and rules.

Good at

Find, merge, prevent, import, standardize, and bulk-clean records in Salesforce and adjacent import flows.

Where the difference shows up

It is a full data-cleaning toolchain, not a narrower audit-first cluster review workflow for supervised merge plans and post-run verification.

Gremlin audit-first dedupe

Best for

Operators who want to inspect duplicates before merging, review clusters in CSV or Sheets, and keep receipts on supervised apply.

Good at

Blocking-first planning, human approval queues, dry-runs, resumable execution, and Salesforce verify checks.

Where the difference shows up

The public workflow is strongest for Salesforce Contact and Lead dedupe. It is not the broadest native object-cleanup platform, and I did not verify a rollback command.

FAQ

How is this different from Salesforce duplicate rules?

Salesforce duplicate rules look for possible duplicates when a record is created or updated. Gremlin builds an offline merge plan first, groups duplicate rows into clusters, lets a human approve or reject each cluster, dry-runs the apply step by default, and can verify the results afterward. The difference is not just matching logic. It is the review and execution contract.

Does Gremlin auto-merge everything?

No. The shipped Salesforce flow is supervised. The planner writes review queues with approval_status and optional override_master_id. The apply command is dry-run by default and only executes live when you pass --apply. Complex mixed Lead and Contact clusters can still be skipped for manual review.

What can the Salesforce apply layer merge today?

The apply layer supports same-object Account, Contact, and Lead merges. It also supports a narrower Lead-to-Contact conversion path for approved mixed clusters when the cluster is exactly one Lead plus one Contact and a converted-status value is provided.

Is there an undo button?

I did not verify a shipped dedupe-specific rollback command in the Salesforce workflow. What exists today is dry-run by default, resumable state, receipts, and post-run verification. That is safer than blind bulk merge, but it is not the same thing as true undo.

When should I pick Plauti, Dedupely, or Cloudingo instead?

Pick those tools when the real job is broad in-org data cleanup, account and custom-object maintenance inside Salesforce, auto-merge rules, or admin-led bulk operations. Pick audit-first dedupe when the job is to inspect the duplicate problem first, route ambiguity to humans, and keep an explicit plan-review-apply-verify loop with artifacts outside the org.

Keep the conversation going

These pages are meant to help operators solve real problems. If you want the next guide, grab the low-friction option. If you need the implementation, not just the guide, book time.

Stay in the loop

Get the next guide when it ships

I publish architecture guides grounded in real implementations. No generic AI filler.

Use your work email so I can keep the list useful and relevant.

Book Mike directly

Need the implementation, not just the guide?

Book a 15-minute working session with Mike right on his calendar. Tooling, consulting, or a mix of both is fine.

Open Mike's calendar

If you want me to come in with context, leave your email and a short note before the call.

I'll route new requests into the internal website inquiries inbox so I can follow up fast.