Duplicate Detection at Scale AI Prompt
This prompt tackles duplicate detection in large tables where simple manual inspection is not enough. It combines exact duplicate checks, near-duplicate strategies, canonical record selection, and upstream prevention. It is particularly useful for high-volume datasets with compound keys and recurring deduplication problems.
Implement scalable duplicate detection for a large table ({{table_size}} rows) with a compound natural key.
Natural key: {{natural_key_columns}}
Table: {{table_name}}
1. Exact duplicate detection (fast):
```sql
SELECT {{natural_key_columns}}, COUNT(*) AS cnt
FROM {{table_name}}
GROUP BY {{natural_key_columns}}
HAVING COUNT(*) > 1
ORDER BY cnt DESC
LIMIT 100
```
- Run this query and report total duplicate groups, total excess rows, and a few sample groups
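The group-and-count pattern above can also be rolled up into a single summary query. A minimal sketch using SQLite as a stand-in warehouse, with a hypothetical `orders` table in place of `{{table_name}}` and `(customer_id, order_date)` in place of `{{natural_key_columns}}`:

```python
import sqlite3

# Illustrative in-memory example: "orders" stands in for {{table_name}},
# (customer_id, order_date) for {{natural_key_columns}}.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-01-01", 10.0),
     (1, "2024-01-01", 10.0),   # exact duplicate key
     (1, "2024-01-01", 12.0),   # same key, different payload
     (2, "2024-01-02", 20.0)],
)

# Duplicate groups and excess rows in one pass:
# excess = rows in duplicated groups minus one kept row per group.
row = conn.execute("""
    SELECT COUNT(*) AS dup_groups,
           COALESCE(SUM(cnt - 1), 0) AS excess_rows
    FROM (
        SELECT COUNT(*) AS cnt
        FROM orders
        GROUP BY customer_id, order_date
        HAVING COUNT(*) > 1
    )
""").fetchone()
print(row)  # (1, 2): one duplicate group with two excess rows
```

`excess_rows` is the number of rows that a deduplication pass would remove, which makes it a useful headline metric for the report.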
2. Near-duplicate detection for string keys:
- Phonetic matching: Soundex or Metaphone for name fields
- MinHash LSH for large-scale fuzzy deduplication (scales to billions of rows)
- Blocking: reduce comparison space by only comparing within the same first-letter group or zip code
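The blocking idea can be sketched in a few lines. This example uses the standard library's `SequenceMatcher` as a simple similarity stand-in (a production pipeline would more likely use Jaro-Winkler, trigram similarity, or MinHash signatures); the names and the first-letter-of-surname block key are illustrative:

```python
from collections import defaultdict
from difflib import SequenceMatcher

names = ["Jon Smith", "John Smith", "Jon Smyth", "Mary Jones", "Marie Jones", "Alan Turing"]

# Blocking: only compare names that share a cheap key (here: first letter
# of the surname), shrinking the O(n^2) comparison space to small blocks.
blocks = defaultdict(list)
for name in names:
    surname = name.split()[-1]
    blocks[surname[0].lower()].append(name)

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    # SequenceMatcher ratio is a simple stand-in for a real string metric.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

candidate_pairs = []
for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if similar(block[i], block[j]):
                candidate_pairs.append((block[i], block[j]))

print(candidate_pairs)
```

Only pairs inside the same block are ever compared, so "Alan Turing" is never scored against any "Smith" variant, no matter how large the table grows.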
3. Deduplication strategy:
- Deterministic: define a priority rule for which record to keep (most recent, most complete, specific source)
- Write a SELECT DISTINCT-equivalent using ROW_NUMBER() to select the canonical record:
```sql
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY {{natural_key_columns}}
ORDER BY {{priority_column}} DESC
) AS rn
FROM {{table_name}}
)
SELECT * FROM ranked WHERE rn = 1
```
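The ROW_NUMBER template above can be exercised end to end against SQLite (which supports window functions since version 3.25). The table, key, and `loaded_at` priority column below are hypothetical fill-ins for the placeholders:

```python
import sqlite3

# Hypothetical fill-ins: {{table_name}} = orders,
# {{natural_key_columns}} = (customer_id, order_date),
# {{priority_column}} = loaded_at (keep the most recently loaded row).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INT, order_date TEXT, amount REAL, loaded_at TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, "2024-01-01", 10.0, "2024-01-01T00:00"),
    (1, "2024-01-01", 12.0, "2024-01-02T00:00"),  # later load wins
    (2, "2024-01-02", 20.0, "2024-01-01T00:00"),
])

canonical = conn.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, order_date
                   ORDER BY loaded_at DESC
               ) AS rn
        FROM orders
    )
    SELECT customer_id, order_date, amount FROM ranked WHERE rn = 1
    ORDER BY customer_id
""").fetchall()
print(canonical)  # [(1, '2024-01-01', 12.0), (2, '2024-01-02', 20.0)]
```

Each natural-key group keeps exactly one row, chosen by the `ORDER BY` inside the window, so the priority rule lives in one place.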
4. Prevention (upstream fix):
- Add a UNIQUE constraint if the warehouse supports it
- Add a pre-load duplicate check that blocks the load if duplicates are detected in the incoming batch
- Instrument the source system write path to prevent duplicates at origin
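A pre-load duplicate gate can be as small as a key-counting check that refuses the batch. This is an illustrative sketch, not any particular orchestrator's API; the column names are hypothetical:

```python
from collections import Counter

def check_batch_for_duplicates(rows, key_columns):
    """Raise before loading if the incoming batch has duplicate natural keys.

    rows: list of dicts; key_columns: tuple of column names forming the key.
    Illustrative pre-load gate, not a specific framework's API.
    """
    keys = Counter(tuple(row[c] for c in key_columns) for row in rows)
    dupes = {k: n for k, n in keys.items() if n > 1}
    if dupes:
        raise ValueError(
            f"Load blocked: {len(dupes)} duplicate key(s), e.g. {next(iter(dupes))}"
        )

batch = [
    {"customer_id": 1, "order_date": "2024-01-01"},
    {"customer_id": 1, "order_date": "2024-01-01"},  # duplicate key
]
try:
    check_batch_for_duplicates(batch, ("customer_id", "order_date"))
    blocked = False
except ValueError:
    blocked = True
print(blocked)  # True: the load would be rejected
```

Raising (rather than silently deduplicating) keeps the failure visible, so the upstream producer gets fixed instead of the symptom.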
5. Monitoring:
- Add a daily duplicate count metric to the DQ dashboard
- Alert if duplicate count increases day-over-day
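The day-over-day alert rule above is a one-line comparison over the metric history. A minimal sketch, assuming the dashboard stores one duplicate count per day:

```python
def should_alert(history):
    """Return True if today's duplicate count rose versus yesterday's.

    history: list of daily duplicate counts, oldest first. A simple
    day-over-day rule; real dashboards might also alert on absolute
    thresholds or sustained trends.
    """
    return len(history) >= 2 and history[-1] > history[-2]

assert should_alert([5, 5, 12]) is True    # jump from 5 to 12 fires
assert should_alert([12, 5]) is False      # counts went down: no alert
assert should_alert([7]) is False          # not enough history yet
```

A day-over-day rule catches regressions quickly but is noisy for naturally fluctuating counts, so pairing it with an absolute threshold is a common refinement.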
Return: exact duplicate query, ROW_NUMBER deduplication query, near-duplicate detection approach, and prevention implementation.
When to use this prompt
When duplicate records appear in very large warehouse tables.
When the natural key spans multiple columns.
When near-duplicate matching is needed for messy string fields.
When you need both cleanup logic and prevention controls.
What the AI should return
Return exact duplicate queries, canonical-record selection logic using window functions, an approach for near-duplicate detection at scale, and prevention recommendations. Include monitoring ideas and how to quantify duplicate groups and excess rows. The result should balance warehouse-side cleanup with upstream controls.
How to use this prompt
Open your data context
Load your dataset, notebook, or working environment so the AI can operate on the actual project context.
Copy the prompt text
Use the copy button above and paste the prompt into the AI assistant or prompt input area.
Review the output critically
Check whether the result matches your data, assumptions, and desired format before moving on.
Chain into the next prompt
Once you have the first result, continue deeper with related prompts in Data Quality.
Frequently asked questions
What does the Duplicate Detection at Scale prompt do?
It gives you a structured data-quality starting point for data engineering work and helps you move faster without starting from a blank page.
Who is this prompt for?
It is designed for data engineering workflows and marked as intermediate, so it works well as a guided starting point for that level of experience.
What type of prompt is this?
Duplicate Detection at Scale is a single prompt. You can copy it as-is, adapt it, or use it as one step inside a larger workflow.
Can I use this outside MLJAR Studio?
Yes. The prompt text works in other AI tools too, but MLJAR Studio is the best fit when you want local execution, visible Python code, and reusable notebooks.
What should I open next?
Natural next steps from here are Data Lineage Tracking, Data Quality Framework Chain, and Data Quality Test Suite.