The Duplicate Photo Problem at Scale
In a library of 50,000 photos built up over years of shooting, duplicates and near-duplicates accumulate from several sources: cards imported twice, the same shoot copied to different folders, burst sequences that were never culled, and exports at different sizes stored alongside the originals. Cleaning this up by hand is impractical.
Types of Duplicates to Address
Exact File Duplicates
True exact duplicates have byte-identical file content, so any deduplication tool that compares file hashes will find them. A RAW file accidentally imported twice is the typical case.
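As a rough illustration of hash-based detection, here is a minimal Python sketch using only the standard library. The function names (file_digest, find_exact_duplicates) and the choice of SHA-256 are assumptions made for this example, not the behaviour of any particular tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-identical."""
    # Files of different sizes cannot be identical, so bucket by size first
    # and only hash files that share a size with at least one other file.
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Each returned group lists paths with identical content. The sketch only reports them, leaving the decision about which copy to keep to you.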
Near-Duplicates (Burst Sequences)
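Near-duplicates are frames from a burst sequence that look almost identical but differ at the byte level. File hashing cannot find them, because changing a single byte produces a completely different hash; a perceptual hash, which summarises what the image looks like rather than its bytes, is required.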
Resized or Re-exported Versions
Full-resolution JPEGs and web-optimised versions of the same image often end up stored together. Perceptual hashing can identify these as near-duplicates despite their different dimensions and file sizes.
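Both near-duplicate cases rely on perceptual hashing. To give a feel for the idea, here is a minimal sketch of one common variant, a difference hash (dHash), in pure Python. It assumes each image has already been decoded into a 2D grid of grayscale values by an imaging library of your choice. The names shrink, dhash and hamming are illustrative, and real tools may use different algorithms.

```python
def shrink(pixels, width, height):
    """Downsample a 2D grayscale grid to width x height by box averaging."""
    src_h, src_w = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        y0 = y * src_h // height
        y1 = max((y + 1) * src_h // height, y0 + 1)
        row = []
        for x in range(width):
            x0 = x * src_w // width
            x1 = max((x + 1) * src_w // width, x0 + 1)
            block = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def dhash(pixels, hash_size=8):
    """Difference hash: one bit per horizontally adjacent pair of cells."""
    # A (hash_size + 1) x hash_size grid gives hash_size comparisons per row.
    small = shrink(pixels, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(left > right)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic stand-ins for decoded images: a gradient, a slightly brighter
# copy (like a neighbouring burst frame), and a half-size export.
frame = [[x + y for x in range(72)] for y in range(64)]
brighter = [[value + 2 for value in row] for row in frame]
half_size = [row[::2] for row in frame[::2]]

print(hamming(dhash(frame), dhash(brighter)))
print(hamming(dhash(frame), dhash(half_size)))
```

A small Hamming distance between two hashes suggests the images look alike. The threshold that separates a near-duplicate from a different picture is a tuning choice, not a fixed rule.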
Using imagic for Duplicate Removal
imagic uses perceptual hashing as part of its AI Analyse step. When you run a folder through imagic, it automatically groups near-identical images, including burst sequences, slightly different exposures, and resized versions. In the Review step, each group is presented with the highest-quality frame pre-selected.
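How imagic forms its groups internally is not covered here. As a generic illustration of the idea, the sketch below clusters images whose perceptual hashes differ by only a few bits, then picks the highest-scoring frame in each cluster. The function names, the max_distance threshold and the quality scores are all hypothetical and are not imagic's API.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def group_similar(hashes: dict[str, int], max_distance: int = 6) -> list[list[str]]:
    """Cluster names whose hashes are within max_distance bits of each other.

    Uses single-link clustering via union-find: if A is close to B and B is
    close to C, all three land in one group even if A and C are further apart.
    """
    names = sorted(hashes)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if hamming(hashes[first], hashes[second]) <= max_distance:
                parent[find(second)] = find(first)

    clusters: dict[str, list[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    # Only clusters with more than one member are duplicate candidates.
    return [members for members in clusters.values() if len(members) > 1]


def pick_best(group: list[str], scores: dict[str, float]) -> str:
    """Return the member with the highest (hypothetical) quality score."""
    return max(group, key=lambda name: scores[name])
```

The pairwise comparison is quadratic in the number of images, which is acceptable for a single shoot folder but slow for a whole library at once.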
For an existing large library, run individual shoot folders through imagic's Import and Analyse steps to clean up old burst sequences that were never culled. Install it with pip install imagic.
Additional Tools for Library Deduplication
digiKam includes a built-in duplicate finder using both exact and perceptual similarity detection. dupeGuru is a free, open-source duplicate file finder with a configurable image similarity mode.
A Cleanup Strategy for Large Libraries
- Start with exact duplicate removal using a file hash tool
- Run perceptual duplicate detection on your oldest unculled shoots
- Review near-duplicate groups manually before deleting; perceptual hashing occasionally produces false positives
- Move confirmed duplicates to a temporary holding folder before permanent deletion
- After 30 days, delete the holding folder
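The holding-folder steps above can be scripted. The following is a minimal sketch under stated assumptions: the dated batch subfolders, the function names (move_to_holding, purge_expired) and the dry_run default are choices made for this example only. Batches are dated by folder name rather than file modification time, because moving a file within the same drive usually keeps its original timestamp.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path

HOLDING_DAYS = 30  # retention period from the steps above


def move_to_holding(duplicates, holding_root: Path, today=None):
    """Move confirmed duplicates into a subfolder named after today's date."""
    today = today or date.today()
    batch = holding_root / today.isoformat()
    batch.mkdir(parents=True, exist_ok=True)
    moved = []
    for source in duplicates:
        target = batch / source.name
        counter = 1
        # Duplicates often share a file name, so avoid overwriting.
        while target.exists():
            target = batch / f"{source.stem}_{counter}{source.suffix}"
            counter += 1
        shutil.move(str(source), str(target))
        moved.append(target)
    return moved


def purge_expired(holding_root: Path, today=None, dry_run=True):
    """Delete batch folders at least HOLDING_DAYS old; return what matched."""
    today = today or date.today()
    cutoff = today - timedelta(days=HOLDING_DAYS)
    expired = []
    for batch in sorted(holding_root.iterdir()):
        if not batch.is_dir():
            continue
        try:
            moved_on = date.fromisoformat(batch.name)
        except ValueError:
            continue  # not a batch folder created by move_to_holding
        if moved_on <= cutoff:
            if not dry_run:
                shutil.rmtree(batch)
            expired.append(batch)
    return expired
```

With dry_run left at its default, purge_expired only reports which batches would be removed, which gives one last chance to check before anything is permanently deleted.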
Going forward, use imagic's culling workflow on every new shoot to prevent duplicate accumulation at the source.