Better duplicate file detection

robertdana · August 5, 2008

Right now, Retrospect's duplicate file detection mechanism is really limited: files have to have the same name (and, more troublesome, the same timestamp) to be considered the same.

It seems to me that, since you are already calculating MD5 checksums of files, you could implement a much more accurate (and tolerant) duplicate file detection mechanism.

Certainly keep the existing approach, but add a check for duplicate MD5 checksums in the catalog after completing the backup of each file. If there's a checksum match, catalog the file as a duplicate and throw away the unnecessary data. To reduce the catalog query overhead, you might want to include a preference setting that disables this kind of matching for files below a configurable size threshold.
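To make the idea concrete, here is a minimal sketch in Python of what such a check could look like. It is not Retrospect's actual catalog code: the `Catalog` class, the `add_file` method, and the `min_dedup_size` preference are hypothetical names used only for illustration, and it trusts an MD5 match on its own, where a real implementation might also compare file sizes or contents to guard against collisions.

```python
import hashlib


class Catalog:
    """Hypothetical in-memory catalog that stores each distinct content once."""

    def __init__(self, min_dedup_size=0):
        # Hypothetical preference: files smaller than this skip checksum matching.
        self.min_dedup_size = min_dedup_size
        self.by_digest = {}   # MD5 hex digest -> index of the stored copy
        self.data = []        # file contents that were actually kept
        self.entries = []     # (name, index into self.data, is_duplicate)

    def add_file(self, name, content):
        """Catalog one backed-up file; return True if it was a duplicate."""
        eligible = len(content) >= self.min_dedup_size
        digest = hashlib.md5(content).hexdigest() if eligible else None

        if eligible and digest in self.by_digest:
            # Checksum match: record the name, discard the redundant data.
            self.entries.append((name, self.by_digest[digest], True))
            return True

        # First time this content is seen (or file is below the threshold).
        self.data.append(content)
        index = len(self.data) - 1
        if eligible:
            self.by_digest[digest] = index
        self.entries.append((name, index, False))
        return False


catalog = Catalog(min_dedup_size=4)
catalog.add_file("report.doc", b"quarterly numbers")
catalog.add_file("report copy.doc", b"quarterly numbers")  # same content, new name
print(len(catalog.entries), "entries,", len(catalog.data), "stored copy")
```

Both names end up in the catalog, but only one copy of the data is kept. Files under the threshold are always stored, which is the trade-off the preference setting would expose.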

MRIS · August 31, 2008

+1 on this feature.

Performing MD5 comparisons is actually very efficient. I once designed a duplicate file detection system using nothing but Microsoft Access and Microsoft's FCIV tool: http://support.microsoft.com/kb/841290

Although the FCIV scan took a few minutes, the actual queries to detect duplicate files in Access only took seconds.
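For anyone curious what that kind of query looks like, here is a rough sketch in Python. It is an assumption-laden stand-in, not the original setup: `sqlite3` takes the place of Access, `hashlib.md5` takes the place of the FCIV scan, and the `checksums` table and `find_duplicates` function are hypothetical names.

```python
import hashlib
import sqlite3


def find_duplicates(files):
    """Group paths by identical MD5.

    files: iterable of (path, content_bytes) pairs.
    Returns a list of path lists, one list per checksum seen more than once.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE checksums (path TEXT, md5 TEXT)")
    conn.executemany(
        "INSERT INTO checksums VALUES (?, ?)",
        [(path, hashlib.md5(content).hexdigest()) for path, content in files],
    )

    # Keep only rows whose checksum appears more than once in the table.
    rows = conn.execute(
        """
        SELECT c.md5, c.path
        FROM checksums AS c
        JOIN (SELECT md5 FROM checksums
              GROUP BY md5 HAVING COUNT(*) > 1) AS d
          ON c.md5 = d.md5
        ORDER BY c.md5, c.path
        """
    ).fetchall()
    conn.close()

    groups = {}
    for md5, path in rows:
        groups.setdefault(md5, []).append(path)
    return list(groups.values())


sample = [
    ("a.txt", b"same"),
    ("b/copy.txt", b"same"),
    ("c.txt", b"different"),
]
print(find_duplicates(sample))  # [['a.txt', 'b/copy.txt']]
```

The expensive part is computing the checksums; once they are in a table, finding the duplicates is a single grouped query.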

The beauty of this feature is that files with identical content but different file names can be backed up so that the content is stored only once, while all the different file names are still kept.
