
Better duplicate file detection


Right now, Retrospect's duplicate file detection mechanism is really limited: files have to have the same name (and, more troublesome, the same timestamp) to be considered the same.


It seems to me that, since you are already calculating MD5 checksums of files, you could implement a much more accurate (and tolerant) duplicate file detection mechanism.


Certainly keep the existing approach, but add a check for duplicate MD5 checksums in the catalog after completing the backup of each file. If there's a checksum match, catalog the file as a duplicate and throw away the redundant data. To reduce the catalog query overhead, you might want to include a preference setting that disables this kind of matching for files below a configurable size threshold.
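
Something along these lines, as a rough sketch (the function and variable names here are invented for illustration, not anything from Retrospect itself):

```python
import hashlib

# Configurable threshold: files smaller than this skip the duplicate
# lookup to keep per-file catalog query overhead down (assumed value).
MIN_DEDUP_SIZE = 4096

def catalog_file(catalog: dict, path: str, data: bytes) -> str:
    """Back up one file; return 'stored' or 'duplicate'.

    catalog maps path -> (status, md5 hex digest).
    """
    digest = hashlib.md5(data).hexdigest()
    if len(data) >= MIN_DEDUP_SIZE:
        # Check the catalog for an earlier file with the same checksum.
        for other_path, (_, other_digest) in catalog.items():
            if other_digest == digest:
                # Match: keep the name in the catalog, drop the data.
                catalog[path] = ("duplicate", digest)
                return "duplicate"
    catalog[path] = ("stored", digest)
    return "stored"
```

A real implementation would index the catalog by checksum rather than scanning it, but the flow is the same: hash, look up, and only store new content.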


+ 1 on this feature.


Performing MD5 comparisons is actually very efficient. I once designed a duplicate file detection system using nothing but MS Access and Microsoft's FCIV tool: http://support.microsoft.com/kb/841290


Although the FCIV scan took a few minutes, the actual queries to detect duplicate files in MS Access only took seconds.
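
The query involved is trivial once the checksums are in a table. A sketch of the same idea using SQLite instead of MS Access (the table and column names are invented):

```python
import sqlite3

# Build a small in-memory table of (path, md5) rows, as FCIV would produce.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, md5 TEXT)")
conn.executemany("INSERT INTO files VALUES (?, ?)", [
    ("a/report.doc", "d41d8cd9"),
    ("b/copy.doc",   "d41d8cd9"),   # same content as report.doc
    ("c/photo.jpg",  "9e107d9d"),
])

# Group by checksum; any group with more than one row is a set of duplicates.
dupes = conn.execute(
    "SELECT md5, COUNT(*) AS n FROM files GROUP BY md5 HAVING n > 1"
).fetchall()
```

Even over hundreds of thousands of rows, a grouped query like this runs in seconds, which matches my experience with the Access version.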


The beauty of this feature is that files with identical content but different file names can be backed up in such a way that the content is stored only once while all of the different file names are kept.
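
One way that layout could work, sketched with invented names (not Retrospect's actual on-disk format): store each unique piece of content once, keyed by its MD5, and have every file name map to a digest.

```python
import hashlib

store = {}  # md5 hex digest -> bytes: each unique content stored once
names = {}  # path -> md5 hex digest: every file name is kept

def backup(path: str, data: bytes) -> None:
    digest = hashlib.md5(data).hexdigest()
    store.setdefault(digest, data)  # a second copy of the same content is free
    names[path] = digest

def restore(path: str) -> bytes:
    return store[names[path]]

backup("report_v1.doc", b"quarterly numbers")
backup("report_final.doc", b"quarterly numbers")  # same content, different name
```

Restoring either name yields the same bytes, but the media only holds one copy of the content.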

