
Better duplicate file detection



Right now, Retrospect's duplicate file detection mechanism is really limited: files have to have the same name (and, more troublesome, the same timestamp) to be considered the same.

 

It seems to me that, since you are already calculating MD5 checksums of files, you could implement a much more accurate (and tolerant) duplicate file detection mechanism.

 

Certainly keep the existing approach, but add a check for duplicate MD5 checksums in the catalog after completing the backup of each file. If there's a checksum match, catalog the file as a duplicate and throw away the redundant data. To reduce the catalog query overhead, you might include a preference setting to disable this kind of matching for files below a configurable size threshold. A rough sketch of the idea follows.
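Here's a minimal sketch of how that flow might look, assuming a hypothetical catalog interface (`catalog.add`, `catalog.store`, and `catalog.lookup_by_checksum` are illustrative names, not Retrospect's actual internals):

```python
import hashlib

# Preference: skip checksum matching for small files to limit catalog
# queries. The threshold value here is illustrative.
MIN_DEDUP_SIZE = 64 * 1024

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def record_file(catalog, path, size):
    """Catalog a backed-up file, deduplicating by checksum where possible."""
    if size < MIN_DEDUP_SIZE:
        # Below the threshold: store the data without a duplicate check.
        catalog.add(path, checksum=None, data_ref=catalog.store(path))
        return
    digest = md5_of(path)
    existing = catalog.lookup_by_checksum(digest)  # assumed catalog query
    if existing is not None:
        # Identical content already in the backup set: keep this file's
        # name and metadata, but point it at the existing data.
        catalog.add(path, checksum=digest, data_ref=existing.data_ref)
    else:
        catalog.add(path, checksum=digest, data_ref=catalog.store(path))
```

Two files that hash the same would then share one data reference, so the second copy costs only a catalog entry.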


  • 4 weeks later...

+1 on this feature.

 

Performing MD5 comparisons is actually very efficient. I once designed a duplicate file detection system using nothing but MS Access and Microsoft's FCIV tool: http://support.microsoft.com/kb/841290

 

Although the FCIV scan took a few minutes, the queries to detect duplicate files in MS Access took only seconds; a sketch of that kind of query follows.
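For illustration, the duplicate query itself is just a GROUP BY over the checksums. A minimal sketch using SQLite in place of MS Access (the table layout and rows are made up for the example; the two digests are the well-known MD5s of the empty string and of "The quick brown fox jumps over the lazy dog"):

```python
import sqlite3

# Load (path, md5) rows, as produced by a checksum scan, into a table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT, md5 TEXT)")
con.executemany("INSERT INTO files VALUES (?, ?)", [
    ("a.doc",         "d41d8cd98f00b204e9800998ecf8427e"),
    ("copy_of_a.doc", "d41d8cd98f00b204e9800998ecf8427e"),
    ("b.doc",         "9e107d9d372bb6826bd81d3542a419d6"),
])

# Any checksum appearing more than once marks a set of duplicate files.
dupes = con.execute(
    "SELECT md5, COUNT(*) FROM files GROUP BY md5 HAVING COUNT(*) > 1"
)
for md5, count in dupes:
    print(md5, "appears", count, "times")
```

With an index on the checksum column, a query like this stays fast even over a large catalog, which matches the seconds-long query times above.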

 

The beauty of this feature is that files with identical content but different names can be backed up so that the content is stored only once, while every distinct file name is preserved.

