
TO ALL USERS: New feature proposal: Avoid redundant backup of renamed files.

Tags: redundant, md5, digest, hash, backup, file, rename, duplicate, duplicates, avoid

7 replies to this topic

#1 Ash7

    Newbie

  • Members
  • 6 posts

Posted 09 May 2017 - 09:05 PM

To All Retrospect Users, can you please read the following feature suggestion and offer a +1 vote response if you would like to see this feature added in a future version of Retrospect? If you have time to send a +1 vote to the Retrospect support team, that would be even better. Thank you! 

 

I contacted Retrospect support and proposed a new feature that would avoid redundant backups of renamed files which are otherwise the same in content, date, size, and attributes. Currently, Retrospect performs progressive backups and avoids duplicates as long as a file's name remains the same, even if the folder portion of the path has changed. However, if a file stays in the same folder location and is merely renamed, Retrospect backs up the file as if it were a new file, duplicating the data within the backup set. This costs time and disk space if a massive number of files are renamed but otherwise left unchanged, or if the same file (in content, date, size, and attributes) appears in various places throughout a backup source under different names. If this proposed feature is implemented, a Retrospect user could rename a file in a backup source without it being redundantly backed up, provided the file's contents, date, size, and attributes did not change (i.e., a mere file name change would not cause a duplicate backup). 

 

I made this suggestion after renaming a bunch of large files caused Retrospect to want to re-back up tons of data it had already backed up, merely because I had changed the files' names. I had actually assumed Retrospect's progressive backup avoided such duplication, because I had observed it avoiding duplication when a file's folder changed. For a folder name change, Retrospect is progressive and avoids duplicates, but if a file itself is renamed, Retrospect is not progressive and backs up a duplicate as if it were a completely new file.  

 

If you +1 vote this suggestion, you will be supporting the possible implementation of a feature that lets you rename files without incurring a duplicate backup of each renamed file. This would allow you to reorganize a large library of files with new names to your liking without having to re-back up the entire library. 

 

Thanks for your time in reading this feature suggestion.


  • CAPRI - Agustín Hernández likes this

#2 Lennart Thelander

    Retrospect Veteran

  • Members
  • 3,529 posts
  • Location: Helsingborg, Sweden

Posted 10 May 2017 - 03:57 PM

The only way that Retrospect can check if it is the same file that has been renamed is to check the contents EVERY time it runs a backup and compare with the files already backed up (for instance through a checksum). That means every backup must read every byte from every file, which will take the same amount of time as a full backup.

 

Or am I missing something?



#3 Ash7

    Newbie

  • Members
  • 6 posts

Posted 11 May 2017 - 03:32 AM

Lennart Thelander, on 10 May 2017, said:

> The only way that Retrospect can check if it is the same file that has been renamed is to check the contents EVERY time it runs a backup and compare with the files already backed up (for instance through a checksum). That means every backup must read every byte from every file, which will take the same amount of time as a full backup.
>
> Or am I missing something?

 

Why check every file? Currently Retrospect performs a progressive backup and avoids duplicates by attempting to match the source file to the catalog merely on name, size, dates, and attributes. If that match fails, Retrospect considers the file new and must necessarily read the entire file anyway, and by default it generates an MD5 hash while doing so. So we know Retrospect must read all new files at least once (currently only once). This proposed feature has the overhead of potentially reading such a source file twice, but that need can be optimized away with a little effort. Even without such optimizations, I would find this feature quite useful.

 

The feature I’m suggesting could easily be implemented by performing an advance extra read (before writing to the set) of any new file that fails the initial duplicate checking. This extra read would be performed in order to generate the MD5 hash used to determine whether the file is a duplicate. With that MD5, Retrospect can perform the same matching it does initially for finding duplicates, but instead of using name/size/dates/attributes, it would use MD5/size/dates/attributes. If that matches a file in the catalog, Retrospect avoids a duplicate and instead inserts a reference to the existing file already in the set.
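
To make this concrete, here is a minimal sketch (in Python) of the check I have in mind. The catalog structure and field names are entirely hypothetical, not Retrospect's actual internals; the point is only that the proposed fallback reuses the MD5 digest Retrospect already generates.

import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    # Hypothetical catalog record; Retrospect's real catalog format is not public.
    name: str
    size: int
    mtime: float
    attrs: int
    md5: str          # Retrospect already stores an MD5 digest for each backed-up file

def md5_of(path):
    # The "advance extra read": hash the whole file before deciding whether to copy it.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_backup(src, catalog):
    # src is a hypothetical object with .name/.size/.mtime/.attrs/.path fields.
    # Step 1: today's progressive check, matching on name, size, dates, attributes.
    for e in catalog:
        if (e.name, e.size, e.mtime, e.attrs) == (src.name, src.size, src.mtime, src.attrs):
            return False        # unchanged file: skipped, exactly as Retrospect does now
    # Step 2 (proposed): hash the apparently new file and look for the same content
    # under a different name, matching on MD5, size, dates, attributes.
    digest = md5_of(src.path)
    for e in catalog:
        if (e.md5, e.size, e.mtime, e.attrs) == (digest, src.size, src.mtime, src.attrs):
            return False        # renamed duplicate: just reference the existing data
    return True                 # genuinely new content: back it up

Real code would of course consult the catalog's indexes rather than scan linearly; the optimizations I describe below are about avoiding the md5_of() call in the first place.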

 

The overhead of the above suggestion is the so-called “advance extra read” required to produce the MD5 outside of Retrospect’s normal behavior of reading the file to produce that same MD5 while it writes to the set. Yes, that is extra overhead for new files, but a few things on this…

 

First, it’s worth it as described above. I could leave it always enabled and be happy with that in my usage. However, I could also enable it for backups that I know include a lot of renamed files and then disable it afterwards. This would allow me to get a backup set in sync after a huge rename. But rather than worrying about enabling/disabling, there are some potential optimizations which could be added…

 

The key to avoiding the extra overhead is to avoid the extra source file read that produces the MD5 in advance, which is needed to see whether the file’s content duplicates anything already in the catalog. A number of ways Retrospect could do this come to mind offhand…

 

For example, as Retrospect encounters each file that appears to be new (which could be a renamed file that is actually a duplicate), Retrospect looks at the catalog information for all files that match the potential new source file’s date, size, and attributes (ignoring the name). If there are no matches, Retrospect proceeds as it normally would, adding the new file to the set (no extra overhead here except the catalog search, which should be negligible). If one or more files in the catalog match on size/dates/attributes, Retrospect then generates an MD5 for the source file and uses it to check the catalog for a match on MD5/dates/size/attributes. If there is a match, Retrospect considers the file a duplicate and proceeds without backing up the new copy (but the file will appear in the snapshot of course, just as with any progressive backup).
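
Roughly, that pre-filter could look like the following (again just a sketch with hypothetical field names; the idea is that an index keyed on size/dates/attributes makes the pre-check a cheap lookup, and the whole-file read happens only when a rename is plausible):

import hashlib
from collections import defaultdict

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_metadata_index(catalog):
    # Index catalog entries on (size, mtime, attrs) so the pre-check is a dict lookup.
    index = defaultdict(list)
    for e in catalog:
        index[(e.size, e.mtime, e.attrs)].append(e)
    return index

def needs_backup_optimized(src, index):
    # Called only for files that already failed the normal name/size/dates/attributes match.
    candidates = index.get((src.size, src.mtime, src.attrs), [])
    if not candidates:
        return True                      # nothing could possibly match: skip the extra read
    digest = md5_of(src.path)            # extra read only when a rename is plausible
    return not any(e.md5 == digest for e in candidates)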

 

The example optimization just described is basically making sure the catalog has at least one file with the same dates/size/attributes before doing the extra read to produce the MD5. All the optimizations here are about avoiding that extra work: you need something strong, like a hash, to do the content check, but you want to skip that check whenever existing, simpler data can rule out a match faster. But even more could be done…

 

Retrospect could maintain additional "simpler" data (than a hash) to help with further optimizations. For example, Retrospect could maintain simple checksums for arbitrary but strictly defined sections of a file, such as checksums of a few sectors at the file's beginning, middle, and end. These could be added to the prior optimization’s check: given a potential duplicate/new source file’s size/dates/attributes, Retrospect would check the catalog for one or more files matching size/dates/attributes and those simple checksums, and only if a catalog match is found would it proceed with the advance (extra) read of the source file to produce the MD5 and perform the final check with MD5/dates/size/attributes.
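
A sketch of that cheaper screen, assuming the catalog were extended to store these small per-region checksums alongside the MD5 (it does not today, so the e.sections field below is hypothetical):

import zlib

REGION = 4096   # sample a few KB from the start, middle, and end of the file

def section_checksums(path, size):
    # Cheap screen: CRC32 of three small, fixed regions instead of hashing the whole file.
    offsets = (0, max(0, size // 2 - REGION // 2), max(0, size - REGION))
    sums = []
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            sums.append(zlib.crc32(f.read(REGION)))
    return tuple(sums)

def rules_out_rename(src, candidates):
    # candidates: catalog entries that already matched on size/dates/attributes.
    # Assumes the catalog stored each entry's section checksums at backup time.
    # If none of them match, the full-file read for the MD5 can be skipped entirely.
    sections = section_checksums(src.path, src.size)
    return not any(e.sections == sections for e in candidates)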

 

Again, I would LOVE this feature even without any such optimizations. The extra read would only occur for files that fail all of Retrospect's current checks for changed or duplicate files, the checks that make today's progressive backups possible. Only when all of those fail to match would the extra read be required. So an extra read on all such new files, in exchange for avoiding duplicates on renames? To me... that's totally worth it. The complexity I added above, which is not required for a good feature, is merely about finding creative ways to avoid that extra source read, by using faster checks that may eliminate the need to read the file in advance to get the MD5.

 

Does this make sense?

 

I found it a nightmare that simply renaming files to converge several different naming conventions for a large library caused Retrospect to want to back up everything that had been renamed all over again, even though only the names had changed. I actually thought Retrospect already did the above and was surprised it did not. I would really like to see this feature added. Obviously they shouldn't do anything without seeing a worthwhile benefit beyond one user; I think they are open to suggestions but want to hear that other users would like this too.



#4 Lucky_Phil

    Advanced Member

  • Members
  • 45 posts

Posted 11 May 2017 - 02:09 PM

In my opinion, scheduling more frequent groom jobs would be better than having this proposed feature.



#5 Ash7

    Newbie

  • Members
  • 6 posts

Posted 11 May 2017 - 03:36 PM

Lucky_Phil, on 11 May 2017, said:

> In my opinion, scheduling more frequent groom jobs would be better than having this proposed feature.

 

I can't see how grooming solves the same problem... Can you elaborate? 



#6 Lucky_Phil

    Advanced Member

  • Members
  • 45 posts

Posted 12 May 2017 - 07:38 AM

Just that although there would be a "duplicate file", the older file would be removed when the groom was run.

 

In my opinion, I would rather have a backup of files that are currently on my machine, since renaming one in effect does create a "new" file. I think it would be confusing and somewhat difficult to implement the feature you are suggesting.

 

Each to their own.



#7 Ash7

    Newbie

  • Members
  • 6 posts

Posted 12 May 2017 - 12:37 PM

Lucky_Phil, on 12 May 2017, said:

> Just that although there would be a "duplicate file", the older file would be removed when the groom was run.
>
> In my opinion, I would rather have a backup of files that are currently on my machine, since renaming one in effect does create a "new" file. I think it would be confusing and somewhat difficult to implement the feature you are suggesting.
>
> Each to their own.

 

There’s actually nothing inherently difficult about implementing this feature in a relative sense.  There are many backup solutions beginning to support/emphasize deduplication. Retrospect’s “progressive” backup is essentially a partial implementation of a full deduplication feature… it goes some distance in deduplication but falls short of some competing local and cloud backup solutions that offer full deduplication.

 

Don’t get me wrong, Retrospect’s concept of “progressive” backups is something I’ve appreciated and it has been a strong point for some time, but using hashes and managing duplicates is no longer out of the ordinary… I feel “progressive backup” is no longer the novelty it used to be, and I’m certainly feeling pain where it falls short.

 

My feature suggestion is really about adding an option users can activate to have Retrospect go the full distance. The only change really required for the simplest introduction of this feature is the initial thing I described: When Retrospect’s present-day “progressive backup” logic detects no matches (a supposed new file), generate a hash of that supposed new source file so a secondary and complete deduplication check can be performed beyond Retrospect’s partial present-day deduplication check. That’s it in its simplest form.
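
In code terms, that simplest form is a single content-keyed lookup, run only after the existing logic has already declared a file new. A bare-bones sketch, with a hypothetical md5_index mapping standing in for whatever catalog lookup Retrospect would actually use:

import hashlib

def file_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_existing_copy(path, size, mtime, attrs, md5_index):
    # md5_index: hypothetical mapping of (md5, size, mtime, attrs) -> existing set member.
    # Called only after today's progressive check has already declared the file "new".
    key = (file_md5(path), size, mtime, attrs)
    return md5_index.get(key)   # an existing member to reference, or None -> really back it up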

 

The following are some links relating to deduplication. If you prefer to web search instead of clicking links, search for "deduplication backup" or "deduplication cloud backup" and you should find the Wikipedia article and several vendors. I have no affiliation that I'm aware of with any of the companies at the following links. These might help to highlight that deduplication is likely not difficult to achieve and may be worthwhile. I personally consider deduplication a CS102 task… it really shouldn't be outside the realm of most software engineers, even those just starting out in this space.

 

https://en.wikipedia...a_deduplication

https://www.google.c...licating backup

https://www.google.c...ng cloud backup

 

http://www.acronis.c.../deduplication/

https://www.druva.co...f-cloud-backup/

https://www.druva.co...-deduplication/

http://www.asigra.co...t/deduplication

https://www.barracud...backup/features

https://borgbackup.r...s.io/en/stable/

http://zbackup.org/

https://attic-backup.org/

http://opendedup.org/odd/

 

Your needs sound different, unrelated to requirements that benefit from full deduplication. I don’t want to delete historical copies of files, yet I want to be able to rename files within the library without a backup storage penalty (disk, cloud, or otherwise). A full deduplication feature achieves this.

 

Today’s Retrospect already does part of this, so it’s within its scope. It just breaks down as soon as you rename enormous files or large numbers of files… because your backup set essentially grows by the size of everything renamed. That makes no sense and is easily averted, which is my point. Grooming can somewhat address the storage issue, but only by harming historical integrity for the sake of freeing up space, and that’s a deal-breaker for me... I don’t want to lose historical integrity. Even with grooming, I’d still have storage impacts to the degree I need to retain history that has not yet been groomed. Unless I always groom everything down to one copy, I’ll be impacted the same way, and historical integrity will be erased anyway. I just don’t see how grooming replaces deduplication. (But I get that it works for you... that’s great, but you don’t have deduplication requirements. :)



#8 CAPRI - Agustín Hernández

    Unable To Update To 7.7.562 From 7.7.341

  • Members
  • 23 posts
  • Location: Spain

Posted 30 May 2017 - 01:23 PM

Request supported.


Agustín Hernández




