TO ALL USERS: New feature proposal: Avoid redundant backup of renamed files.


To All Retrospect Users, can you please read the following feature suggestion and offer a +1 vote response if you would like to see this feature added in a future version of Retrospect? If you have time to send a +1 vote to the Retrospect support team, that would be even better. Thank you! 

 

I contacted Retrospect support and proposed a new feature which would avoid redundant backups of renamed files that are otherwise the same in content, date, size, and attributes. Currently, Retrospect performs progressive backups, avoiding duplicates, if a file's name remains the same, even if the folder portion of the path has changed. However, if a file remains in the same folder location and is merely renamed, Retrospect will back up the file as if it were a new file, duplicating the data within the backup set. This costs time and disk space if a massive number of files are renamed but otherwise left unchanged, or if the same file (in content, date, size, attributes) appears in various places throughout a backup source under different names. If this proposed feature were implemented, it would allow a Retrospect user to rename a file in a backup source without it subsequently being redundantly backed up, provided the file's contents, date, size, and attributes did not change (i.e., a file name change alone would not cause a duplicate backup).

 

I made this suggestion in light of renaming a bunch of large files that caused Retrospect to want to re-backup tons of stuff it had already backed up, merely because I changed the files' names. I had actually assumed Retrospect's progressive backup avoided such duplication, because I had observed it avoiding duplication when a file's folder changed. For a folder name change, Retrospect is progressive and avoids duplicates, but if a file is renamed, Retrospect is not progressive and backs up a duplicate as if it were a completely new file.

 

If you +1 vote this suggestion, you will be supporting the possible implementation of a feature that will let you rename files without incurring a duplicate backup of each renamed file. This can allow you to reorganize a large library of files with new names to your liking without having to re-backup the entire library. 

 

Thanks for your time in reading this feature suggestion.


The only way that Retrospect can check if it is the same file that has been renamed is to check the contents EVERY time it runs a backup and compare with the files already backed up (for instance through a checksum). That means every backup must read every byte from every file, which will take the same amount of time as a full backup.

 

Or am I missing something?


The only way that Retrospect can check if it is the same file that has been renamed is to check the contents EVERY time it runs a backup and compare with the files already backed up (for instance through a checksum). That means every backup must read every byte from every file, which will take the same amount of time as a full backup.

 

Or am I missing something?

 

Why check every file? Currently Retrospect performs a progressive backup and avoids duplicates by attempting to match the source file to the catalog merely on name, size, dates, and attributes. If that match fails, Retrospect considers the file new and must necessarily read the entire file; by default it also generates an MD5 hash. So we know Retrospect must read all new files at least once (currently only once). This proposed feature has the overhead of potentially reading such a source file twice, but that need can be optimized away with a little effort. Even without such optimizations, I would find this feature quite useful.
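To make that concrete, here is a minimal Python sketch of the name-based match as I've described it. The FileInfo record, the catalog list, and the helper names are hypothetical stand-ins of my own, not Retrospect's actual data structures:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class FileInfo:
    name: str        # trailing file name, without the folder portion
    size: int
    mtime: float     # modification date
    attributes: int
    md5: str = ""    # filled in once the file has been read

def md5_of(path):
    """Read the whole file once to produce its MD5 (a backup has to read it anyway)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def is_known_by_name(src, catalog):
    """Today's progressive-backup style match: name/size/dates/attributes only."""
    return any(c.name == src.name and c.size == src.size and
               c.mtime == src.mtime and c.attributes == src.attributes
               for c in catalog)
```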

 

The feature I’m suggesting could easily be implemented by performing an advance extra read (before writing to the set) of any such new files that fail the initial duplicates checking. This advance extra read would be performed in order to generate the MD5 hash toward determining if a file is a duplicate. With that MD5, Retrospect can perform the same matching as it does initially for finding duplicates but instead of using name/size/dates/attributes, it would use MD5/size/dates/attributes. If that matches a file in the catalog, Retrospect avoids a duplicate, and instead inserts a reference to the existing file already in the set.
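Continuing the sketch above (again, just an illustration of the idea, not Retrospect's actual code), the secondary check would slot in right where the name-based match fails:

```python
def backup_decision(path, src, catalog):
    """Proposed flow: if the name-based match fails, hash the file first and try
    an MD5/size/dates/attributes match before writing its data to the set again."""
    if is_known_by_name(src, catalog):
        return "skip: already handled by today's progressive backup"
    digest = md5_of(path)                      # the "advance extra read"
    for c in catalog:
        if (c.md5 == digest and c.size == src.size and
                c.mtime == src.mtime and c.attributes == src.attributes):
            return "skip: renamed duplicate, reference the data already in the set"
    return "back up as new (recording the MD5 for future matching)"
```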

 

The overhead of the above suggestion is the so-called “advance extra read” that is required to produce the MD5 outside of Retrospect’s normal behavior of reading the file to produce that same MD5 while it also writes to the set. Yes, that is extra read overhead for new files, but a few things on this…

 

First, it’s worth it as described above. I could have it always enabled and be happy with that in my usage. However, I could also enable it for backups that I know have a lot of renamed files, then disable afterwards. This would allow me to get a backup set in sync with a huge rename. But even better than worrying about enabling/disabling, there are some potential optimizations which can be added…

 

The key to avoiding the extra overhead is to avoid the extra source file read to produce the MD5 in advance which is needed to see if it’s a duplicate in content to anything already in the catalog. There are a number of ways which come to mind offhand that Retrospect could do this…

 

For example, as Retrospect encounters each file that appears to be new (which could be a renamed file that is actually a duplicate), Retrospect looks at the catalog information for all files that match that potential new source file’s date, size and attributes (without the name). If there are no matches, Retrospect proceeds forward as it normally would, adding the new file to the set (no extra overhead here except the catalog search, which should be negligible). If there are one or more matches of size/dates/attributes to files in the catalog, Retrospect then proceeds to generate an MD5 for the source file, which it then uses to check the catalog for a match on MD5/dates/size/attributes. If there is a match, Retrospect considers the file to be a duplicate and proceeds forward without backing up the new copy (but the file will appear in the snapshot of course, just as with any progressive backup).
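In sketch form (building on the hypothetical helpers above), the pre-filter simply gates the extra read:

```python
def backup_decision_optimized(path, src, catalog):
    """Only pay for the extra read/hash when the catalog holds at least one
    candidate with the same size/dates/attributes (name ignored)."""
    if is_known_by_name(src, catalog):
        return "skip: already handled by today's progressive backup"
    candidates = [c for c in catalog
                  if c.size == src.size and c.mtime == src.mtime and
                     c.attributes == src.attributes]
    if not candidates:
        return "back up as new (no extra read needed)"
    digest = md5_of(path)                      # the extra read only happens here
    if any(c.md5 == digest for c in candidates):
        return "skip: renamed duplicate"
    return "back up as new"
```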

 

The example optimization just described is basically making sure the catalog has at least one matching file with the same dates/size/attributes before doing that extra read to produce the MD5. It seems all optimizations here are about avoiding that extra work. You need something strong, like that hash to do this content check, but you want to avoid that check if existing simpler data can be used for faster checks. But even more could be done…

 

Retrospect could maintain additional data "simpler" than a hash to help with further optimizations. For example, Retrospect could maintain a simple checksum for arbitrary but strictly defined sections of a file, such as checksums for a few sectors at the file's beginning, middle, and end. These could be added to the prior optimization’s check: given a potential duplicate/new source file’s size/dates/attributes, Retrospect could now check the catalog for one or more files matching on size/dates/attributes and also on those simple checksums, then (only if a catalog match is found) proceed with the advance (extra) read of the source file to produce the MD5 and perform the final check with MD5/dates/size/attributes.
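Something like the following could produce those section checksums cheaply. The fixed offsets and CRC32 are just illustrative choices on my part; any strictly defined scheme would do:

```python
import zlib

def quick_fingerprint(path, size, block=4096):
    """Cheap CRC32s of a few fixed regions (start, middle, end) that could be kept
    in the catalog and compared before committing to a full-file MD5."""
    sums = []
    with open(path, "rb") as f:
        for offset in (0, max(0, size // 2 - block // 2), max(0, size - block)):
            f.seek(offset)
            sums.append(zlib.crc32(f.read(block)))
    return tuple(sums)
```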

 

Again, I would LOVE this feature even without any such optimizations. The extra read would only occur for files that appear new, i.e., that fail all of Retrospect's current checks for changed or duplicate files, the checks that enable today's progressive backups. If all those fail to match, the extra read would then be required. So an extra read on all such new files to avoid duplicates on renames? To me... that's totally worth it. The complexity I added above, which is not required for a good feature, is merely about finding creative, faster checks that can eliminate the need for the advance/extra source read otherwise required to produce the MD5 for the content check.

 

Does this make sense?

 

I found it to be a nightmare that simply renaming files to converge several different naming conventions for a large library caused Retrospect to want to back up everything that had been renamed, even though only the names had changed. I had actually assumed Retrospect already did the above and was surprised it did not. I would really like to see this feature added. Obviously they shouldn't do anything without understanding a worthwhile benefit beyond one user. I think they are open to suggestions but want to hear that other users would like this.


Just that although there would be a "duplicate file", the older file would be removed when the groom was run.

 

In my opinion, I would rather have a backup of files that are currently on my machine, since renaming one in effect does create a "new" file. I think it would be confusing and somewhat difficult to implement the feature you are suggesting.

 

Each to their own.


Just that although there would be a "duplicate file", the older file would be removed when the groom was run.

 

In my opinion, I would rather have a backup of files that are currently on my machine, since renaming one in effect does create a "new" file. I think it would be confusing and somewhat difficult to implement the feature you are suggesting.

 

Each to their own.

 

Relatively speaking, there’s actually nothing inherently difficult about implementing this feature. Many backup solutions are beginning to support and emphasize deduplication. Retrospect’s “progressive” backup is essentially a partial implementation of a full deduplication feature… it goes some distance in deduplication but falls short of some competing local and cloud backup solutions that offer full deduplication.

 

Don’t get me wrong, Retrospect’s concept of “progressive” backups is something I’ve appreciated, and it has been a strong point for some time, but using hashes and managing duplicates is no longer out of the ordinary… I feel “progressive backup” is no longer the novelty it used to be, and I’m certainly feeling pain where it falls short.

 

My feature suggestion is really about adding an option users can activate to have Retrospect go the full distance. The only change really required for the simplest introduction of this feature is the initial thing I described: When Retrospect’s present-day “progressive backup” logic detects no matches (a supposed new file), generate a hash of that supposed new source file so a secondary and complete deduplication check can be performed beyond Retrospect’s partial present-day deduplication check. That’s it in its simplest form.

 

The following are some links relating to deduplication. If you prefer to web search instead of clicking the links below, search for "deduplication backup" and "deduplication cloud backup" and you should find the Wikipedia article and a number of vendors. I have no affiliation I’m aware of with any of the companies at the following links. These might help to highlight that deduplication is likely not difficult to achieve and may be worthwhile. I personally consider deduplication a CS102 task… it really shouldn’t be outside the realm of most software engineers, even those just starting out in this space.

 

https://en.wikipedia.org/wiki/Data_deduplication

https://www.google.com/#q=deduplicating+backup

https://www.google.com/#q=deduplicating+cloud+backup

 

http://www.acronis.com/en-us/resource-center/resource/deduplication/

https://www.druva.com/blog/why-global-dedupe-is-the-killer-feature-of-cloud-backup/

https://www.druva.com/public-cloud-native/scale-out-deduplication/

http://www.asigra.com/product/deduplication

https://www.barracuda.com/products/backup/features

https://borgbackup.readthedocs.io/en/stable/

http://zbackup.org/

https://attic-backup.org/

http://opendedup.org/odd/

 

Your needs sound different, unrelated to requirements that benefit from full deduplication. I don’t want to delete historical copies of files, yet I want to be able to rename files within the library without a backup storage penalty (disk, cloud, or otherwise). A full deduplication feature achieves this.

 

Today’s Retrospect partly does this already, so it’s within its scope. It just breaks as soon as you rename enormous files or large numbers of files… because your backup set essentially grows by the size of all the renamed files. That makes no sense and is easily averted; that’s my point. Grooming can somewhat solve the storage issue by harming historical integrity for the sake of freeing up space, which is a deal-breaker for me... I don't want to lose historical integrity. Even with grooming, though, I’d still have storage impacts to the degree I need to retain history that is not yet groomed. Unless I always groom everything but one copy, I’ll be impacted the same way, yet historical integrity will be ruined/erased. I just don’t see how grooming works as a replacement for deduplication. But I get that it works for you... that's great; you just don't have deduplication requirements. :)


  • 3 weeks later...
  • 1 month later...

I'm sorry I'm late to posting in this thread, but I just thought of a problem with Ash7's proposal.  What if he/she has done the "renaming of a bunch of large files" that he/she reports in the third paragraph of post #1 in this thread, and then has something bad happen to the drive(s) on which those files are currently stored?  When he/she then Restores those files, are they restored with their current names, or with their original names?  If the files are Restored with their original names, then the administrator would lose the benefit of having "reorganize[d] a large library of files with new names" (fourth paragraph of post #1).  If the files are to be Restored with their current names, then Retrospect would have to have updated the Snapshot with the new names, and substitute those new names when Restoring.

 

But if the latter, what if another user saved the same files under different names, has to Restore them, but wants the Restored files to have the names that particular user gave them?  It seems to me the Snapshot would therefore have to have for each file a list of Also Known As names, each paired with the drive on which the particular Also Known As name is used.  That would be a significant change to Retrospect, which would make its Snapshots look like the files U. S. police departments have to maintain (I have never worked for a police department, but I've watched enough crime dramas to  know what "AKA" stands for).  In the U. S. it is AFAIK not illegal for a person to use more than one name (I used to be married to a woman who kept her previous married name for professional use, and also uses a made-up name for her artistic career), although AFAIK a person must use a single name for governmental records.  Please, Ash7, don't make Retrospect go there!

 

P.S.: Replaced Catalog File with Snapshot, because I think the list of Also Known As names would have to be placed specifically there.


If the issues you outlined were potential consequences of the suggested DeDuplication feature, they would already exist in Retrospect today, since Retrospect already does partial DeDuplication. Keep in mind I'm not suggesting anything new or novel, but rather asking that Retrospect fully implement DeDuplication rather than leave it partially implemented as it has been for a long time.

 

Today, Retrospect properly DeDuplicates identical files across different folders so long as the file's name remains the same (by "name" I mean the trailing portion of the file's full/complete path, excluding the folder portion). In other words, if only the folder portion of the path changes, Retrospect already DeDuplicates today; I'm proposing that it do so uniformly and completely, not partially as it does today.

 

So you already have DeDuplication today and the issues you describe would not be new issues by virtue of implementing the feature suggestion. 

 

I don't have Retrospect's source code, but I believe the issue you outline is currently solved by restoring from a snapshot, which retains the file name as it was at the time the backup was created. I also believe that, in addition to the snapshot, Retrospect today likely retains a file's name (at the time of backup) within the set itself even if it skips backing up the file due to DeDuplication... it must do this in order to rebuild a catalog file, which it can do today.


I think Ash7 has misunderstood the basic issue I raised in post #9 of this thread.  It is also a human factors issue, not just a feasibility issue.  Let me give you a contrived simple example:

 

Let's say that on 29 June there were a whole bunch of existing files on computers at Ash7's installation whose names were too messy for his/her tastes.  So on 30 June he/she prefixed all the objectionable file names with "AlphaProj" and then ran a backup.  If Ash7's Retrospect feature suggestion had been implemented, these files would not be re-backed-up, only their names would be changed in the Snapshot—of which a copy would be saved in the Catalog File on disk for his Backup Set and another copy saved on the Backup Set's latest media member (see the fifth paragraph, counting the bolded one, in this section of the old Wikipedia article).  Meanwhile, let's say that on 30 June one of Ash7's users—call her Rebellia—had simultaneously renamed her own client computer copy of one of the messily-named files, prefixing its name with "CharlieCust".  Since Ash7's Retrospect feature suggestion had been implemented, that existing file would not have been re-backed-up to the same Backup Set that Ash7's installation was using daily that last week in June; only a notation of its new name on that user's machine would have been made in the Snapshot for 30 June.  

 

Now let's say that at 1 a.m. on 1 July a water leak at Ash7's installation destroys both the external hard drive containing the Backup Set's Snapshot-containing Catalog and Rebellia's computer.  Let's say that Ash7 was running his backups directly to a Cloud Backup Set, so that the latest Backup Set member is safely offsite.  Ash7 returns to his/her installation at 9 a.m. on the morning of 1 July, and runs out to buy a replacement external hard drive for the "backup server" and a new computer for Rebellia.  Ash7 then rebuilds his Snapshot-containing Catalog File from the Cloud Backup Set, and uses it to Restore the disk on Rebellia's new computer.  Does the copy of the file on Rebellia's new computer end up with a name prefixed "CharlieCust", or does it end up with a name prefixed "AlphaProj"?

 

BTW, based on my further reading of the section of the Wikipedia article linked-to in the second paragraph of this post, I now think that the list of Also Known As names would be specifically in the Snapshot, not just in the Catalog File.  So I've changed post #9 in this thread accordingly.

Edited by DavidHertzberg
Wikipedia article has been significantly re-edited, but an old version was saved in which the paragraph with the bulleted items was deleted from the linked-to section

... Does the copy of the file on Rebellia's new computer end up with a name prefixed "CharlieCust", or does it end up with a name prefixed "AlphaProj"?

 

If you restore Rebellia's new replacement hard drive using the snapshot taken at the time Rebellia last backed up her computer (in your example, that would be June 30), the restored files on Rebellia's replacement disk would have the names that existed when the snapshot was created, in this case the names prefixed with "CharlieCust" per your example.

 

The need for any such deduplication considerations already exists today, without implementing the suggested feature. Since Retrospect performs partial deduplication today (on by default, but it can be disabled), you would have a similar scenario if you restated it such that all users renamed (a.k.a. "moved") the folder location of a file. For example, if both UserA's disk and UserB's disk each have identical files in C:\SpecialPlace, where UserA renames it to C:\SpecialPlaceForUserA and UserB renames it to C:\SpecialPlaceForUserB, you effectively have a rename which Retrospect deduplicates today. It is effectively the same situation from the standpoint of evaluating disaster recovery. If UserB's system requires a replacement hard drive, UserB would end up with file names on the replacement drive based on the Snapshot used for the restore.


I think Ash7 is assuming that each Snapshot entry for a file now contains the concatenated names of each of its enclosing directories.  I doubt that's true, because it would make each Snapshot entry very lengthy, and therefore the total size of a Snapshot very large.  Instead I suspect each Snapshot entry contains the concatenated entry-numbers-within-parent-directory of each of its enclosing directories.  Ash7 should take a careful look at pages 30-31 of the Retrospect Windows 12 User's Guide, and decide what he/she thinks "A Snapshot is a list—you can think of it as a picture—of all files and folders on a volume when it is backed up" means.

 

If I'm right, it would explain why Retrospect Inc. has not already implemented what Ash7 proposes—even as an option.  Note that the fifth paragraph of this section of the old Wikipedia article starts out "Retrospect does file-level deduplication, patented as IncrementalPLUS."  DavidBenAvraham's reference for that is page 21 of the Retrospect Mac 6 User's Guide, the oldest UG available on the Retrospect.com website.  Although Dantz Development Corp.'s patent has surely expired by now, you can't—at least you're not supposed to—get a U.S. software patent for something that is "obvious ... to a person having ordinary skill in the art to which the claimed invention pertains".  I suspect that what made IncrementalPLUS patentable is something rather tricky involving the Snapshot and Catalog File.

 

P.S.: Expanded second paragraph with evidence that Snapshots are trickier than Ash7 may think.

Edited by DavidHertzberg
Wikipedia article has been significantly re-edited, but an old version was saved in which the paragraph with the bulleted items was deleted from the linked-to section

I did a Google search for "IncrementalPLUS" and "patents".  I found that some website connected with Amazon has a .PDF of the Retrospect Windows 6.5 User's Guide.  At the front of that 2003 manual were listed U.S. Patents 5,150,473 and 5,966,730; there were also trademark notices for a number of Dantz Retrospect terms.

 

Have fun researching those patents, Ash7.  I suspect that they contain the secrets of how Snapshots and Catalog Files interact to do Retrospect's IncrementalPLUS backups, which your proposed feature would modify.  When I get time and can find out how to view online at least a summary of a U.S. patent, I'll also research them.


I think Ash7 is assuming ... I doubt that's true... If I'm right...

 

... I suspect ...

 

I think it's important not to negate the viability of the feature suggestion based on assumptions about the product's implementation. Putting aside assumptions about how the feature suggestion might be implemented or any related obstacles in that endeavor, what would be more interesting to hear is whether or not you would find this feature, if implemented/released, a value-add, a positive, and good thing... Do you give the feature suggestion a +1 vote or not? That's truly the purpose of this thread as the Retrospect team suggested I seek out user feedback... if they hear enough such feedback, they'll be more apt to consider the suggestion.

 

An aside... while I don't have access to Retrospect source code and don't want to make assumptions about its catalog file's implementation, I feel compelled to address your assumptions by saying that common sense tells me the catalog file's current design is either ripe for this feature or very easily extended to accommodate it in a backward compatible manner, that all of the issues you've raised so far are non-issues when considering whether or not you like the idea of this feature from a conceptual standpoint (regardless of Wikipedia, manuals, implementation/design assumptions, etc). This is all to repeat the question.


I think it's important not to negate the viability of the feature suggestion based on assumptions about the product's implementation. Putting aside assumptions about how the feature suggestion might be implemented or any related obstacles in that endeavor, what would be more interesting to hear is whether or not you would find this feature, if implemented/released, a value-add, a positive, and good thing... Do you give the feature suggestion a +1 vote or not? That's truly the purpose of this thread as the Retrospect team suggested I seek out user feedback... if they hear enough such feedback, they'll be more apt to consider the suggestion.

 

An aside... while I don't have access to Retrospect source code and don't want to make assumptions about its catalog file's implementation, I feel compelled to address your assumptions by saying that common sense tells me the catalog file's current design is either ripe for this feature or very easily extended to accommodate it in a backward compatible manner, that all of the issues you've raised so far are non-issues when considering whether or not you like the idea of this feature from a conceptual standpoint (regardless of Wikipedia, manuals, implementation/design assumptions, etc). This is all to repeat the question.

 

 

It's all a question of trade-offs.  If the feature suggestion resulted in—e.g.—making Snapshots 10 times as large as they are now for all Retrospect administrators, I would definitely not find this feature a value-add.  But that's because I am a Retrospect administrator who does not personally engage in the renaming of massive numbers of files.

 

Therefore I have just engaged in a bit of research in the archives of the U.S. Patent Office, using the Web app http://patft.uspto.gov/ for which I found a link in the External Links at the bottom of the Wikipedia article "United States patent law".  First, you can forget about U.S. Patent 5,966,730; that turns out to be Dantz Development Corp.'s patent for Proactive backup.  So I looked at U.S. Patent 5,150,473; that appears to be Dantz Development Corp.'s patent (filed in 1990 and granted in 1992) for the interaction of the Snapshot and Catalog File.  Unfortunately I simply don't have the time to make sense of the flowchart-like explanation in what I can view; there was C code filed with the patent application as Appendix A, but it isn't stored online as part of this USPTO system.  However one of the prior patents cited as references is U.S. Patent 4,945,475, which turns out to be Apple's patent for the Hierarchical File System; a quick look says it works as I guessed in the first paragraph of post #13 in this thread.

 

So I did a simple calculation using the logged results of last Saturday's Recycle backup of my MacBook Pro.  I backed up 720,339 files, and the Snapshot contained 2 files totaling 229MB (building and copying the Snapshot reportedly took a bit less than 2 minutes).  That comes out to an average of 318 Snapshot bytes per file, which doesn't sound as if the names of parent directories are being stored in Snapshots.  However I can't tell if the names of the files themselves are being stored in Snapshots; it's possible that they are.
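Spelled out (assuming decimal megabytes), the calculation is simply:

```python
snapshot_bytes = 229_000_000       # the Snapshot was 2 files totaling ~229 MB
files_backed_up = 720_339
print(round(snapshot_bytes / files_backed_up))   # ~318 bytes of Snapshot per file
```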

 

Thus, without further information on the question discussed in the last sentence of the preceding paragraph (the need for which is discussed in the first paragraph of this post), I'm afraid I can't give this feature suggestion a "+1 vote".  Sorry, Ash7.


Okay, so I infer you see the feature suggestion as a value-add and are a +1 for it, with the qualification that its implementation not significantly impact Retrospect's catalog file size. Maybe I don't know the history of Retrospect's choices, but I doubt the Retrospect team would implement a relatively straightforward feature addition like this one, which is more a tweak to the product than a complete revamp, in a manner that would cause a Snapshot to be "10 times as large."

 

My technical assessment of things like this is fairly decent... I realize you don't know me, so I'm not expecting you to take this on faith, but this feature suggestion can, with reasonable certainty, be added to Retrospect with minimal impact on users' expectations, both in performance and in resulting catalog/set size. For users benefiting from the feature, it could save lots of hours and lots of hard disk space, if not related cloud costs as well.

 

There's no need for me to see its source code or any highly technical documents of Retrospect's to make this assessment. It's based on very commonly available information on software systems, files, file-related information, cryptographic hash functions, and managing the persistence of such minutiae.

 

What I'm inferring here is not required for the original purpose of this thread, but to the extent we do get into discussing implementation pitfalls as reasons for avoiding this feature, I want to chime in as strongly as possible about the simplicity I see here. If I don't, it could lead other Retrospect users to skip this thread (if they aren't already), to believe this feature represents a huge, risky change to the software, or perhaps to lose sight of the feature's goal within the technical discussion. So I appreciate your focus on implementation-related concerns, but I also have to respond in kind when I don't see things the same way... hope that makes sense.

 

Consider media produced by family vacations, content creators, wedding photographers, and so many others... all from devices which produce funky names. Over time it's certain many will develop new wisdom about better ways to name things, folders to use. A file's name is sort of like a label on an old paper file. Simply changing the label on the file itself shouldn't force a user into storing the data in duplicate for any one file cabinet (any one backup set). If I have a 4TB library of such files, I shouldn't have to store 8TB merely because I "re-label" things.

 

I personally think Retrospect should just implement this feature suggestion simply because it takes the product in the right direction, to a nice place with little impact to their team, and creates a really nice value-add, one which saves lots of time and money for affected users.

 

Let me predict that full general deduplication will eventually become a common expectation of most any user of backup software. It may not be today, or next year... but perhaps that's part of the point... Retrospect has a following so implementing this feature today just keeps them ahead of the game by letting them catch up in some sense before others do it for them (as is happening already).

 

There's a movement going on right now with more and more people aware of content creation, cloud storage, and the backup processes connected with all that. Renaming stored data should neither force a user to store duplicates within a single backup set nor force a user to ditch a backup set for a new one... the former makes users' organizing efforts inefficient and costly, while the latter greatly harms the implicit protection a lifetime backup set can provide (against human error, malware, and the like). Both choices make it difficult for a user to reorganize "labels", as there's no good avenue.


Ash7, my basic concern—as someone with 40 years of professional programming experience—is that your feature suggestion seems so obvious that there must be some good reason why the Retrospect Inc. engineers didn't implement it years ago.  My feeling is that good programmers (which I believe the Retrospect Inc. engineers to be, although their ability to eliminate bugs before release and to fix them afterward seems not quite up to snuff—even allowing for Retrospect being a complicated system) have good reasons for what they do or don't do.  Therefore if Retrospect doesn't carry de-duplication down to the filename level, my tentative hypothesis is that it must be because the Snapshot doesn't currently directly contain filenames.  I was hoping to see C structs as part of U.S. Patent 5,150,473 in order to confirm or refute my hypothesis, but the code listings aren't in the Web-published document.

 

Also, because one of the things I have learned in the last few months of participating in these Forums is that many of the Retrospect administrators who post here are consultants to organizations, I had envisioned you as such an administrator with an obsessive-compulsive need to regularize filenames in organizational archives.  That's why my contrived example in post #11 in this thread was constructed the way it was.  I see now, from what you have written in the fifth paragraph of post #17, that you are more likely to be an administrator of a home installation who needs to regularize filenames etc. in a family archive.  Although my own installation is in my home, my 20-year marriage (childless by mutual agreement) ended 14 years ago—before the advent of digital photography.  I have an archive of still photos I took on a scenic trip we took in 1999, but it is sitting in a box in a closet—I have never digitized it.  My ex-wife took some photos on previous trips, but she has retained those.  Anyway, I am now much more sympathetic to the need for your suggested feature.

 

I presume that, because you wrote in post #1 of having contacted Retrospect Support, you have already filed a Support Case for this feature request.  If you haven't, please let me know here—so that my next post in this thread can be my boilerplate explanation of how to do that.


A helpful side note to this discussion comes as a result of my reading an Ars Technica front page article on a patent troll.  The article linked to the patent being discussed using https://www.google.com/patents .  Obviously I couldn't resist using that URL to look up U.S. Patent 5,150,473.  The nice thing about that Google Web facility, as opposed to the Patent Office's own facility that I found in a Wikipedia article, is that it also lists patents that cite the numbered patent.

 

There are 9 U.S. patents that cite 5,150,473. Besides 5,966,730, they include 8 other patents assigned to EMC Corporation—which owned the Retrospect software from 2004 to 2010.  Unfortunately they don't further explain the structure of the Snapshot.  The information retrieved via the Google search is still only what the Patent Office put online; it's just formatted a bit more nicely, and provides links to other patents that cite the patent you're looking up.

 

Incidentally, the lead inventor on U.S. Patent 8,341,127—filed in 2006—is someone named Jeffery Gordon Heithcock.  The name sounds somewhat familiar; I wonder what happened to him?  That patent is for "client-initiated restore", which according to this section of the old Wikipedia article wasn't released until 2011 in Retrospect Mac 9.

Edited by DavidHertzberg
Wikipedia article has been significantly re-edited, but an old version was saved

...my basic concern ... is that your feature suggestion seems so obvious that there must be some good reason why the Retrospect Inc. engineers didn't implement it years ago.  My feeling is that good programmers ... have good reasons for what they do or don't do. ...

 

Well... hopefully we all have good reasons for doing what we do. ;) ... Really though, I get what you're trying to convey... in a purist sense it might seem that way, but I suspect that is a fallacy, given the plethora of software out there which is wonderful and loved (or "good enough" and liked) by users who at the same time live with that software's kinks, gaps, and so forth. This seems more the norm than the exception. The imperfections aren't a sign that the source code isn't well-crafted, nor that the developers aren't talented. Things sometimes (if not usually) just evolve in ways that can leave something obvious missing, for many understandable reasons.

 

It's funny... your inferred take on Retrospect's history for this gap is different from mine. I've always imagined Retrospect as having been originally created by developers who were tired of redundancy in backups caused by a simple rearranging of folders, so they added folder-level deduplication. Whether or not folder-level deduplication (a.k.a. Progressive Backup) was part of the initial Retrospect might offer a clue into why the feature is the way it is. I can see various possibilities that have nothing to do with bad design or development...

 

My best guess is that decades ago CPUs and RAM were far weaker/slower than today, so costly hashing of an entire file had to be done judiciously. Full deduplication would require re-reads of files in particular cases, re-reads which by today's standards are quite manageable and outweighed by the cost and time required to manage the unnecessary redundancy caused by renaming several terabytes of files, or even far less, like a mere 100GB. (Post #3 discusses this.) There are other things that could have affected the developers... the original implementation of folder-level-only deduplication, what we live with today, might have been deemed a good starting point for various reasons.

 

... ... my tentative hypothesis is that it must be because the Snapshot doesn't currently directly contain filenames.  ...

 

Effectively it does contain the filenames... just view a catalog without the set present and you'll see file names... so I believe you can't mean just that. When you say "directly" I believe you may be referring to potential abstractions between catalog structures or areas... if so, those are non-issues for a successful, backward-compatible implementation of this feature, and one without the significant negative catalog size impact you'd like to avoid... feel free to post a concrete hypothetical if you want me to elaborate more specifically/technically... but I generally feel the catalog is conceptually simple enough that it isn't a big piece of magic here... if you post a hypothetical with the worst catalog design you can conjure up, I can describe an easy path forward. The catalog and its size are not issues here... at least I'm not hearing anything to indicate such, nor seeing it as a user with a good nose for that sort of thing.

 

... Although my own installation is in my home, my 20-year marriage (childless by mutual agreement) ended 14 years ago—before the advent of digital photography.  I have an archive of still photos I took on a scenic trip we took in 1999, but it is sitting in a box in a closet—I have never digitized it.  My ex-wife took some photos on previous trips, but she has retained those.  Anyway, I am now much more sympathetic to the need for your suggested feature. ...

 

I'm sorry to hear about the marriage... those kinds of breakups are never fun... especially when you consider most any breakup isn't a fun thing. But you don't have any digital cameras?? ... get yourself a digital camera my friend! :D Well if you have any digital memories you cherish, or even if you're not sure but would like to save them in digital, flatbed scanners are fairly cheap these days... you can take an afternoon or two and capture things. ... and thank you for the sympathies on the feature... and for listening to what it was all about... wasn't sure anyone would see my post in this neck of the forum. LoL ... 

 

... I presume that, because you wrote in post #1 of having contacted Retrospect Support, you have already filed a Support Case for this feature request.  If you haven't, please let me know here—so that my next post in this thread can be my boilerplate explanation of how to do that.

 

I will double-check, but I think they added it to a list... here is what they said...

 

... That being said, product management team did mention that these changes "would have been worthwhile if we had many users requiring this feature." and also "if we start hearing from enough users, but for now, it doesn't seem as much of a high-request." ... So while we do see the benefit of having such a feature, at this time we do not have enough demand for it to warrant the cost of implementation. ... The request has been put on a request list which is reviewed each year when determining which features will be added to a release. ...
 
If you think there's a more official thing to do let me know. 
 
Thanks again! 

Ash7 may very well be correct as to the former slowness of CPUs and RAM explaining why Retrospect's developers originally considered full deduplication to be computationally too expensive; U.S. Patent 5,150,473 was filed by Dantz Development Corp. in 1990.  If you read the lead section of the old version of the Wikipedia article on Retrospect, especially the second and third paragraphs, you will see that the sales target for Retrospect shifted after Time Machine (and later the equivalent Windows facility) were developed to meet the backup needs of home installations.  Ash7 and I may be among the few home users of Retrospect left, which IMHO explains why Retrospect Inc. does not think it has "many users requiring this feature."

 

I mentioned the childlessness and breakup of my marriage to explain why I don't have a family archive with a lot of kids' pictures.  The scenic trip I took pictures of was in 1999, when digital cameras were still somewhat expensive and a bit exotic.  As a result of that trip I swore off taking my own pictures, because I found I was spending so much time planning the shots that I wasn't really looking at the sights.  Today I see so many people in my scenic part of Manhattan taking pictures (presumably for their Facebook pages), and wonder whether they are making the same mistake.  Anyway I now own a cheap digital camera, but I rarely use it; I'm not really a visual person.

Edited by DavidHertzberg
Wikipedia article has been significantly re-edited, but an old version was saved—and is now correctly linked-to with the proper paragraphs mentioned

It has been bugging me that the Google facility I mentioned in the first paragraph of post #19 in this thread only shows patents that reference the patent number you enter (well, you'd expect that—it's Google implementing the facility it thinks the world needs!).  To actually access the original U.S. patent 5,150,473, you have to browse https://www.google.com/patents/US5150473 (omitting the commas, naturally).

 

So, having figured this out, I took another look at U.S. patent 5,150,473.  It doesn't discuss the Snapshot, only the Catalog File.  If I click the leftmost image under Images, it shows Figs. 1-5—unfortunately Figs. 6-20 are not shown anywhere.  To the extent that I understand Figs. 2-5, each figure seems to show the format of a possible type of node within the tree shown in Fig. 1.  Fig. 4 is a File Info node, and it contains a single name.

 

IMHO, as a result of the calculation I discussed in the third paragraph of post #16, Snapshot nodes on average aren't that large.  Therefore I infer that Snapshot nodes link to Catalog File nodes.  This in turn would mean that, if Ash7's feature suggestion were implemented, a Snapshot node would have to be enhanced to include a file name, so that the file name in a Snapshot node could differ from the Catalog File node representing the last backup of that file.  Changing the Catalog File node instead is almost certainly a no-no, because you'd also have to change the copy of the Catalog File stored on the possibly-tape Backup Set medium (and if you didn't make that change, you'd create the problem I mentioned in post #9).  That enhancement would enable renaming of files without their being re-backed-up, but would likely require a significant change in Retrospect's handling of Snapshots.  That's why I don't think Retrospect Inc. is likely to implement Ash7's feature suggestion, since—as I said at the end of the first paragraph of post #21—the feature is most likely to benefit the probably-few home users of Retrospect.
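Purely as an illustration of the kind of enhancement I mean (the structures and field names below are invented, since the real on-disk format isn't public), the change amounts to a Snapshot entry that can carry its own name on top of the Catalog File node it links to:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CatalogFileNode:
    """Record of the file as it was actually backed up (and copied to the medium)."""
    name: str
    size: int
    md5: str
    data_offset: int          # where the file's data lives in the Backup Set

@dataclass
class SnapshotEntry:
    """Hypothetical enhanced Snapshot entry: it links to the Catalog File node,
    but may carry an overriding name when the file was renamed without re-backup."""
    catalog_node: CatalogFileNode
    renamed_to: Optional[str] = None

    @property
    def current_name(self) -> str:
        return self.renamed_to or self.catalog_node.name
```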

 

Of course the question of whether Snapshot nodes contain file names is best answered by someone actually taking a look at one.  Unfortunately, since I am not yet doing any programming on my home Macs, I don't know what "file dump" app to use to view data whose structure is unknown.  If someone reading this post is also familiar with Mac programming, maybe he/she could offer a suggestion?

Edited by DavidHertzberg
Put in links to posts referred to by number; I miss the numbering of posts that the Forums software used to do

  • 10 months later...

I will +1 the feature suggestion. In fact, I want to take it further, down a path that other backup software systems HAVE successfully implemented:

- Create a hash for each block of each file (They can define the block size :) )

- Store unique content-blocks.

- All files with identical content will only be stored once. Doesn't matter what filename, date, attributes, etc.
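Roughly, the idea in its simplest fixed-size-block form looks like the sketch below (real products use content-defined chunking and a persistent block store, so treat this only as an illustration):

```python
import hashlib

BLOCK_SIZE = 1 << 20   # 1 MiB; "they can define the block size"

def store_file_as_blocks(path, block_store):
    """Hash each block, keep the bytes only for hashes not seen before, and return
    the file's 'recipe' (ordered block hashes) so identical content is stored once."""
    recipe = []
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(BLOCK_SIZE), b""):
            digest = hashlib.sha256(block).hexdigest()
            block_store.setdefault(digest, block)   # only new content-blocks are kept
            recipe.append(digest)
    return recipe
```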


13 hours ago, MrPete said:

I will +1 the feature suggestion. In fact, I want to take it further, down a path that other backup software systems HAVE successfully implemented:

- Create a hash for each block of each file (They can define the block size :) )

- Store unique content-blocks.

- All files with identical content will only be stored once. Doesn't matter what filename, date, attributes, etc.

Retrospect Inc. engineers discussed why they thought it unwise to implement block-based file deduplication in this 2014 Knowledge Base article, especially its second section.


Thanks for that link, David. Sadly, their 2014 reasoning is quite dated only four years later.

Client system throughput improvements, hash calculation speedups, and the low cost of maintaining hash lists all combine to make block-based deduplication almost a no brainer today.

This is particularly true in light of:

* Greatly increased data file sizes

* Greatly increased use of VM's with common content yet rarely exactly common file attributes

For the last several years we used a backup solution that did block deduplication in the client (sadly, they have left the SMB market). Such methods are wonderfully fast and efficient. Not only that: because all metadata is maintained separately from content, the metadata can be very efficiently scanned for reporting, filtering and recovery purposes. The time required in Retrospect to calculate grooming and restoration jobs would become minuscule.

