
Compression revisited


kaikow


Retrospect does not do a very good job of data compression.

Last night I ran a Recycle backup that created 25.GB, but there was only 11% compression.

 

I'm using a disk backup set on a USB NTFS volume.

NTFS compression was not enabled.

 

How can I improve compression in such a situation?

 

P.S. As we speak, I am doing NTFS Compression on my other USB drive.

I'll try to remember to do a recycle backup on that drive tonight.

Link to comment
Share on other sites

Retrospect compresses files into its own proprietary format using a "lossless" compression process. This means that Retrospect compresses your data in such a way that all original bytes of your source data are preserved. This is similar in concept to the popular compression utilities PKZIP (Windows) and StuffIt (Macintosh), which compress and decompress preserving the original data byte for byte. If you back up a file using Retrospect's software data compression, and then you restore that file, you get back every byte, exactly as it was when you performed the backup.

 

In contrast, some compression methods, such as MPEG, compress files using a "lossy" compression algorithm, which actually discards some content in favor of a smaller file. For obvious reasons, lossy algorithms aren't appropriate for backup applications, but work fine for many audio or video applications. Retrospect does not use these compression methods.

 

Keep in mind that compression relies on the types of files you're backing up. If you back up a lot of text files, these compress very well. If you back up a lot of application/video/audio/graphics files, these are already compressed and will not compress any further. Your capacity will therefore suffer.

Link to comment
Share on other sites

I understand the technical issues, and Retrospect does do a better job on Normal backups.

It just seems that 11% on a full Recycle backup is low.

 

Using NTFS compression did save disk space, but slowed down the backup by about 20%.

Link to comment
Share on other sites

Perhaps you would consider posting a summary analysis of the contents of your drive, Howard. You could use a utility such as "treeprint" to export a detailed file list, including files in subdirectories, into a spreadsheet. Then, sort by file extension and do a subtotal of file size by file extension, to see what % of the files being backed up are exe/jpg/etc.
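For anyone who would rather script this than go through a spreadsheet, here is a minimal Python sketch of the same subtotal-by-extension idea. It is only an illustration: the function names `size_by_extension` and `report` are made up for this example and have nothing to do with treeprint or Retrospect.

```python
import os
from collections import defaultdict


def size_by_extension(root):
    """Return {extension: total_bytes} for every file under root."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Lower-case the extension so .TXT and .txt are counted together.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                totals[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # Skip files that vanish or cannot be read (locked, permissions).
                continue
    return dict(totals)


def report(root):
    """Return tab-separated lines: extension, bytes, share of total, largest first."""
    totals = size_by_extension(root)
    grand_total = sum(totals.values()) or 1  # avoid division by zero on empty trees
    lines = []
    for ext, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{ext}\t{size}\t{100.0 * size / grand_total:.1f}%")
    return lines
```

Calling `report` on each logical drive in turn and printing the returned lines gives a list that pastes straight into a spreadsheet.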

Link to comment
Share on other sites

I've thought about doing that, but there are 10 logical drives with a few hundred thousand files.

 

I guess that I could write a program to do this; I'll think about it.

 

 

The files on my system are likely very typical of someone doing VB/VBA programming development.

 

I have 3 SCSI hard drives:

 

1. Has C and D, with Win 2000 installed on C

2. Has F-I, with Win 2000 installed on F and another Win 2000 installed on G

3. Has J-M, with Win 2000 installed on J.

 

I use Retrospect with the Win 2000 installed on J.

 

Each of the Win 2000's has a different version of Office:

C has Office 97

F has Office 2000

G has Office XP

J has Office 2003.

 

Each Win 2000 has VB 6 Enterprise.

Each Win 2000 has the October 2001 MSDN Library, but fewer files are installed on C, F, and G than on J.

 

G has VS .Net Pro 2002.

J has VS .Net Pro 2003.

 

All have IE 6 and OE 6.

 

At least the following are shared over all 4 systems:

 

Eudora mailbox directory

OE mailboxes

Favorites

Cookies

Recent Files

My Documents

Temporary Internet Files

 

I do not play with music/audio/video, so I expect that I have little more than default installations of such files.

Link to comment
Share on other sites

I just did that. First the good news, then the bad news.

 

1. I was amazed to see that my used disk space had grown from around 18GB last June to about 31GB now.

 

2. Sorting by file extensions is not a useful exercise. The information is ambiguous. A .exe file, as well as a .dll, as well as a ..., is highly compressible. Most of the work I do involves VB/VBA programming, using Office. The files involved are highly compressible.

 

If Retrospect is choosing which files to compress based on the file extension, that's a faulty algorithm. At a minimum, they must distinguish between real .exe files and those that are really just a form of .zip or some such archive file, which is already likely highly compressed.

Only 11% compression for a Recycle backup is just too low and makes me suspicious as to how Retrospect chooses which files to compress and even how the compression is accomplished.
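One way to tell a "real" .exe from one that is really an archive, without trusting the extension at all, is to trial-compress a small sample of the file and see how much it shrinks. The sketch below uses Python's `zlib` (DEFLATE, the same family of compression most zip programs use). It is not how Retrospect decides; the function names, the 64 KB sample size, and the 10% threshold are all assumptions chosen for illustration.

```python
import zlib


def sample_is_compressible(sample, threshold=0.10):
    """Return True if DEFLATE saves at least `threshold` (as a fraction) on the sample."""
    if not sample:
        return False
    saved = 1.0 - len(zlib.compress(sample, 6)) / len(sample)
    return saved >= threshold


def file_is_compressible(path, sample_size=64 * 1024, threshold=0.10):
    """Decide from the first `sample_size` bytes of the file, ignoring its extension."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    return sample_is_compressible(sample, threshold)
```

A self-extracting archive would typically fail this test because its payload is already packed, while an ordinary executable would typically pass. The first 64 KB is not always representative of the whole file, so this is a heuristic only.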

 

I wonder if the compression would be better if I were silly enough to rename all my .exe files as .txt?

 

Compared to most folks, I am sure that I have far fewer non-compressible audio/image files. I don't give a damn about such files.

Link to comment
Share on other sites

I did an experiment and put a VB 6 created .exe on a drive that had no other files, other than the Recycle Bin.

 

I then did an Immediate backup of just that drive and the compression was 58%, which is satisfactory.

 

So if Retrospect is properly compressing .exe, .dll, Word/Excel/VB 6/VS .NET/HTML/Text/... files, frankly I don't understand why compression for a Recycle backup would only be 11%.

 

Retrospect must be refusing to compress a heck of a lot of files.

The only way to solve this mystery would be to have an optional reporting capability that lists each file and the percentage compression.

 

This information does appear to be present in the session logs if I right-click on each file and select properties.

 

Is there a way to print a report showing the needed info?

If not, is there a Retrospect SDK allowing access via VB/VBA to the info?

I'd print it to a file, rather than kill a tree and I could process the file with VB/VBA.
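Whether Retrospect exposes this information programmatically is not answered in this thread. As a stand-in, the following Python sketch estimates per-file compressibility independently of Retrospect by running each file through `zlib` (DEFLATE) and writing the result to a CSV file. The numbers it produces are what a zip-style compressor would achieve, not what Retrospect actually stored; `deflate_size` and `write_report` are names invented for this example.

```python
import csv
import os
import zlib


def deflate_size(path, chunk_size=1 << 20):
    """Return (original_bytes, deflated_bytes) for one file, read in chunks."""
    compressor = zlib.compressobj(6)
    original = deflated = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            original += len(chunk)
            deflated += len(compressor.compress(chunk))
    deflated += len(compressor.flush())
    return original, deflated


def write_report(root, out_csv):
    """Write one CSV row per file under root: path, size, deflated size, percent saved."""
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "deflated_bytes", "percent_saved"])
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    original, deflated = deflate_size(path)
                except OSError:
                    continue  # unreadable or locked file: leave it out of the report
                saved = 100.0 * (1 - deflated / original) if original else 0.0
                writer.writerow([path, original, deflated, f"{saved:.1f}"])
```

The output file should live outside the tree being scanned, otherwise the report would try to measure itself while it is being written.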

Link to comment
Share on other sites

I see what the problem is:

 

Retrospect is refusing to compress certain files that it believes that it cannot compress or for which the compression is felt to slow down the backup more than the value of the space saved. It should be my choice if I wish to take a performance hit.

 

Allow the user to select from at least the following options:

 

a. No change from current implementation, and document which files are to be compressed.

b. No compression.

c. Compress ALL files, no exceptions, other than those for which it can be proven that compression will save 0%. For example, though the saving is small, PDF files can be compressed, so if the user chooses this option, then the file will be compressed.

d. Compress all but specific file types identified by user.
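To make the four options concrete, here is a rough Python sketch of how such a choice could be modelled. Everything in it is hypothetical: the `Policy` names, the `BUILTIN_SKIP` list, and `should_compress` are illustrations of the proposal, not Retrospect's actual settings or selector contents. Option c is simplified to "always compress"; the "provably saves 0%" exception is left out.

```python
import os
from enum import Enum


class Policy(Enum):
    DEFAULT_SELECTOR = "a"      # current behaviour: skip a built-in list of types
    NONE = "b"                  # no compression at all
    ALL = "c"                   # compress everything (simplified, see lead-in)
    ALL_EXCEPT_USER_LIST = "d"  # compress all but types the user names


# Hypothetical built-in exclusion list, for illustration only.
BUILTIN_SKIP = frozenset({".zip", ".sit", ".gz", ".jpg", ".pdf"})


def should_compress(filename, policy, user_skip=frozenset()):
    """Return True if a file with this name would be compressed under the policy."""
    ext = os.path.splitext(filename)[1].lower()
    if policy is Policy.NONE:
        return False
    if policy is Policy.ALL:
        return True
    if policy is Policy.ALL_EXCEPT_USER_LIST:
        return ext not in user_skip
    return ext not in BUILTIN_SKIP
```

With this model, `should_compress("manual.pdf", Policy.ALL)` is true while `should_compress("manual.pdf", Policy.DEFAULT_SELECTOR)` is false, which is exactly the trade-off the user would be choosing between.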

 

Include an option to print out the session logs, including the %compressed for each file.

Link to comment
Share on other sites

Here's further evidence that Retrospect is skipping files that should be compressed.

 

I did a Recycle backup a few hours ago.

 

The following is for the D drive:

 

- 3/12/2004 18:18:31: Copying Micrond (D:)

3/12/2004 18:21:22: Snapshot stored, 12.1 MB

3/12/2004 18:21:25: Comparing Micrond (D:)

3/12/2004 18:21:27: Execution completed successfully

Completed: 6 files, 187 KB, with 18% compression

Performance: 2.7 MB/minute (1.5 copy, 10.9 compare)

Duration: 00:02:56 (00:02:46 idle/loading/preparing)

 

18 % is far too low.

 

The 6 files were:

 

1. 6227 byte .txt file that was compressed 66%.

2. 22016 byte .xls file that was compressed 65%.

3. 34304 byte .xls file that was compressed 65%.

4. 11335 byte .pdf file that was not compressed.

5. 76487 byte .pdf file that was not compressed.

6. 37397 byte .pdf file that was not compressed.

 

Using a zip compression program, the pdf files compress 23%, 54% and 50%.

So the overall compression for the D drive should have been much higher.
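The overall figure is a size-weighted average of the per-file figures, so the arithmetic can be checked directly. The Python sketch below uses the six file sizes and percentages listed above. It assumes the three zip percentages (23%, 54%, 50%) apply to the three PDF files in the order they are listed, which the post does not state explicitly.

```python
def overall_percent(files):
    """Size-weighted compression for a list of (size_in_bytes, percent_saved) pairs."""
    total = sum(size for size, _pct in files)
    saved = sum(size * pct / 100.0 for size, pct in files)
    return 100.0 * saved / total


# As backed up: the three PDF files were stored uncompressed (0%).
as_backed_up = [
    (6227, 66), (22016, 65), (34304, 65),
    (11335, 0), (76487, 0), (37397, 0),
]

# Same files, with the PDFs compressed at the zip figures (assumed to be in listed order).
with_zip_on_pdfs = [
    (6227, 66), (22016, 65), (34304, 65),
    (11335, 23), (76487, 54), (37397, 50),
]

print(f"{overall_percent(as_backed_up):.1f}%")      # 21.7%
print(f"{overall_percent(with_zip_on_pdfs):.1f}%")  # 55.0%
```

The first figure is in the same range as the 18% in the log. The per-file percentages are rounded and the log does not say how its figure is computed, so an exact match is not expected. The second figure shows roughly what the drive would have achieved if the PDF files had been compressed at zip-like ratios.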

 

In another session, there was a 409368 byte .pdf file that was not compressed by Retrospect.

Using a zip compression, the file gets compressed 39%.

 

Retrospect is making poor choices about which files to compress.

Link to comment
Share on other sites

I just performed the following experiment.

 

1. On the D drive, I made a copy of each of the three PDF files that Retrospect did not compress.

2. To fool Retrospect, I changed the file name extensions from PDF to TXT.

 

3. I then ran an Immediate backup on the D drive.

- 3/13/2004 15:41:15: Copying Micrond (D:)

3/13/2004 15:42:01: Snapshot stored, 12.1 MB

3/13/2004 15:42:04: Comparing Micrond (D:)

3/13/2004 15:42:07: Execution completed successfully

Completed: 3 files, 124 KB, with 6% compression

Performance: 1.4 MB/minute (0.8 copy, 7.2 compare)

Duration: 00:00:51 (00:00:40 idle/loading/preparing)

 

Quit at 3/13/2004 15:44

This time Retrospect tried to compress the files, but 6% is far too low.

 

The 3 files were compressed by Retrospect 0%, 16%, and 22%.

Zip compression is 23%, 54% and 50%.

 

So there are two issues:

 

1. Retrospect is refusing to compress files that can be compressed.

 

2. I believe that zip compression is lossless, but I'll check into that later. In any case, there may be a need to also improve the compression algorithm used.

 

The issues may be addressed independently and issue 1 should be subject to a quick fix as it has no side effects, just decreases disk space used. Users can be given the choice of the performance options I mentioned in an earlier post. Properly implemented, such options should have no side effects.

Link to comment
Share on other sites

Dantz culpa!

 

and

 

Mea culpa!

 

I was very disappointed to discover today that Dantz had failed to respond in this thread with the information that there is a built-in "Compression Selector" that is used to determine which file types will be compressed, so we have Dantz culpa!

 

However, we also have Mea culpa, because back on 10 July 2003, Dantz did point out the existence of such a selector and I had forgotten about that until today.

 

However, there are still some culpa awards to be handed out:

 

1. Retrospect is not compressing several file types that are NOT listed in the built-in compression selector. These include the HxS and HxI file types, which are in the MSDN Library and do compress, albeit, it appears, by less than 10% using zip compression. The compression selector needs to be made more complete so we can more accurately select the file types to include/exclude for compression. The selector cannot, of course, list ALL file types, but it sure should be able to list all those used for well-known products such as the MSDN Library.

 

2. Why does Retrospect NOT compress files that do NOT have their file types listed in the compression selector? By default, any type not listed in the compression selector is supposed to be included due to the "Include everything" selection rule.

 

3. As I pointed out over the past few daze, one can trick Retrospect into compressing, say, .PDF files by changing them to .TXT files. Of course, now that I have (re)discovered the compression selector, I can instead modify the selector. However, as I've also shown, the Retrospect compression algorithm does far worse than the algorithm used by a typical .zip program (in this case, Zip Central), so there needs to be some investigation towards improving the algorithm.

 

4. The compression selector also excludes the compression of other file types, such as JPEG that can be compressed. For example, I just used a zip program on 4 JPEG files and got compressions of 13, 15, 15, and 19 percent.

 

5. Although the compression selector might accomplish my goal, I am concerned because Retrospect does not compress certain file types, such as HxS and HxI, that are not excluded by the compression selector.

 

6. A useful choice might be to compress ALL files, other than archive files, such as ZIP, SIT, GZ, etc., but still compress ALL application files such as PDF, EPS, JPEG, etc.

Link to comment
Share on other sites

Archived

This topic is now archived and is closed to further replies.
