Jump to content
Don Lee

Nasty bug - scheduled "Copy media set" only copies fraction of data

Recommended Posts

I have seen this before, but I thought it was pilot error.  Now that Retro 10 does not allow me to "accidentally" change my scripts, I am now certain that this bug is real, and it's nasty.

 

I have have scripts that "rotate" the backups on the first of the month.  I have the options set so that the media set X is copied to X_last, recycliing X_last first, and recycling X if the copy is successful.  When I do this manually, it seems to work fine. (Doing this I can "rotate" my media sets without changing the scripts.)

 

This last June 1st, these scripts ran, and my 50 GB media sets produced X_last copies that are only 200 MBytes!!!  I don't know exactly what it's copying, but it's only copying a small fraction of the data I wanted copied.

 

Note that I am copying from a disk media set to a file media set.

 

Retro is throwing my data away.  Is this a known bug?  Is there a workaround?  Do I have to run these rotations manually to be safe?

 

Retro 10.1.0 (221)

engine on Mac OS X 10.8.3 - Mac Mini with 16 GB memory

  • Like 1

Share this post


Link to post
Share on other sites

FYI - the log from the operation that "copied" the media set, turning 50 GB into 300 MBytes

 

 

+ Executing Monthly_iCompute_rotate at 6/1/13 (Activity Thread 2)
To Backup Set v9_iCompute_last...
6/1/13 12:01:01 AM: Recycle backup: The Backup Set was reset
- 6/1/13 12:01:05 AM: Transferring from v9_iCompute
6/1/13 12:01:19 AM: Execution completed successfully
Completed: 1179 files, 297 MB
Performance: 1,370.7 MB/minute
Duration: 00:00:14 (00:00:01 idle/loading/preparing)

- 6/1/13 12:01:19 AM: Verifying v9_iCompute_last
6/1/13 12:01:19 AM: Execution completed successfully

! 6/1/13: Recycle of Media Set v9_iCompute

Share this post


Link to post
Share on other sites

With my July "rotation", this happened again.  My rotation from current sets to the "previous" sets have produced stunted backup sets.

 

This is a serious problem.  My current sets appear reliable, but if I use the "copy media set" function in a script ,it randomly fails without ANY indication of failure, unless you look at the output and check it.  Logs are clean.  It's just that the size of the output is impossibly small, and the number of files copied is impossibly small.

 

The worst part is that the operation that triggers the lost data is the "recycle source on successful copy".  If I leave that unchecked, the copy seems to be OK.  This means that if I use that option, I lose BOTH my current, and my previous months' backups!  Not using that option simply means I have to go in and recycle the current set(s) manually.

 

 

I am surprised that no one has commented on this.  Is no one else copying media sets?  How do others "rotate" their sets so that they don't have to screw around with their scripts to re-connect to the new media sets?

Share this post


Link to post
Share on other sites

I just tried a restore from one of my "stunted" copies, and they restore - silently - just a fraction of the snapshot volume.  No errors.  No complaints, just lots  and lots of missing files.  There is no indication at all that anything is amiss, even though it only restores 4 out of 50 GB of data expected.

 

???????

Share this post


Link to post
Share on other sites

I have another data point.  I ran my monthly automated "rotation" on Sep 1, and some of the media set copies came out stunted, as before.  The difference is that I am NOT using the "recycle source on success" option.  I'm sure.  There is something more sinister going on.

 

I don't *think* I've ever seen this except when the operation was scheduled, FWIW.

Share this post


Link to post
Share on other sites

October rotations have started, and I'm seeing stunted media set copies again.  I enclose 2 screen shots - of the source media set, and the destination set.  The latter is a small fraction of the former, and only has about half the snapshots (backups) as seen in the console.These always seem to work when I run them "manually".  Only when scripted and scheduled do they screw up.

post-24400-0-35327300-1380605926_thumb.png

post-24400-0-26574600-1380605939_thumb.png

Share this post


Link to post
Share on other sites

I'm walking through my monthly rotations, and have noticed that most f the media set copy operations say they copied a LOT more data then they verified.  This one copied 56 GB, and the verify operation took ** 4 seconds**.  I don't think it did much.  The log:

 

 

+ Executing rotate_Skinnie at 10/1/13 (Activity Thread 1)
To Backup Set Skinnie_last...
10/1/13 2:47:55 AM: Recycle backup: The Backup Set was reset
- 10/1/13 2:47:58 AM: Transferring from Skinnie
10/1/13 4:44:23 AM: Execution completed successfully
Completed: 594740 files, 56 GB
Performance: 500.1 MB/minute
Duration: 01:56:24 (00:01:42 idle/loading/preparing)

- 10/1/13 4:44:23 AM: Verifying Skinnie_last
10/1/13 4:44:27 AM: Execution completed successfully
Duration: 00:00:04

 

I'm thinking that this must be some variation of the "verify whole media set" option on the verify script.  The verify of the copy must be deciding for some reason it only needs to verify part of the copied media set.

 

There is no option on the "media set copy" script to "verify all".

 

What's up with this?  It makes me very nervous about my resulting media set copies.

Share this post


Link to post
Share on other sites

I made a change in my Nov 1 rotations. I broke the rotate script into two parts.  Part 1 simply recycles the target media set. Part 2 does the media set copy.

 

This works reliably.

 

I draw a conclusion from this.  The option to "recycle destination media set" at initiation of copy op does not work properly. When I use it, it screws up the destination set, and results in a stunted copy of the media set.  If I do the recycle as a separate op, then everything works.

Share this post


Link to post
Share on other sites

Sorry for not replying to this thread sooner. We've tried to reproduce this issue a number of times, but we haven't been able to. Could you contact our support team to work with them on it? Please reference issue 4131 and highlight the Nov 1 script change.

Share this post


Link to post
Share on other sites

FWIW, I have contacted support, and re-run some tests on 10.5. So far (so far....) I cannot reproduce the problem with 10.5.

 

I would normally insist on reproducing the problem on the last version that exhibited the problem, but I only have one machine that can run the server, and it's a production machine, so I won't be doing that.

 

I'll keep trying, but it looks like it might be fixed.

Share this post


Link to post
Share on other sites

That's excellent news. I spoke with support, and we can't reproduce it on the latest release either. Let me know if the issue comes up again.

Share this post


Link to post
Share on other sites

May I ask if you know what was fixed that made it work? That's just to boost my warm fuzzies that it's really fixed, and not the result of lucky intermittency.  When I hit this, I lose both the current and the last month's data, so it's a big deal for me. I plan to remove my workaround on Dec 1.

Share this post


Link to post
Share on other sites

I changed my scripts at my customer sites, and this bug bit me again.  Here is one log file from a media set with 75 GB of data and 650,000 files. Note the file count, the data xferred, and the speed of the "verify".

 

 

+ Executing rotate_Skinnie at 1/1/14
To Backup Set Skinnie_last...
1/1/14 1:16:22 AM: Recycle backup: The Backup Set was reset
- 1/1/14 1:16:25 AM: Transferring from Skinnie
1/1/14 1:34:33 AM: Execution completed successfully
Completed: 77304 files, 14.8 GB
Performance: 859.3 MB/minute
Duration: 00:18:07 (00:00:31 idle/loading/preparing)

- 1/1/14 1:34:33 AM: Verifying Skinnie_last
1/1/14 1:34:33 AM: Execution completed successfully

 

The big difference here is that the engine at this site is running on a 10.6.8 Mac Mini. (not the 10.8 OS that I run in production)

Share this post


Link to post
Share on other sites

I re-ran the "rotate" scripts manually, but they may still fall short. The media set report shows that "skinnie" has 649,522 files, but the copy media set operation copies649,299 files, without any indication of error. Where are the remaining 223 files?

 

When I re-ran the script by manually recycliing "skinnie_last" and then running the "copy media set" script.

 

This makes me nervous.

 

Log below:

 

 

+ Executing rotate_Skinnie at 1/1/14 (Activity Thread 1)
To Backup Set Skinnie_last...
- 1/1/14 4:12:22 AM: Transferring from Skinnie
1/1/14 5:37:42 AM: Execution completed successfully
Completed: 649299 files, 66.7 GB
Performance: 818.9 MB/minute
Duration: 01:25:19 (00:01:54 idle/loading/preparing)

- 1/1/14 5:37:42 AM: Verifying Skinnie_last
1/1/14 6:07:00 AM: Execution completed successfully
Completed: 649299 files, 66.7 GB
Performance: 2,405.4 MB/minute
Duration: 00:29:18 (00:00:53 idle/loading/preparing)

Share this post


Link to post
Share on other sites

Note that I seem to have one installation at a customer site - Retro 9.0.2.107 (engine) that seems to work fine. The bad behavior appears to have been introduced in the 10.x versions.

Share this post


Link to post
Share on other sites

I continue to monitor this, and in the latest 2/1/2014 rotations, I saw this:

 

 

+ Executing daily_mercy_nfs_rotate_zero at 2/1/14 (Activity Thread 2)
To Backup Set v9_daily_mercy_nfs_last_tmp...
2/1/14 3:34:53 PM: Recycle backup: The Backup Set was reset
- 2/1/14 3:34:53 PM: Transferring from v9_daily_mercy_nfs_last
2/1/14 3:39:58 PM: Execution completed successfully
Completed: 22938 files, 10.1 GB
Performance: 2,047.7 MB/minute
Duration: 00:05:05

- 2/1/14 3:39:58 PM: Verifying v9_daily_mercy_nfs_last_tmp
2/1/14 3:40:31 PM: Execution completed successfully
Completed: 20891 files, 3.1 GB
Performance: 5,900 MB/minute
Duration: 00:00:32
 

 

Whether the data is right or not, having the mismatch between the data copied and data verified causes me concern.

Share this post


Link to post
Share on other sites

Bad news. It's still busted in 10.5. My log from the Mar 1 rotation:

 

 

+ Executing daily_household_rotate at 3/1/14 (Activity Thread 2)
To Backup Set v9_daily_household_last...
3/1/14 3:02:40 AM: Recycle backup: The Backup Set was reset
- 3/1/14 3:02:40 AM: Transferring from v9_daily_household
3/1/14 3:03:17 AM: Execution completed successfully
Completed: 564 files, 247 MB
Performance: 529.2 MB/minute
Duration: 00:00:37 (00:00:09 idle/loading/preparing)

- 3/1/14 3:03:17 AM: Verifying v9_daily_household_last
3/1/14 3:03:17 AM: Execution completed successfully

! 3/1/14: Recycle of Media Set v9_daily_household

 

Unfortunately, the a single "full" backup of this source is 789 MBytes, and 1496 files. The copy should have shown something like 5000 files and 900 MBytes. Judging from the log, the copy is only part of a set. Worse - no error reported.

 

Interestingly, the backups in the set run every day, and there should be a backup for every day from

1 Feb to 28 Feb.. The stunted copy shows only backups from 2/22 to 2/28.

 

I have tried to reproduce this, and it only seems to happen in the middle of the night when running unattended. It's madening, but it appears to be real. It also means that I cannot trust my rotations. I have delay the reset until after I have scanned the logs to ensure that the copies are complete. I might be overconfident that my copies *are* complete, judging just from the counts in the logs.

 

This is just the first instance of this bug I've seen. I have not finished scanning my logs. I may see more......

Share this post


Link to post
Share on other sites

I see this note in the latest 11.0.1 release:

Copy backup script failure with recycle enabled for source and destination (#4557)

 

Are there any additional notes available? For my customers who don't/can't upgrade, is there any sort of workaround?

 

What set of circumstances, exactly, triggers the problem?

  • Like 1

Share this post


Link to post
Share on other sites

Still interested in more detail on this fix. I'd like to use the feature, but don't dare until I either have that detail, or do my own debugging and testing. That took months for me in the past, only to get bit again.

 

Can I get confirmation/info on the fix?

Share this post


Link to post
Share on other sites

Sorry again for not following up sooner. The bug fix you mentioned from the release notes applies to a "Copy Backup" script that recycles both the destination (first) and then the source (last) sets--different from your scenario.

 

We haven't been able to reproduce your issue, but it's on our bug list. What is your current workaround if you don't use "Copy Backup" or "Copy Media Set" scripts?

  • Like 1

Share this post


Link to post
Share on other sites

I think this is exactly my scenario. I want to copy the "current" set to "last", recycling "last" at the start, and recycling "current" ONLY if the copy media set succeeds.

 

I will try it and report back if I see more troubles.

 

Thanks,

Share this post


Link to post
Share on other sites
I see your comment from 2013 about the "Recycle source Media Set after successful copy", but issue #4557 referenced in the release notes was introduced in the v11.0.0 release in March (and fixed in v11.0.1). That said, we found and fixed other data scenarios similar to #4557 as well, so v11.0.1 and v11.5 might fix your issue. Without being able to reproduce your issue, we can't say for certain.

 

From the logs in this thread, it looks like you've seen several different bugs:

 

* "Verify" pass is sometimes too fast: We'll look into this.

* "Recycle destination" option might have bugs: We'll look into this, but we tested this a number of times while trying to reproduce what you're seeing, along with our normal testing. We still haven't found a problem with this option.

* "Recycle source Media Set after successful copy": The issue you see here is connected to the one above, as it transfers too little and says it's successful. Is this correct?

* File counts being off by small numbers: This might be due to how we count hard links. We'll look into this.

 

Let me know if I missed something.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×