A warning to all: Check the disk where you have your media sets


Maser


Long story short...

 

I strongly encourage everybody to use Disk Utility to "Verify Disk" on the drive that contains your media set members.

 

My RAID containing my members -- *and* my backup RAID -- both had "disk catalog" errors.

 

This resulted in my losing hundreds of .rdb files -- going back months -- in many of my media sets.

 

I blame (frankly) the continued crashing/respawning of the engine under 10.6.2/10.5.8 for messing up the RAID volume.

 

This wasn't obvious until I tried to recatalog something -- that failed with numerous "missing" .rdb files.

 

I was able to *append* -- and groom -- these sets, but until I actually went to recatalog, I didn't realize there was a problem.

 

Boy, am I unhappy.

 

 

So consider this a fair warning: I strongly recommend you check the status of the disk containing your members -- you may be in for an unpleasant surprise. :-(
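If you would rather script the check than click around in Disk Utility, here's a rough sketch of the command-line equivalent. The volume names are just examples -- substitute your own.

#!/usr/bin/env python
# Rough sketch: the command-line equivalent of Disk Utility's
# "Verify Disk" button, run over every volume that holds media set
# members. The volume names are examples only.
import subprocess

MEMBER_VOLUMES = ["/Volumes/RAID1", "/Volumes/RAID2"]   # example names

for vol in MEMBER_VOLUMES:
    print("Verifying %s ..." % vol)
    # "diskutil verifyVolume" checks the volume's directory structures,
    # just like the Verify Disk button in Disk Utility.
    status = subprocess.call(["diskutil", "verifyVolume", vol])
    if status != 0:
        print("*** %s reported problems (exit status %d) -- repair it before trusting any backups on it" % (vol, status))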


Steve,

 

I feel your pain, but I'm not sure I understand the bug that you are reporting.

 

Are you reporting:

 

(1) a filesystem error (fixed by fsck / Disk Utility) of the underlying volume?

 

(2) a corrupted RAID set (what level RAID? RAID 5? RAID 1? something else?)

 

(3) a corrupted Retrospect catalog?

 

(4) .rdb files that weren't properly linked into the filesystem or closed / caches flushed?

 

(5) something else?

 

I can see how engine crashing could cause (3) or (4), but it's hard to see how an engine crash that didn't bring down the kernel could cause (1) or (2).

 

And, of course, if your "backup RAID" is a RAID 1 mirror of the primary RAID, then any problems on the primary will be propagated to the backup RAID.

 

I've personally seen (and have been bitten by) Apple Hardware RAID firmware bugs on the Xserve G5 Apple Hardware RAID card (fails to fully flush write cache on graceful power down, causing corruption of RAID 5 set - RADAR Bug ID 4350243, never fixed, only workaround is to disable write cache), and it may be that you have been doing frequent power-down cycles during Retrospect testing that have triggered a similar bug on your RAID. In a RAID 5, such bugs manifest themselves as "mystery garbage blocks from space". If the volume is large enough, the mystery garbage blocks might not land on a sensitive area and might not be noticed for a long time; when they land on filesystem data structures, it's catastrophic.

 

It's also possible that Retrospect 8 has been hanging, perhaps causing you to hard reboot the server, but, if you've got journaled filesystems on the RAID, that should have caught and resolved the problems.
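If you want to double-check that journaling really is on for those volumes, a quick sketch (the volume names are examples, and the exact label in diskutil's output varies a bit by OS release):

import subprocess

def diskutil_info(vol):
    # Parse "diskutil info" output into a dict of "Key: Value" pairs.
    p = subprocess.Popen(["diskutil", "info", vol], stdout=subprocess.PIPE)
    out = p.communicate()[0].decode("utf-8", "replace")
    info = {}
    for line in out.splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            info[key.strip()] = val.strip()
    return info

for vol in ("/Volumes/RAID1", "/Volumes/RAID2"):   # example names
    info = diskutil_info(vol)
    # On 10.5/10.6 the filesystem shows up on a "File System" line,
    # e.g. "Journaled HFS+"; newer releases use "File System Personality".
    fs = info.get("File System") or info.get("File System Personality", "unknown")
    print(vol + ": " + fs)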

 

What filesystems do you have on these RAIDs?

 

What errors were reported by Disk Utility?

 

Is it possible that you've got multiple problems that are interacting?

 

Is it possible that a corrupted filesystem might be the cause, rather than the effect, of some of the R8 engine crashing / respawning? I suspect that R8's error handling, given its immaturity (and perhaps a lack of defensive programming), might itself crash / respawn the engine when handed complete garbage for the catalog. It may be that the engine has only been tested, if at all, with perfect catalog data as input.

 

Russ


I have two external RAID 5 LaCie Bigger Quadra arrays (journaled HFS+).

 

One ("RAID1") contains my members of my media set. (Yes, it's called "RAID1" -- that's just the name -- it's a RAID5 raid.)

 

The other ("RAID2"), I make weekly (sometimes more) copies of "RAID1" to using Carbon Copy Cloner. 99% of the time I just do "file copies" of the incremental data to RAID2.

 

 

 

So, the *bug* I'm reporting is a long-standing, long-known bug: the engine crashes frequently with multiple concurrent proactive backups. Under 10.6.2, the engine crashes and dies. Under 10.5.8, the engine crashes and then starts up automatically.

 

But I'm of the opinion at this point that -- eventually -- one of these crashes is hosing the *disk catalog* of the drive containing the members of the media set.

 

*Or* the bug is somehow related to the "hangs" I get in Retro 8 (usually only with the console open for extensive periods of time) where activities will hang at a "preparing to execute" status message (requiring a stop of the engine...)

 

 

I've checked with a couple of local users of Retro 8 with similar setups. Their /Library/Logs/CrashReporter folders are chock full of crash log reports (like mine).
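If anyone wants to see how bad it is on their own machine, here's a rough tally-by-process sketch. It assumes the usual CrashReporter naming of "<process>_<date>_<host>.crash" for the system-level logs on 10.5/10.6 -- adjust if yours differ.

import os
from collections import defaultdict

LOG_DIR = "/Library/Logs/CrashReporter"   # system-level crash logs on 10.5/10.6

counts = defaultdict(int)
for name in os.listdir(LOG_DIR):
    # The part of the filename before the first underscore is the
    # crashing process's name.
    if name.endswith(".crash"):
        counts[name.split("_")[0]] += 1

for proc in sorted(counts, key=counts.get, reverse=True):
    print("%5d  %s" % (counts[proc], proc))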

 

They haven't (yet) run a "Verify Disk" with DU on their RAIDs because they fear getting my results.

 

 

 

About 4 times since setting this up (and I've been using Retro 8 since the betas), when I reboot the engine computer, "RAID1" doesn't come up and the "disk needs initialization" dialog from Disk Utility appears. This is why I have the backup "RAID2" disk.

 

RAID2 is then put into place while I use CCC to copy the sets from RAID2 back to the reformatted RAID1 and then put RAID1 back in service.

 

However, at one recent point, I reformatted "RAID2" and did a *block copy* of RAID1 to RAID2 (which was faster than trying to do a file copy of the 1.6TB of data).

 

The next time RAID1 came up "needing initialization" (a couple days ago), I put RAID2 online as the primary repository (just changing the member location of the media sets) -- something I've done a number of times in the past.

 

This time, backups appended to 7 of my 8 media sets, but the 8th would crash each time it attempted to write to it.

 

This is when I found out that, when I tried to rebuild the catalog, the media set in question -- which had 770 files -- was only rebuilding from about 70 of them -- mostly the files dated about a week ago and beyond.

 

So I tried to rebuild a couple of other media sets and had the same problem: sets that were 200GB were only recataloging 18GB, for example. And some sets will not show me anything in the "Select Disk Media set member" window during the rebuild process.
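If you want to compare what's actually sitting on the member volume against what a rebuild finds, something along these lines works -- a rough sketch; the path is just an example, so point it at wherever your member's .rdb files live.

import os
import time
from collections import defaultdict

MEMBER_DIR = "/Volumes/RAID2/Retrospect/MediaSetA"   # hypothetical path

count = 0
size = 0
by_month = defaultdict(int)
for root, dirs, files in os.walk(MEMBER_DIR):
    for name in files:
        if name.lower().endswith(".rdb"):
            path = os.path.join(root, name)
            st = os.stat(path)
            count += 1
            size += st.st_size
            # Bucket by modification month so you can see how far back
            # the surviving files actually go.
            by_month[time.strftime("%Y-%m", time.localtime(st.st_mtime))] += 1

print("%d .rdb files, %.1f GB total" % (count, size / 1e9))
for month in sorted(by_month):
    print("  %s: %d files" % (month, by_month[month]))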

 

Then I checked with DU and it said the volume couldn't be repaired (it's now in a "read only" state) -- which would make sense if the RAID1 volume was damaged at some point and my "block copy" carried the damage on RAID1 over to RAID2.

 

Since the last reformat of RAID1, there have been no "hangs" requiring hard power downs (in fact, I don't think there were any in the other 3 instances either). I thought in the past that I was having corruption problems when I was backing up to RAID1 with Retrospect 8 while simultaneously writing *to* RAID1 from RAID2 with Carbon Copy Cloner (which I don't do any more).

 

It's entirely possible this was an OS issue.

 

However, I had *daily* engine crashes under 10.6.2 (and have *daily* engine crashes/respawns under 10.5.8).

 

It would not be beyond the realm of possibility (IMO) that these crashes did the damage while multiple threads were writing files to RAID1. And because I did a *block copy* at one point (which I won't do again -- regardless of how long a "file copy" clone takes), my RAID2 backup was also damaged.
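The difference, as I understand it: a block copy duplicates the volume verbatim -- damaged directory structures and all -- while a file copy has to go through the filesystem, so it at least errors out on anything it can't read. Here's a minimal sketch of a file-level copy that complains; the paths are examples only, and this is not what Carbon Copy Cloner does internally, just an illustration.

import os
import shutil

SRC = "/Volumes/RAID1/Retrospect"   # example paths only
DST = "/Volumes/RAID2/Retrospect"

errors = []
for root, dirs, files in os.walk(SRC):
    rel = os.path.relpath(root, SRC)
    target = DST if rel == "." else os.path.join(DST, rel)
    if not os.path.isdir(target):
        os.makedirs(target)
    for name in files:
        src_path = os.path.join(root, name)
        try:
            shutil.copy2(src_path, os.path.join(target, name))
        except (IOError, OSError) as err:
            # A block copy would silently clone whatever is here, readable
            # or not; a file copy at least tells you something is wrong.
            errors.append((src_path, str(err)))

if errors:
    print("%d files could not be copied:" % len(errors))
    for src_path, msg in errors:
        print("  %s: %s" % (src_path, msg))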

 

 

 

So, at this point, of my 8 media sets, I can get about 10% of the data out of 5 of them and can't even *start* a rebuild of the other 3.

 

This means I've lost practically all backup data since September. And it was an unexpected loss.

 

I know now to run DU weekly (at least *before I do my cloning* to my backup RAID.)
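What that pre-clone check could look like as a script (a sketch only -- the volume names are examples, and the clone step is whatever you actually use):

import subprocess
import sys

SOURCE = "/Volumes/RAID1"   # example volume names
BACKUP = "/Volumes/RAID2"

for vol in (SOURCE, BACKUP):
    # Verify both the source and the backup volume before cloning.
    if subprocess.call(["diskutil", "verifyVolume", vol]) != 0:
        print(vol + " failed verification -- not cloning. Repair it first.")
        sys.exit(1)

# Both volumes verified clean -- run the clone step here (Carbon Copy
# Cloner on its schedule, rsync, whatever you actually use).
print("Both volumes verified -- OK to clone " + SOURCE + " -> " + BACKUP)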

 

And maybe run DU after every friggin crash...

 

 


They haven't (yet) run a "Verify Disk" with DU on their RAIDs because they fear getting my results.

Hmmm... sticking one's head in the sand and ignoring a possible problem is not a wise course of action. A filesystem that is wound up on itself like a Gordian knot, and which is becoming more and more trashed as time goes on, is not a healthy situation. Ignoring the problem only makes it worse.

 

Then I checked with DU and it said the volume couldn't be repaired (it's now in a "read only" state) -- which would make sense if the RAID1 volume was damaged at some point and my "block copy" carried the damage on RAID1 over to RAID2.

yup. Effectively, RAID2 became a split mirror of a "RAID 1" of RAID1.

 

Since the last reformat of RAID1, there have been no "hangs" requiring hard power downs ...

That's a recipe for disaster, whenever a hard power down is done, especially on a hardware RAID array.

 

It would not be beyond the realm of possibility (IMO) that these crashes did the damage while multiple threads were writing files to RAID1.

Yeah, but as long as the OS didn't crash and was cleanly shut down, Unix should have cleaned up the debris caused by a crashing process.

 

I know now to run DU weekly (at least *before I do my cloning* to my backup RAID.)

Won't hurt, but shouldn't be necessary as long as the OS doesn't crash and is shut down cleanly.

 

And maybe run DU after every friggin crash...

Wise if the OS crashes or is not shut down cleanly.

 

Like you, I wish that R8 was more stable.

 

Russ


I think my intent here is just to offer a cautionary tale: people should check the health of the drive containing their data repository.

 

At best -- I've been able to restore about a week's worth of data from most of my sets. Nothing going back to September, certainly, but at least I got the "backups" from the past week.

 

 

So if the problem occurred around November 24th (my best guess), there was no clear indication of it, because file copies -- and grooms! -- continued to work from that point on.

 

 

 

