system freeze backing up to a file set on an iscsi volume

Got a weird one here.


G5 Xserve running Server 10.4.9 & Retrospect 6.1.126 with driver. We have about 30 network clients backing up to file storage sets that were previously stored on an external USB drive. We have successfully moved 28 of them to a volume mounted from an Open-E iSCSI appliance using Atto XtendSan iSCSI initiator software. However, the remaining two cause Retrospect, and eventually the entire system, to freeze when it goes to back them up. One is a large backup - 186GB and counting - the other is much smaller, about 16GB. One is a PPC client, the other is Intel. Sometimes it freezes as soon as it starts writing to the storage set, other times it gets 90+ percent done and then freezes. If you force-quit Retrospect, the Finder freezes and will not relaunch, and on an attempted restart or shutdown the entire system freezes and must be cold started with the power button.


As soon as I move the storage set back to the USB drive, it works fine. There is nothing in the Retrospect log or in the system logs to give me a clue what's going on here. I am able to move large files on and off the iSCSI volume with the Finder without issue.


Anyone have any ideas? There don't seem to be a lot of people on the forums here who are backing up to iSCSI volumes.




First, note that Dantz changed "Storage Set" to "Backup Set" forever ago. So to help newer users understand, I'll stick to the current terminology.


> One is a large backup - 186GB and counting - the other is much smaller, about 16GB.

> One is a ppc client, the other is Intel


- Does this mean that each of these two problem Backup Sets backs up data from only one specific physical client computer? If so, this would make some testing easier.


The first thing to try would be to separate the Backup Sets themselves (which work on one physical/logical medium, but fail on another) from the specific Sources they use.


- Can you create a new File Backup Set and attempt to copy data from one (or the other) of the "problem" clients? If so, might you be able to Recycle the problem Backup Set and simply do another full backup?


If the problem follows the client(s), can you install the Initiator software onto another machine? That way, if things continue to be a problem, you don't have to reboot a working server.


Oops, yes, I've been using Retrospect since shortly after the earth cooled. Old terminology dies hard sometimes. :)


You are correct, each of the 30 clients backs up to its own Backup Set, so that does make matching easier, and in the case of the two problem ones, the 186GB set is exclusively from a G4 and the 16GB one is exclusively from a MacBook Pro. I have tried creating new sets for each one but the freezes continue. After recovering from the freeze, it's generally a good backup of what it has managed to get. It's not corrupted at all, though it may need a little catalog repair.


I have reinstalled the client software on each of the problem systems, v 6.1.130 fresh off the Insignia site. It doesn't make any difference, the freezes continue. I also thought it might have something to do with the size of the iSCSI volume - 1.78TB - but I created a smaller 500GB iSCSI volume and tried placing the sets on it and it still froze. The USB drive is 699GB.


Unfortunately I only have the one server available to run these from. For the meantime I have kept the USB drive connected and am still using it for these two, but I need that drive elsewhere. I also have a few more Storage, er, Backup Sets I need to move to the iSCSI volume, but I want to know what's going on before I move them in case any of them have trouble too.




OK, so you have 30 clients, and all of them work correctly except for two. That's a big enough set to rule out anything wrong with your general setup (although I assume the Atto software uses kexts, and might be the actual underlying cause of the freezes).


The Backup Sets from the old physical media are not at fault, since fresh ones also fail to work, so we can point to the client, and/or the data on those clients.


- If you define a small folder as a Subvolume, and use that as a Source, can you get reliable executions?


> Unfortunately I only have the one server available to run these from


Why does it have to be a server? I've yet to see a Mac shop with 30 machines that didn't have some old iMacs hidden away in a broom closet somewhere. Or how about your machine? It would be great to be able to reproduce the issue on another Retrospect install, and further narrow down the clients as being the trigger.


Of course, nothing on a client machine should cause the backup machine to freak out this way. But this low level software was probably never tested against the version of Retrospect that you (and the rest of us) are using. It would probably be worthwhile to open a support incident with EMC over this, once you have a few more data-points.




Thanks for your help guys.


They are actually backing up a subvolume; I should have mentioned that. By default I create a subvolume of the Users folder and exclude the Movies and Music folders and all the cache files. It's the only way to keep the sizes manageable (plus I really don't care if your rip of 'Gili' gets lost in a crash :) ). But the two that aren't working and all the rest that are were created from the same template script and all use the same selector. The 186GB one is the largest by a fair margin, but there are plenty of others in the 2-110GB range that work fine. I also ran drive repairs on both clients' hard drives (nothing found) and even tracked down the specific file each one was working on when it froze and removed it, but then it crashes on another random file.


Something freaky in the network is also on my mind, but all 30 of these clients are on the same public and routable subnet. They're all on DHCP but their addresses are registered with the server so they never change. The Retrospect server is on a different public and routable subnet in this building but we have fiber interconnects so throughput has never been an issue - I don't think anything is timing out. There is a firewall between them but again, the traffic between the server and all the clients passes through the same filters - why would it interfere with just two? And what possible effect could switching to a USB drive have on a firewall?


What I need to figure out is what makes these two clients different from all the rest. But as far as I can see there is absolutely nothing in common between them that is also different from the others that are working. I've been beating my head against the wall on this one for over a month.


I can still look at the state of my semi-frozen server: I can launch Terminal after Retrospect locks up but before I try to restart it and the whole thing goes down. Running top tells me there are 3 stuck threads. Does anyone know how I can tell which they are? I think one is a umount process - like something has caused it to want to dump the iSCSI disk but something else is not letting it.





> What I need to figure out is what makes these two clients different from all the rest.


The easiest way, from what you have described, would be to put one of them on the same switch port as one of the working clients. That would eliminate all of the layer 2 and layer 3 network issues.



> I can launch the terminal after Retrospect locks up but before I try to restart it and the whole thing goes down - running top tells me there are 3 stuck threads. Does anyone know how I tell which they are?



ps axlww -O "flags"

Open your Terminal window up wide first. The unrunnable processes (those in uninterruptible wait) will have a "U" in the status ("STAT") column. Look at the Flags ("F") column. I would bet that the threads are stuck on physical I/O. Here are the flag bits:


     P_ADVLOCK      0x00001   Process may hold a POSIX advisory lock
     P_CONTROLT     0x00002   Has a controlling terminal
     P_INMEM        0x00004   Loaded into memory
     P_NOCLDSTOP    0x00008   No SIGCHLD when children stop
     P_PPWAIT       0x00010   Parent is waiting for child to exec/exit
     P_PROFIL       0x00020   Has started profiling
     P_SELECT       0x00040   Selecting; wakeup/waiting danger
     P_SINTR        0x00080   Sleep is interruptible
     P_SUGID        0x00100   Had set id privileges since last exec
     P_SYSTEM       0x00200   System proc: no sigs, stats or swapping
     P_TIMEOUT      0x00400   Timing out during sleep
     P_TRACED       0x00800   Debugged process being traced
     P_WAITED       0x01000   Debugging process has waited for child
     P_WEXIT        0x02000   Working on exiting
     P_EXEC         0x04000   Process called exec
     P_NOSWAP       0x08000   Another flag to prevent swap out
     P_PHYSIO       0x10000   Doing physical I/O
     P_OWEUPC       0x20000   Owe process an addupc() call at next ast
     P_SWAPPING     0x40000   Process is being swapped

for details,

man ps
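
Reading those bits by eye gets tedious, so here is a rough sketch of a decoder in Python. It only uses the flag values from the table above; the function name is made up for illustration, and it assumes the "F" column is printed in hexadecimal, which is worth checking against your own ps output.

```python
# Flag bits copied from the table above (as documented in "man ps").
PROC_FLAGS = {
    0x00001: "P_ADVLOCK",
    0x00002: "P_CONTROLT",
    0x00004: "P_INMEM",
    0x00008: "P_NOCLDSTOP",
    0x00010: "P_PPWAIT",
    0x00020: "P_PROFIL",
    0x00040: "P_SELECT",
    0x00080: "P_SINTR",
    0x00100: "P_SUGID",
    0x00200: "P_SYSTEM",
    0x00400: "P_TIMEOUT",
    0x00800: "P_TRACED",
    0x01000: "P_WAITED",
    0x02000: "P_WEXIT",
    0x04000: "P_EXEC",
    0x08000: "P_NOSWAP",
    0x10000: "P_PHYSIO",
    0x20000: "P_OWEUPC",
    0x40000: "P_SWAPPING",
}


def decode_flags(value):
    """Return the names of every flag bit set in value, lowest bit first."""
    return [name for bit, name in sorted(PROC_FLAGS.items()) if value & bit]


if __name__ == "__main__":
    # Assumption: the F column is hex text, so parse it with base 16.
    column_text = "10004"
    print(decode_flags(int(column_text, 16)))
```

If P_PHYSIO shows up in the decoded list for a stuck process, that would fit the guess that it is waiting on physical I/O.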


If they are Retrospect processes, you might try something like:


ps axlww -O "flags" | fgrep -i retro

to whittle the data down to something manageable.
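
If the stuck ones turn out not to be Retrospect processes (the possible umount, for instance), filtering on the "U" status is another option. Below is a minimal sketch in Python that picks those rows out of saved ps output. The function name and the sample text are invented for illustration, the real column layout on your system may differ, and it assumes no column before STAT is blank.

```python
def stuck_processes(ps_output):
    """Return (pid, stat, command) for rows whose STAT column contains 'U'."""
    lines = [line.strip() for line in ps_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    stuck = []
    for line in lines[1:]:
        # The command is the last column and may contain spaces,
        # so split only as many times as there are other columns.
        fields = line.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue  # skip rows that do not match the header layout
        if "U" in fields[stat_col]:
            stuck.append((int(fields[pid_col]), fields[stat_col], fields[-1]))
    return stuck


# Made-up sample, NOT real output from the server in this thread.
SAMPLE = """\
  UID   PID      F  PPID CPU PRI NI     VSZ    RSS WCHAN  STAT  TT      TIME COMMAND
    0   212   4004     1   0  31  0  412345   8120 -      Ss    ??   0:01.20 /usr/sbin/example_daemon
  501   845  14004   210   0  46  0  523456  40960 -      U     ??   2:13.05 /Applications/Example Folder/Retrospect
    0   902  14004     1   0  31  0   27344    512 -      U     ??   0:00.01 umount /Volumes/example
"""

if __name__ == "__main__":
    for pid, stat, command in stuck_processes(SAMPLE):
        print(pid, stat, command)
```

Save the ps output to a file before the machine goes down completely, then feed the file's contents to the function afterwards.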




  • 5 weeks later...

Just to give this topic a bit of a bump and an update - I have been working with the Atto support people on this (they are fantastic) and they sent me a beta of the next release of the iSCSI initiator. It has fixed the problem with the smaller of the two systems, but the large one continues to freeze. We are trying to get a TCP dump of the communication to the iSCSI device, but the crash seems to render the dump file corrupt.


Then I moved another backup set to the iSCSI storage and lo and behold, it started freezing in the same way. It's a pain, but I have spotted something - these 3 sets that are causing problems have the largest catalog files of all my backups. The one that started working is the smallest of the three at 99.5MB - the other two are well over 100MB. So my suspicion is focused on the catalogs at this point - probably writing to the catalogs, since it seems to be able to scan and match okay; it's only when it goes to start writing data that it locks up. Does anybody know if Retrospect does something different to large catalogs than to small ones?
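
For anyone wanting to check the same pattern on their own sets, here is a small Python sketch that ranks catalog files by size so the big ones stand out. It assumes the catalogs are stored as separate files ending in ".cat" in a single folder; the function name and the folder path in the usage line are placeholders, not anything from my setup.

```python
import os


def catalog_sizes(folder, suffix=".cat"):
    """Return (size_in_MB, filename) pairs for catalog files, largest first."""
    sizes = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if name.endswith(suffix) and os.path.isfile(path):
            sizes.append((os.path.getsize(path) / (1024 * 1024), name))
    return sorted(sizes, reverse=True)


if __name__ == "__main__":
    # Placeholder path: point this at wherever the backup sets live.
    for size_mb, name in catalog_sizes("."):
        print("%8.1f MB  %s" % (size_mb, name))
```

Comparing that list against the sets that freeze would show whether the roughly 100MB line really separates the working ones from the failing ones.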




More info:


Thinking the problem may be in writing to the catalog rather than the data file, I tried putting the catalog on the local storage where it has been known to work, and using first an alias and then a soft link to point to the data file on the iSCSI volume. They both failed outright, but not by freezing. This time I got error messages: 105 (unexpected end of data) and 211 (media locked) respectively. I didn't really expect it to work, but the interesting thing is that, having now pointed it back to the catalog on the iSCSI volume, it's no longer freezing; it continues to give me error 105. The only other thing I changed was to set "ignore permissions on this volume" for the iSCSI volume.


The knowledgebase article for error 105 seems to be missing, but a few others that refer to 105 under specific circumstances (writing to a CD or to a certain model of Samsung drive) seem to indicate that driver updates were necessary to resolve those issues, so it sounds like pretty low-level stuff.


Having the individual parts of a separated File Backup Set (Foo and Foo.cat) in different locations is not supported. Be nice if hard/soft links fooled the program, but it was never designed for that and it doesn't work.


In general, the Finder's "respect ownership" setting of a volume where a File Backup Set is stored doesn't matter, as long as the volume is writable. But perhaps iSCSI is different. Perhaps there are ownership issues with this technology.


In your communication with ATTO, you should be sure to let them know that the program attempting to write to the volume is running with a different UID than the Finder user. Or, you could try logging into the Finder as root (you might have to set a root password in NetInfo Manager first) and seeing if it makes any difference when Retrospect _and_ the Finder user are both UID=0.


