
Retrospect 5 stalls during network backup



Retrospect seems to sometimes stall while backing up network clients. Twice now I have come back expecting Retrospect to have filled up the first tape in a new cycle, and both times there was a Net Retry dialog on screen, and the backup had apparently stalled for over a day. Both times I was able to quit Retrospect, although it took several minutes to respond to clicking the Stop button.

 

 

 

Anyone having similar problems, and able to shed any light? I'm going to search the Knowledge Base for the Net Retry error and attack from that end too. I may have a real network problem, although the day-long stall seems likely to be a bug too. :-)

 

 

 

Running the backup on a G4, Mac OS X 10.1.3, SCSI VXA, 384 MB RAM, clients are over 100 Mb ethernet. I am running the Mac OS X Screen Saver, but the backup runs fine for hours after that first kicks in. The two stalls so far were while backing up Mac OS 9 clients, although they were not the first such clients to be backed up.

 

 

 

+ Normal backup using Desktop Mac (on demand) at 28/3/2002 6:27 PM

 

[...]

 

- 29/3/2002 11:56:48 AM: Copying Macintosh HD on Graeme Mac…

 

29/3/2002 11:56:48 AM: Connected to Graeme Mac

 

31/3/2002 4:00:46 PM: Execution stopped by operator.

 

[Net Retry on screen, clicked stop button, long delay before Retrospect responded.]

 

 

 

+ 31/3/2002 4:11 PM: Backup Server started

 

 

 

+ Normal backup using Desktop Mac (weekend) at 31/3/2002 4:13 PM

 

[...]

 

- 31/3/2002 7:01:49 PM: Copying Spare Bit on MikeGill Mac…

 

1/4/2002 9:24:15 PM: Script execution terminated at specified stop time.

 

31/3/2002 7:01:49 PM: Execution incomplete.

 

1/4/2002 9:24:15 PM: Execution incomplete.

 

[Net Retry on screen, clicked stop button, long delay before Retrospect responded.]
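For anyone wanting to put a number on stalls like these, here is a minimal Python sketch that subtracts two timestamps copied from the log above. The format string is my reading of the log's day/month/year, 12-hour layout, and `stall_duration` is just an illustrative helper name, not anything Retrospect provides.

```python
from datetime import datetime

# Timestamp layout used in the log above: day/month/year, 12-hour clock.
LOG_TIME_FORMAT = "%d/%m/%Y %I:%M:%S %p"


def stall_duration(started, stopped):
    """Return the time between two log timestamps as a timedelta."""
    start = datetime.strptime(started, LOG_TIME_FORMAT)
    stop = datetime.strptime(stopped, LOG_TIME_FORMAT)
    return stop - start


# Graeme Mac: connected, then nothing logged until the operator stopped it.
print(stall_duration("29/3/2002 11:56:48 AM", "31/3/2002 4:00:46 PM"))
# prints "2 days, 4:03:58"

# Spare Bit on MikeGill Mac: copy started, then nothing until the stop time.
print(stall_duration("31/3/2002 7:01:49 PM", "1/4/2002 9:24:15 PM"))
# prints "1 day, 2:22:26"
```

These are upper bounds on the stall, since the log does not say when data actually stopped flowing.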

 

 


I am seeing this as well.

 

 

 

Retrospect 5.0 running on OS 10 will stall while backing up network clients and will display a Net Retry dialog box indefinitely, or until either the client computer is somehow able to reestablish communications or someone cancels the backup. Several times I have checked up on Retrospect 5.0 and found the Net Retry dialog on screen, with the current backup server stalled for 20 hours.

 

 

 

To reproduce this, all I need to do is take a notebook computer off the network while it is being backed up, and the backup server gets stuck in an infinite Net Retry. Version 4.3 would at least give up after a few minutes when this happened. With 70 notebook computers on my network, I can see my backup server being stuck in Net Retry most of the time. Is there a known fix?

 

 

 

 


I'm seeing the same problem. Retrospect Workgroup is running on a 500x2 G4 running Mac OS X Server 10.1.3. The client is on an 800x2 G4 running Mac OS X 10.1.3. When Retrospect tries to communicate with the client, I get the "Net Retry" dialog.

 

 

 

Everything's on the same subnet, and the network works. Only the Ethernet interfaces are enabled in the Network control panel. I can ping and AFP between the two machines.

 

 

 

I followed the troubleshooting instructions, and when I checked to see that pitond was running, I found that there are 24 instances of it running!
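In case it helps anyone repeat that check, here is a rough Python sketch for counting how many copies of a daemon are running. It assumes a current system with Python 3 and a POSIX `ps` that accepts `-A -o comm=`; the helper names are made up for illustration, and I am not assuming anything about where pitond is installed.

```python
import os
import subprocess


def count_instances(ps_output, name):
    """Count lines of `ps -A -o comm=` output whose executable is `name`."""
    count = 0
    for line in ps_output.splitlines():
        command = line.strip()
        # Some systems print the full path of the executable and others
        # only its name, so compare on the last path component.
        if command and os.path.basename(command) == name:
            count += 1
    return count


def running_instances(name):
    """Ask ps for every process's command name and count the matches."""
    result = subprocess.run(
        ["ps", "-A", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return count_instances(result.stdout, name)
```

Calling `running_instances("pitond")` on the client would then return the number of instances; going by this thread, anything much above one is worth a closer look.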

 

 

 

Some history:

 

 

 

The OS X Server is a fresh install, with no previous version of Retrospect having been installed.

 

 

 

I did have the beta OS X client on the client machine (running against Retrospect 4.3), but I removed it before installing the new client. I installed the new client, but when I ran it, it complained that it hadn't been installed as "root". I tried installing with "sudo open ..." and got the same result. Finally, I logged in as root and installed, and then things worked.

 

 

 

The first time I ran the backup, it worked fine. When I tried to check the status from the client machine, it listed no backups, even though a backup had succeeded. The first time a backup failed, I tried turning the client off, and it immediately turned itself back on. I repeated this process a few times with various modifier keys, always with the same result. I then tried logging in as root, since I had been forced to install from the root account. Logged in as root, I was able to view the history and turn the client on and off. I find this situation (needing to log in as root to do anything with the client, even view history) highly unacceptable. I don't like logging in as root; it is too easy to shoot yourself in the foot.


Among other things I've experienced thus far using 5.0 Server, I've had it pause, doing a retry, on four separate occasions. I can't remember if it was two or three of those times in which the pause was while backing up our file server's main HD, a 40GB drive with ~120,000 files on it, mostly MS Office docs.

 

 

 

I'm running OS 9.2.2 with 5.0 Server, 384MB RAM, 200MB to Retro, an Asante 10/100 PCI card (newer), and a 35/70GB single-tape LaCie AIT. The fileserver is an ASIP 4.3 server, running 9.1. I have also had Net Retries on an OS X client and an OS 9.2 client.

 

 

 

Other errors I've encountered are elem.c-812 errors (fixed with new patch :-) ), elem.c-817 errors (created by new patch >:-( ), Retro server crashing with type-2 errors (fixed?), oh, and the major one - crashing my ASIP 4.3 fileserver! Both before and after running the patch!

 

 

 

Bah!


I may have isolated my hangs to one particular client (which was previously working with the 4.3 server and client), but I have had a variety of other problems, so I am not sure yet. I am installing the new server update, reinstalling on the problem client, and crossing my fingers and toes.

 

 

 

Hopefully I will complete my first full backup with Retrospect 5 soon, and be back to smoothly running incrementals!


I'm pretty sure I ran the beta installer to remove the beta. However, I found instructions in another thread (which I believe you wrote) for verifying that the old client was removed via terminal commands, and found that I had at least parts of two old clients. I followed the instructions there for removing the betas manually, then reinstalled the 5.0 client.

 

 

 

Now everything is working great: the backups work and I don't need root to check status.

 

 

 

Thanks for the help.


The beta client readme had uninstall directions for the beta client.

 

 

 

To completely remove the Retrospect client beta for Mac OS X, you will need to log in as root and manually remove the following files and folders:

 

 

 

/Applications/Dantz Beta/

 

/var/log/retroclient.log

 

/var/log/retroclient.history

 

/var/root/retroclient.state

 

/Library/StartupItems/RetroClient
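Before deleting anything, it may be worth checking which of those items are actually still there. A small dry-run sketch in Python follows; the paths are the ones from the readme above, while `leftover_paths` is only an illustrative name. It reports and removes nothing.

```python
import os

# Paths quoted from the beta client readme above.
BETA_CLIENT_PATHS = [
    "/Applications/Dantz Beta/",
    "/var/log/retroclient.log",
    "/var/log/retroclient.history",
    "/var/root/retroclient.state",
    "/Library/StartupItems/RetroClient",
]


def leftover_paths(paths=BETA_CLIENT_PATHS, exists=os.path.exists):
    """Return the subset of `paths` that is still present on disk."""
    return [path for path in paths if exists(path)]


if __name__ == "__main__":
    # Dry run: list what is left, delete nothing.
    for path in leftover_paths():
        print("still present:", path)
```

Run it as root, otherwise the item under /var/root may be reported as missing simply because the directory is not readable.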

 

 


  • 2 weeks later...

My problem clients have always been Mac OS 9 so far. One client in particular stalls the backup every time: dual 450 MHz G4, Mac OS 9.0.4, client 5.0.198. So far the client has always been shut down into the Retrospect client; I have not yet tried a backup while the system is up.

 

 

 

Had a dream run with 48 hours of almost continuous network backups, went home, and five minutes after I left the bad client got to the top of the list. 26 hours later Retrospect gave up with a 519 network error. Client was still top of list. 10 hours later, Retrospect gave up at script stop time. Client was top of list for another script. 2 hours later I suspect someone rebooted the client, although the log shows another 519 error.

 

 

 

I will eventually get this client working (already tried reinstalling the client, disk repair software is next on the list, then the full 519 troubleshooting guide), but I think the server should behave better too. :-)

 

 

 

I am tempted to set the minimum transfer speed, but since the backup never starts this may not help, and this may also introduce problems with the Mac OS X clients.

 

 

 

(Server is currently a dual 450 MHz G4, Mac OS X 10.1.4, Retrospect 5.0.203, backing up to a VXA tape drive over FireWire with a Microtech SCSI/FireWire adaptor.)

 

 


I'm seeing the same thing in spades.

 

 

 

I back up across a VPN that spans a significant distance, and latency and packet loss are part of the terrain. In addition, I have a flaky DSL line at one site that periodically loses sync (and takes about 30 seconds to retrain.)

 

 

 

I see two scenarios. One is the "net retry" message that may hang for days. The other is Retrospect giving up with an error 519. In both cases, connectivity is basically good, with occasional random packet drops and those 30 second outages. In either case, I can successfully ping the client machine while the "net retry" message is up--these are not long-lived outages by any means.

 

 

 

It appears that the retry mechanism is broken at a number of levels. Firstly, it may hang forever, which it should never ever do. A timeout (configurable please) after which the connection will be closed should be easy to implement. Secondly, the retransmission mechanism seems broken. I've seen other problems under OS X (the Mail program will wedge in the face of packet loss, for instance) which leads me to believe that perhaps the OS X TCP implementation or the OT interface to it is messed up.
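To make the request concrete, here is a minimal Python sketch of a retry loop with a configurable overall deadline. This is only an illustration of the idea, not how Retrospect works internally; `retry_until_deadline`, `GaveUp`, and the default interval are all invented for the example.

```python
import time


class GaveUp(Exception):
    """Raised when the deadline passes without a successful attempt."""


def retry_until_deadline(attempt, deadline_s, retry_interval_s=5.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call `attempt()` until it returns without raising OSError.

    Raises GaveUp once `deadline_s` seconds have elapsed, so a single
    unreachable client cannot hold up the rest of the run forever.
    """
    started = clock()
    while True:
        try:
            return attempt()
        except OSError as error:
            remaining = deadline_s - (clock() - started)
            if remaining <= 0:
                raise GaveUp(f"no success within {deadline_s} s") from error
            # Never sleep past the deadline.
            sleep(min(retry_interval_s, remaining))
```

A caller would catch `GaveUp`, log the client as failed, and move on to the next client in the list. The `clock` and `sleep` parameters exist only so the loop can be tested without real waiting.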

 

 

 

I had some problems under 4.3 and OS 9, but not to this scale. (The biggest problem that I had with 4.3 was that the backup server would get an error 519 but the client would not see the connection close; the client would then block any further connections forever. I ended up having to put in an AppleScript that turned the client off and on again every 24 hours. Dumb.)

 

 

 

All this leads me to believe that Retrospect is not tested with a packet drop simulator in the middle, given how easily packet drops seem to confuse it. This is not acceptable.

 

 

 

For the record, the server is running OS 10.1.4 and Workgroup 5.0.205; the client is running OS 9.2.2 and Retrospect Client 5.0.198. I'll do some investigation to see if I have any difficulty with OS X clients.


I dug into this quite a bit. The short version is that the problem appears to happen only when backing up an OS 9 client (9.2.2 in this case) to an OS X server (10.1.3 in this case). The problem occurs with both the 4.3 client and the 5.0.198 client.

 

 

 

The scenario is a VPN running across the public Internet, with T1 access at the server end and fast DSL (1.5Mbps uplink) at the client end. The end-to-end latency is roughly 125 msec.

 

 

 

The problem is that data stops flowing at some random point in the process, sometimes during the scan, and sometimes during the actual backup. The "net retry" screen comes up and stays for a while. In some cases the backup eventually terminates with an error 519, but in other cases the server simply hangs until the scheduled period for the backup ends (blocking the backup of any other clients for that night, bleah). There is no connectivity loss between the client and server when this happens (as indicated by leaving a ping running between the server and client).

 

 

 

When backing up OS X clients at the same site via the same path, no such problems occur.

 

 

 

I did a tcpdump capture of the session, and the results were interesting. Things were flying along for a number of minutes, and then apparently one of the data packets coming from the client got lost. The server TCP dutifully stopped changing its ack sequence number, but the data continued to flow from the client (not surprising given the latency and whatever queueing is taking place on the client). Data continued for 1.75 seconds after the packet loss, delivering an additional 65536 bytes of data. Then the client caught the clue that the server had stopped acking data, and retransmitted the missing packet. The server then acked all of the data and reopened its receive window, but the client never sent another packet after the retransmission. Once the server had acked all of the data, it attempted to send a packet back to the client, and that packet was then retransmitted for several minutes, with exponential backoff eventually raising the retransmission interval to 64 seconds. After eight minutes the server threw in the towel, sent a TCP reset, and returned an error 519.
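As a rough cross-check of that timeline, the following Python sketch lists the retransmission intervals you would get from plain doubling with a 64 second cap and an eight minute limit. The 1 second starting interval is purely my assumption (the capture description does not give it), and `backoff_schedule` is an illustrative name, not a real TCP stack function.

```python
def backoff_schedule(initial_s, cap_s, give_up_after_s):
    """Return the waits between retransmissions under capped doubling.

    Stops before the wait that would push the total past `give_up_after_s`.
    """
    intervals = []
    elapsed = 0
    interval = initial_s
    while elapsed + interval <= give_up_after_s:
        intervals.append(interval)
        elapsed += interval
        # Double the wait each time, but never beyond the cap.
        interval = min(interval * 2, cap_s)
    return intervals


# Assumed 1 s initial interval; 64 s cap and 8 minutes are from the capture.
schedule = backoff_schedule(initial_s=1, cap_s=64, give_up_after_s=8 * 60)
print(schedule)
print(len(schedule), "waits,", sum(schedule), "seconds in total")
```

Under those assumptions the schedule is 1, 2, 4, 8, 16, 32 and then six waits of 64 seconds: 12 waits covering 447 seconds, which is in the same ballpark as the eight minutes seen on the wire.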

 

 

 

In another case, the server had no data to send at the time the client went quiet, so once the server acks all of the client data, there is no communication at all (from TCP's point of view, everything is hunky-dory, there's just nothing to say.) In this case, the server hung indefinitely after displaying the Net Retry. This leads me to believe that the termination and reporting of error 519 is triggered by a retransmission failure rather than purely as a timeout, which is broken.
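The idle case is the one where only an application-level timer can help, because without keepalives TCP itself has nothing to complain about. A minimal Python sketch of such a timer on the receiving side is below; `recv_or_give_up` and `PeerSilent` are names invented for this example, and this is not a description of Retrospect's code.

```python
import socket


class PeerSilent(Exception):
    """Raised when no data arrives within the idle timeout."""


def recv_or_give_up(sock, max_bytes, idle_timeout_s):
    """Read from `sock`, but give up if the peer stays silent too long.

    A connection on which neither side has anything to send looks healthy
    to TCP, so the limit has to be enforced by the application.
    """
    sock.settimeout(idle_timeout_s)
    try:
        return sock.recv(max_bytes)
    except socket.timeout as error:
        raise PeerSilent(f"no data for {idle_timeout_s} s") from error
```

With something like this around every read, the hung case and the retransmission-failure case would both end in an error after a bounded time, rather than only the latter.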

 

 

 

It is also interesting that the client continued to send data all the way up to the 64K limit (presumably the retransmission buffer size in the client) but ignored the fact that the server never advertised a TCP window larger than 33304 bytes. This doesn't hurt anything (and the server apparently kept all of the data that arrived outside of the advertised window, since it acked all 64KB once the retransmission took place) but it seems odd.
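For what it is worth, the arithmetic on those figures is quick to check. The Python snippet below uses only numbers quoted from the capture description; the variable names are mine.

```python
# Figures quoted in the capture description above.
BYTES_AFTER_LOSS = 65536     # data delivered after the lost packet
SECONDS_AFTER_LOSS = 1.75    # time over which that data arrived
ADVERTISED_WINDOW = 33304    # largest window the server advertised

# Bytes the client sent beyond what the server said it would accept.
bytes_beyond_window = BYTES_AFTER_LOSS - ADVERTISED_WINDOW

# Average arrival rate over that interval, in bits per second.
average_bits_per_second = BYTES_AFTER_LOSS * 8 / SECONDS_AFTER_LOSS

print(bytes_beyond_window)             # prints 32232
print(round(average_bits_per_second))  # prints 299593
```

So the client overran the advertised window by roughly 32 KB, and the data trickled in at about 300 kbit/s, well under the 1.5 Mbps uplink.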

 

 

 

In all of the cases I've investigated, the client always disconnected (became ready again) reasonably quickly, even in the cases where the server hung (another server was able to attempt to back up the machine while the first server was hung, but of course suffered the same fate.) Interestingly, there is no sign of a TCP FIN or RST from the client at the point at which it gave up. In the cases where the server declared a 519 error, it sent a TCP reset, but when it hung it did not (unsurprisingly.)

 

 

 

So it looks like there are at least two problems at work here. First, something about the OS 9 client (and/or the OS 9 TCP implementation, which is presumably part of Open Transport) gets confused in the packet loss case and never resumes data transmission after the packet loss and retransmission. Second, the server does not always time out to return a 519 error when the client fails in this way. A minor third bug is the client TCP's ignoring of the advertised window from the server.

 

 

 

I'll send the tcpdump logs by email.


  • 2 weeks later...

Wow,

 

 

 

I can't thank you enough for digging that deep. We are having the exact same problem. So far the only thing that has had any effect is changing the port configuration on the Cisco switch.

 

10/100 set to AUTO

 

duplex set to AUTO

 

Portfast is ENABLED

 

 

 

Enabling portfast made a big difference.

 

 

 

I'm still irked that it doesn't time out and move on when it hits a 519.


  • 2 weeks later...

I am experiencing similar problems. All of the clients on our new G4 (Mac OS X) backup server are set to back up overnight, and every time I have checked so far, Retrospect has hung waiting on one of the clients (last night it was an AppleShare IP 6.3.3 server, which crashed in the process), and of course none of the remaining systems had been backed up.

 

 

 

Is there a way to specify a time out, or is there a bug fix in development?


The "stalls" in backup are a showstopper preventing us from upgrading to Retro 5.0. I see problems much like those described here (Retro 5.0 on OS X backing up OS 9 clients, come back in the morning to find the server has spent 7 hours doing "net retry" on a single client). I think the problem may be more general, though -- this morning I found the server hung and unresponsive to anything except "Force Quit", apparently having stalled sometime during the "compare" stage in backing up a Windows 2000 client. On the same computer (in OS 9), Retro 4.3 server occasionally has 519 errors, but they don't hang up the entire backup process, so it's much more effective in getting the main task accomplished (back up as many of our 200+ clients as possible; if a few get missed occasionally, too bad).

 

 

 

Is there no way for us to help provide debugging information to help solve this? Do you want tcpdumps from the server?


Similar situation here, but different configuration which may assist diagnosis and a fix (hopefully).

 

 

 

iMac FP running OS X 10.1.4 and Retrospect Desktop 5.0.205. Clients are Windows XP over an 802.11b wireless network.

 

 

 

When backing up either client, the retry window appears and Retrospect hangs until Stop is pressed in the retry window. Most interesting: the client machine hangs and requires a reboot (!).

 

 

 

Yikes.

 

 

 

 


Retro 5.205 is set up on a G4 733 (OS X) to back up 3 other computers and itself, all running Mac OS X. The backup script continually stops w/the Net Retry.

 

 

 

I have not been able to back up ANY machine for over a week.

 

 

 

Any ideas or workarounds?


> Retro 5.205 is set up on a G4 733 (OS X) to back up 3 other computers and
> itself, all running Mac OS X. The backup script continually stops w/the Net Retry.

 

 

 

Suggestions:

 

1) If it is always one client, try troubleshooting that client. I have one client I cannot back up right now; it always stalls after simple fixes, and the user is not ready for extreme measures.

 

2) If it is every client, it might be your server, or your network. Try backing up from a different computer.

 

 

 

My basic understanding is that the Net Retry dialog indicates a problem with your network communications, which you can troubleshoot. The indefinite stall on the Net Retry is presumably a bug in Retrospect, but you still have a problem worth troubleshooting irrespective of the bug.

 

 

 

There are some exhaustive suggestions for troubleshooting 519 errors in the KnowledgeBase which will give you some good ideas.


  • 2 weeks later...

Well, it is one client. Everything being backed up is running 10.1.5, but this particular machine is a B&W upgraded to a G4. All versions of Retrospect are the latest.

 

 

 

I believe I can get the machine to back up if it was just in use. All of the other machines back up fine regardless. The operation gets stuck in Net Retry if the backup machine isn't being used (only in the case of the B&W). If you wake the backup server up, it will time out after a short time. Otherwise it hangs there indefinitely.

 

 

 

Does anyone else have problems backing up B&W machines?


In reply to:

Does anyone else have problems backing up B&W machines?


 

Or, does anyone else have problems backing up B&W machines with upgraded CPUs?

 

 

 

I just added to the "Bootable Duplicate" thread, where the user having a problem also has a B&W with a G4 upgrade.

 

 

 

Perhaps it would be a good idea for you to test with the original Apple hardware.

 

Or, at the very least, include the specific hardware configuration in your post so others can compare (and contrast) with their experiences?

 

 

 

DAve


I had reduced my indefinite Net Retry scenario to an occasional problem by excluding one particular client from the backups, since this client always caused the problem. I finally had a chance to take over the computer. A backup over a direct crossover Ethernet cable showed the same problem. Disk First Aid reported no problems. I used the 519 troubleshooting guide and tried Test in Drive Setup; within seconds it reported an unrecoverable error. A local backup confirms that the hard drive is faulty!

 

 

 

So as usual Retrospect was the first thing to detect a problem.

 

 

 

Of course I still don't think the server should block on a bad client, but one step closer to everything working. :-)


Archived

This topic is now archived and is closed to further replies.
