
6.1 Slow performance



Hi there,

 

Since the weekend I have been seeing very slow performance when backing up clients.

 

The system is a fairly new Xserve, with the most recent Retrospect and client versions installed.

 

As nothing was changed, this is really strange.

I can see in Activity Monitor that the network speed goes up to 11 MB/s, immediately drops for a while, and then goes up again. So the speed is basically there (clients are connected via 100 Mbit). The result is that all clients run at 100-200 MB/min, whereas before it was something like 450-700 MB/min.
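To put the figures above on one scale, here is a minimal sketch that converts the link rate and the Activity Monitor peak into MB/min. It assumes decimal units and ignores protocol overhead, so the ceiling is optimistic; the function names are made up for illustration.

```python
# Unit conversions for the figures quoted in the post above.
# Assumption: 1 MB = 8 Mbit, and protocol overhead is ignored.

def mbit_per_s_to_mb_per_min(mbit_per_s):
    """Convert a link rate in Mbit/s to MB/min."""
    return mbit_per_s / 8 * 60


def mb_per_s_to_mb_per_min(mb_per_s):
    """Convert a transfer rate in MB/s to MB/min."""
    return mb_per_s * 60


link_ceiling = mbit_per_s_to_mb_per_min(100)   # theoretical ceiling of a 100 Mbit link
observed_peak = mb_per_s_to_mb_per_min(11)     # the 11 MB/s bursts seen in Activity Monitor

print(f"link ceiling:  {link_ceiling:.0f} MB/min")
print(f"observed peak: {observed_peak:.0f} MB/min")

# How much of the ceiling each reported backup rate represents.
for rate in (100, 200, 450, 700):
    print(f"{rate} MB/min is {rate / link_ceiling:.0%} of the link ceiling")
```

Read this way, the earlier 450-700 MB/min was close to what a 100 Mbit client link can deliver, while the current 100-200 MB/min leaves most of the link idle.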

 

Does anyone have a hint on how to fix it?

Could you provide the exact Retrospect versions, and the Mac OS X versions running on the xServe and on the clients?

 

What changed this past weekend (other than daylight savings time)? Something had to change.

 

Consider the possibility that you are thrashing because of lack of RAM, and just passed a threshold that is causing lots of paging. Retrospect gobbles RAM as the number of files in the backup set (accumulated over time) gets large. Lots of sorting and comparing happening.

 

A way to test this would be to try a backup to a new backup set, and/or to look at paging activity.

 

Russ


Hi Russ,

 

It is an Intel Xserve 2.26 Quadcore, 6 GB RAM, 3 HDDs, ATTO UL5D, Quantum SuperLoader3 16-slot LTO3.

 

OS X Server 10.5.8

Retrospect 6.1.230

RDU 6.1.16.100

 

The clients are mixed (10.3 Server and 10.5 clients); all of them run Retrospect Client 6.2.234.

 

Daylight saving time can't be the issue, as that change already happened the week before. No, nothing changed over the weekend. I even restarted the machine on Monday. Looking at memory, Retrospect takes about 1 GB of virtual memory and almost no physical memory (80 MB). Nothing else is running on the machine, so about 4 GB of physical memory is available.

 

What I will try, then, is to run a new media backup. It is earlier than scheduled, but at least it gives me a chance to check. There are currently 3 LTO3 tapes in this set, but I cannot tell how many files are in the set. Sure, it must be quite a few, but the interval between new media backups has been the same for the past 4 years, and it is the file sizes that are growing rather than the number of files.

 

Otmar


Well, the thrashing could be happening on either side of the network connection. If your clients are similarly configured, perhaps something pushed them over the edge on RAM.

 

Also possible that something happened to saturate your network (perhaps some device is flapping?) or somehow your server has gone into half duplex, etc.

 

Have you checked out your network? Have you tried moving the xServe to a different switch port, or using a different cable? Although 100 Mbit is pretty tolerant of cables, not as fussy as GigE.

 

Has local backup speed on the xServe gone down too? Consider that you might have a disk failing on the xServe, and that it's spending a lot of time doing retries, or that you have an ECC situation on the xServe's RAM that is taking up a lot of time doing logging of the errors, etc. Been there, done that, shortly after we put our xServe into production until the bad ECC RAM module was replaced.

 

Russ


Hi Russ,

 

My first contact was the network guys ;)

 

They checked the switch for errors or misconfiguration and found nothing. The Xserve is still connected at 1 Gbit and the clients are still on 100 Mbit. No drops or errors on the interfaces.

 

All Macs are "bad", so they have their own subnet so as not to interfere with the "good" systems. The only thing I haven't managed to get so far is a dedicated 1 Gbit line for our "bad" systems.

 

Local Backup speed is fine. I did a full backup of the backup server itself on Tuesday and it was between 1500 and 3000 MB/min to LTO3.

 

Regarding the ECC errors: where would they be reported? In Activity Monitor? This would be another possibility.

 

Otmar


ECC errors are reported in Apple's Server Monitor program, choose the Memory button/tab. Also logged to the console, but they can get lost in the noise.

 

The errors are of the form:

Dec 6 19:14:19 mail kernel[0]: WARNING: 2 parity errors corrected in DIMM1/J12
Dec 6 19:17:26 mail kernel[0]: WARNING: 2 parity errors corrected in DIMM1/J12

and, in Apple System Profiler:

Memory:

    DIMM0/J11:
        Size: 512 MB
        Type: DDR SDRAM ECC
        Speed: PC3200U-30330
        Status: OK

    DIMM1/J12:
        Size: 512 MB
        Type: DDR SDRAM ECC
        Speed: PC3200U-30330
        Status: ECC Errors
        ECC Correctable Errors: 6

    DIMM2/J13:
        Size: 512 MB
        Type: DDR SDRAM ECC
        Speed: PC3200U-30330
        Status: OK

    DIMM3/J14:
        Size: 512 MB
        Type: DDR SDRAM ECC
        Speed: PC3200U-30330
        Status: OK

The Apple System Profiler numbers get reset on reboot. When the errors come fast and furious, lots of logging can happen.
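Since these messages can get lost in the console noise, a small script can tally them per DIMM. This is only a sketch: the pattern is derived from the two sample lines above, the optional singular form ("error") is an assumption, and where the log lines come from is left to the reader.

```python
import re
from collections import Counter

# Pattern based on the two sample kernel messages quoted above.
# Assumption: a single corrected error would be logged as "1 parity error corrected".
ECC_LINE = re.compile(r"WARNING: (\d+) parity errors? corrected in (\S+)")


def count_corrected_errors(lines):
    """Sum the corrected parity errors per DIMM slot over an iterable of log lines."""
    totals = Counter()
    for line in lines:
        match = ECC_LINE.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


sample = [
    "Dec 6 19:14:19 mail kernel[0]: WARNING: 2 parity errors corrected in DIMM1/J12",
    "Dec 6 19:17:26 mail kernel[0]: WARNING: 2 parity errors corrected in DIMM1/J12",
]

for dimm, total in count_corrected_errors(sample).items():
    print(dimm, total)
```

Passing an open log file instead of `sample` would work the same way, since the function only iterates over lines.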

 

Russ


Otmar wrote:

    All Macs are "bad", so they have their own subnet so as not to interfere with the "good" systems. The only thing I haven't managed to get so far is a dedicated 1 Gbit line for our "bad" systems.

    Local backup speed is fine. I did a full backup of the backup server itself on Tuesday and it was between 1500 and 3000 MB/min to LTO3.

You know, this is so very suspicious.

 

Sure points to a networking issue. Might be interesting to pull one of the clients in, attach to a hub or switch directly attached to the xServe, see what results.

 

In fact, everything on the xServe has to be working perfectly except perhaps for the xServe's network interface card (NIC).

 

So, consider that something funny could be happening with the xServe's NIC.

 

But it's got to be a networking issue.

 

What do the network statistics show for collisions, bad packets, etc., on the xServe's interface for traffic with the clients?

 

Russ


Hi Russ,

 

So, I just checked for the ECC errors. None are reported.

 

As I was on the machine anyway and one of the clients was being backed up at 100 MB/min, I opened a terminal on the client and on the server and ran netstat on each to see what happens. I added "-w5" to get output every 5 seconds. There were 10 seconds of good transfer and then 30-50 seconds of silence, the same on both NICs. So I logged on to the server, mounted a volume from the file server, and transferred about 20 GB. No problem. The client NIC stays the same, while the server NIC goes up to about 90 MB/s.

 

So, what does that mean? Actually, I don't know. What I did beforehand was to transfer files from the file server to a different client than the one being backed up. The network speed is absolutely fine.

 

So it can't be the switch for the clients or the one for the servers. Network speed is fine except for Retrospect operations. No errors are reported on any interface.

It can't be the drive or the card, as local speed is fine.

What remains is the backup set itself. So the next option is to run the new media backup!?

 

By the way, there is a "new" 8.x update. Is that one more reliable and worth another try?

 

Otmar


Perhaps your analysis is right, but there's still the possibility of something flapping in the wind on the network (a jabbering NIC, or spanning tree issues, etc.). It would be interesting to take the rest of the network out of the loop and have the xServe and client attached through a local switch at the xServe. That's the only way I know to eliminate the possibility of network issues.

 

What happens if you do the network monitoring on port 497 traffic while a Retrospect backup of a client is in play?

 

Retrospect 8 gets more reliable with each bugfix release.

 

For us, there are two show-stopper issues:

 

(1) can't read older backup sets - Hard to imagine that anyone would ever need to restore from an old backup set, but, to me, that is the only purpose of a backup program....

 

(2) not scriptable - makes it impossible to coordinate stopping / checkpointing / starting services on the server.

 

Russ


Hi Russ,

 

As this is a bit difficult to do remotely, I will try a new set with just one client (once this evening's scripts are through). After that I can see tomorrow morning what else to check. And during the backup I can monitor the "Retrospect port" and see what happens.

 

Otmar


Sounds like a plan. I realize you can't reconfigure topology remotely, but I've never had any trouble doing Apple Remote Desktop through a VPN tunnel to our server from a remote location, and controlling Retrospect that way. It's no better or worse than doing things onsite because our server is headless.

 

Russ

 

 


Yes, that's what I do, except that instead of ARD I have been using VNC for a couple of years now, and it works fine for me.

 

I still hope that I can switch to 8 with one of the next releases. If it is stable enough, I would even be willing to build "my" disk-to-tape version with many scripts instead of multiple backup sets. The idea was to somehow "cache" the data on local disks and then store that data on different tapes (onsite, offsite, rotation, etc.). Unfortunately, 8 only offers the option of copying one disk backup to one backup set destination, rather than sending other sources to different sets.

I would also like to have scripting support, as I think the built-in mail functionality is too basic to be usable in a larger environment.

 

EDIT:

I forgot to say that the speed for larger files is slower than for smaller files. This is really strange and is visible on every client, for example when the Entourage DB is being backed up. For smaller files the speed goes up again.


Otmar wrote:

    I forgot to say that the speed for larger files is slower than for smaller files. This is really strange and is visible on every client, for example when the Entourage DB is being backed up. For smaller files the speed goes up again.

That's really interesting. Investigate the MTU on the paths from the xServe to the clients, and things like DF (don't fragment) setting, etc. Perhaps something did change in your network infrastructure with regard to MTU or windowing size. Retrospect tries hard to maximize throughput by using big stuff and pumping the packets fast. If something changed in the path, perhaps big packets, and lots of them in a row, are getting dropped and retransmitted.

 

That might not show up with Finder type copies, which might use smaller packets, etc.
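One common way to probe the path MTU is to ping with the Don't Fragment bit set and a payload that exactly fills the expected MTU. The sketch below only builds such a command and does not run it. It assumes IPv4 without header options, and that `-D` is the Don't Fragment flag of the BSD/Mac OS X `ping` (Linux uses `-M do` instead); the host name is a placeholder.

```python
IPV4_HEADER_BYTES = 20   # IPv4 header without options
ICMP_HEADER_BYTES = 8    # ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def df_ping_command(host, mtu=1500, count=3):
    """Build (but do not run) a ping command with Don't Fragment set.

    Assumption: '-D' sets DF on BSD/Mac OS X ping; check the local man page.
    """
    return ["ping", "-D", "-s", str(max_ping_payload(mtu)), "-c", str(count), host]


# 'client.example.com' is a hypothetical host name.
print(" ".join(df_ping_command("client.example.com")))
```

If full-size packets fail while a smaller `mtu` value gets through, something on the path has a smaller MTU than the endpoints expect.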

 

Russ


Hi Russ,

 

So, I spent nearly the whole day narrowing down the source of the issue. At the end of the day I opened a call with Quantum, because it has to be the drive.

 

So, what did I do?

1. New script, one client, to tape, tracing both network interfaces.
Result: At a 5-second sampling interval, 7-8 intervals with no traffic, then 3-4 with full traffic.

2. Same script, but to disk.
Result: Wow! Full speed.
Result 2: It can't be a network issue.

3. New script from server disk to tape.
Result: Same as over the network, but a little more speed. Still far away from LTO3 speed.

4. OK, give the new 8.1.626 a try.
Result: Same as 6.1.

5. OK, check the SCSI card. New firmware can't make things worse in the current situation. Reduced SCSI speed and disabled tagged queuing.
Result: Nothing changed.

6. OK, once again: new LTO3 media, drive cleaned, Xserve and loader restarted.
Result: Nothing changed.

7. to 15. OK, let's check compression. Various combinations of hardware/software/no compression, on both 6.1 and 8.1.
Result: No change.
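The stop-and-go pattern from step 1 can be turned into a rough average rate. This is a back-of-the-envelope sketch: it assumes the busy intervals run at the 11 MB/s burst rate reported in the opening post and that idle intervals carry no traffic at all.

```python
def effective_rate_mb_per_min(busy_intervals, idle_intervals, burst_mb_per_s):
    """Average rate in MB/min when traffic only flows during the busy intervals."""
    duty_cycle = busy_intervals / (busy_intervals + idle_intervals)
    return duty_cycle * burst_mb_per_s * 60


# Step 1 observed 3-4 busy and 7-8 idle intervals of 5 seconds each.
# Assumption: busy intervals run at the 11 MB/s burst rate from the first post.
low = effective_rate_mb_per_min(3, 8, 11)
high = effective_rate_mb_per_min(4, 7, 11)

print(f"estimated average: {low:.0f} to {high:.0f} MB/min")
```

The estimate lands at roughly 180 to 240 MB/min, the same order of magnitude as the 100-200 MB/min reported, which suggests the slowdown comes from the pauses rather than from a lower rate while data is flowing.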

 

As everything started with reports of bad media or a needed cleaning, followed the day after by a bad cleaning tape, there is nothing else to check for the moment but the tape drive itself. Unfortunately I have just one loader on hand ;)

 

The good thing about today: I got a bit more familiar with 8.1, and everything I did worked as expected, except for the tape speed. It was also quite stable. I didn't check for the features I was missing in the last release, though I assume they will not be there.

 

I will keep you posted once I have received feedback from Quantum.

So, here is an update on the current status.

Unfortunately I couldn't run the diagnostics, as the suites are only available for Windows/Linux/Solaris and I couldn't get hold of a Windows machine.

So in the end I received a replacement unit, which I immediately tested with a sample script on a new LTO3 medium. It ran just fine. All other scripts ran fine, too, except that I received one load error. I will keep an eye on whether it happens again.

I would really like to know how and why a basically new drive could suddenly become defective: not completely broken, but "just" slow as hell.

The good thing is that I now know that Quantum's support is really reliable and fast.
