Jump to content

NAS Volumes Disconnect During Scan


Recommended Posts

Greetings!

Running Retrospect Server 16.6.0 (114) on a MacPro trashcan updated to Catalina 10.15.3.

Currently there are no 'client' machines, I'm simply backing up 5 shared volumes on a Synology RS2414rp+ (also current DSM) to LTO-6.  I've added the shares via Sources using 'afp://myFileServer.local/Share_Name', and set it to log in using a registered user with admin privileges.  The shares all show up in Sources with the green status indicator.  Backup scripts are set up to run after hours, and I've got a 5-member media set, two blanks, and a cleaning tape in my 8 slot robot.  All Good, right?  Nope.

 

Shortly after a script starts, Retro begins scanning.  Then it hangs and I get a macOS system alert saying the server is shutting down.  After that, my Retro log fills with 'file not found' errors and nothing gets backed up.  Checking the Sources pane after this, the target volume shows as disconnected.  Status goes green again via either Browse button or Locate.

 

Any ideas?

Thanks in advance!

Link to comment
Share on other sites

Doug_M,

In regard to Catalina, you need to read this Knowledge Base article and also this one.  Note that the first KB article, in its last multi-sentence paragraph, erroneously links to the Mojave version of the second KB article rather than the Catalina version—which it should have (except that somebody on Retrospect "Inc."'s. august Documentation Committee didn't coordinate what he wrote in the first article with somebody else who was rewriting the second article for Catalina).

I'd guess your problem is covered under "Technical Details" at the bottom of the second Knowledge Base article I linked to in the first sentence of this post.  That article says Network Volumes are among the "folders" [the article's term—originally less sloppy when the first KB article article was originally written to apply to Mojave] "that require 'Application Data Privacy' ('Full Disk Access') privileges".  A sentence in the first KB article's last section says the same thing.

Link to comment
Share on other sites

Since I'm in testing mode with the latest RS...

I've just backed up a latest-DSM share over AFP, so it's possible. That you are seeing scanning start (and no "Retrospect has detected it is not listed under "Full Disk Access") suggests that FDA is enabled. That RS then cannot find the files it has scanned to then back them up suggests a network disconnect between those two phases -- possibly triggered by whatever is throwing the shutdown alert, though it could be that the alert is the result of the disconnect. What's the actual alert text?

Is your test a manual run, rather than sitting waiting to watch the out-of-hours schedule? What else is the trashcan doing? Any task or power scheduling on the Mac or Synology? Any non-default options/preferences in your script or RS server?

Link to comment
Share on other sites

Thanks for the replies!  Full Disk Access is indeed enabled for both the Engine and Instant Scan, the Synology doesn't sleep, and neither does the trashcan.

These are scheduled scripts, but I have kicked them off manually to test.  The only other thing going on with the Retro server is a time machine backup of the boot drive, to another share on the same fileserver, which I've disabled temporarily for testing.  Everything in the scripts is default.  I've checked the Console, and logs in Catalina are now very different and in this case useless, the logs on the Synology don't show anything unusual... successful connections to the specified shares, but nothing about disconnects.

The maddening thing is that I only get the last 500 lines of the Retro log, and I'd like to see the whole thing.  What I can see is mostly this:

*good path to actual file/filename - can't read, error -1,101 (file/directory not found)

And then the script closes with this:

 2/20/20 7:38:15 PM: Building Snapshot...
    !Can't read state information, error -1,101 (file/directory not found)
    2/20/20 7:38:17 PM: Execution incomplete
    No longer present: 4,117 files, 219.2 GB
    Completed: 10 files, 0 B
    Performance: 0 MB/minute
    Duration: 01:38:14 (01:37:41 idle/loading/preparing)
Link to comment
Share on other sites

One other detail - the only thing between the two units is a network switch.  They live next to each other in the rack.

The Synology and switch are set up for Dynamic Link Aggregation (IEEE 802.3ad LACP) bonding all four ports on the 2414, and functions perfectly in every other scenario.

 

Looking at the 'Past Backups' tab, the Status alert says Backup Interrupted.

Link to comment
Share on other sites

Doug_M,

This is just a stab in the dark, but make sure Instant Scan is Disabled in the Retrospect  Engine Systems Preferences pane per the screenshot at the bottom of page 206 of the Retrospect Mac 16 User's Guide.  This setting is Enabled in my copy of Retrospect Mac 16.1 (I haven't gone to 16.6 out of lack of use of Catalina and abject fear of hurried foul-ups by StorCentric-harried engineers).

FYI Instant Scan is described on page 203 of the UG, but it only has ever worked for local or "client" drives running under Windows or macOS—not any fileservers running under other OSes.  Instant Scan still works for Mac drives formatted with HFS+, so I have it enabled in the Retrospect Engine for three local drives that are mounted on my "backup server" machine.  However Instant Scan's been officially declared by Retrospect "Inc." (see "Known Issues") not to work for Mac drives formatted with the newer and more-glorious APFS, so I've disabled it for my MacBook Pro "client" Source—whose SSD was (despite my best efforts) forcibly converted to APFS when my MBP was upgraded to macOS 10.13 High Sierra.  Maybe 16.6 has a bug enabling server Instant Scan.

 

Link to comment
Share on other sites

Hey David!  I saw that, too.  I've had Instant Scan disabled in System Prefs/Retrospect pane, though it does have Full Disk Access permission.  As the NAS can't run Client in a sanctioned way (I'm not installing the Linux client), I was under the impression that Instant Scan would simply ignore that volume regardless.  I switched it off anyway, with no apparent difference.

 

P.S. @Nigel Smith - the alert is a simple system alert dialog with yellow triangle (!)  "Server is shutting down.  The file server "mySynologyNAS" is shutting down. [OK] 

Link to comment
Share on other sites

Doug_M,

To increase the size of the Retrospect Mac Console Log, see pages 155-156 of the Retrospect Mac 16 User's Guide.  However you may not even need to do that, because the imprecisely-written (doesn't distinguish between the Console Log and the Operations log) paragraph says

Quote

When the log reaches the limit set, the oldest entries are no longer shown, though they remain in the operations_log.utx file stored in/Library/Application Support/Retrospect/, up to the maximum log size specified in General Preferences (see “Log size limit” below).  You can view the operations log by choosing View > Log, or by pressing Cmd-L. Type the length you want for the log in the entry field.

That Operations log—of which the latest portion is copied into the Console log—size limit can be increased per page 157 of the UG.

When you say "the only thing between the two units is a network switch", I assume the two units are your Mac Pro and your  Synology.  You might also consider playing around with "Advanced Settings" on pages 161-163 of the UG.  Or maybe (dare I say it?) switch from AFP to SMB.😲

Link to comment
Share on other sites

On 2/21/2020 at 5:45 PM, Doug_M said:

P.S. @Nigel Smith - the alert is a simple system alert dialog with yellow triangle (!)  "Server is shutting down.  The file server "mySynologyNAS" is shutting down. [OK] 

So it's the Synology that's "shutting down", and not the Mac. Quite a useful data point!

Still sounds like a network problem -- AFP is a lot more resilient than Retrospect, and can retry/recover from a network flutter when RS will just say "Nope, server's gone", because the OS-recovered connection will often be named differently in /Volumes.

Try manually mounting the share on the Mac (rather than adding it to RS as a server and letting RS login) and backing up as a local volume. And as David suggests, unless you have a pressing need for AFP, try using SMB. Also try taking the switch and aggregation out of the equation by running a single direct connection from Synology to trashcan. I'd do this as disk-based backups, so as not to waste time/tape -- maybe also by creating a small-sized "Favourite" on the Synology to save time, then expanding the backup to a full share, then all the shares...

Basically, start as simple as possible and see if it works -- then add complexity until it breaks!

Link to comment
Share on other sites

On 2/24/2020 at 6:19 AM, Nigel Smith said:

Try manually mounting the share on the Mac (rather than adding it to RS as a server and letting RS login) and backing up as a local volume. And as David suggests, unless you have a pressing need for AFP, try using SMB. Also try taking the switch and aggregation out of the equation by running a single direct connection from Synology to trashcan.

I'll give the local volume backup a try.  I've not been using SMB due to name mangling, which is partly due to an smb.conf setting on the NAS, and years of Mac users not following our file/folder naming protocols.  That, coupled with the default performance hit due to signing, is why I'm pursuing AFP.  Additionally, the volumes in question are on our live production server, so testing some of these changes would require more down-time than I can get.

 

Here's what I'm seeing in Console.app during the hang / reconnect:


AFP_VFS afpfs_DoReconnect:  Restoring session /Volumes/Test_Volume
ASP_TCP ReplayPendingReqs: replaying slot 122 with reqID 5967 afpCmd 72 on so 0x62f0e214331bf2ef
ASP_TCP ReplayPendingReqs: replaying slot 123 with reqID 5968 afpCmd 72 on so 0x62f0e214331bf2ef
ASP_TCP ReplayPendingReqs: replaying slot 124 with reqID 5969 afpCmd 72 on so 0x62f0e214331bf2ef
AFP_VFS afpfs_DoReconnect:  get the reconnect token
ASP_TCP CancelOneRequest: cancelling slot 83 error 32 reqID 29780 flags 0x9 afpCmd 0x40 so 0x62f0e214331bf2ef
ASP_TCP Disconnect: triggering reconnect by bumping reconnTrigger from curr value 3993 on so 0x62f0e214331bf2ef
AFP_VFS afpfs_DoReconnect:  GetReconnectToken failed 32 /Volumes/Test_Volume
AFP_VFS afpfs_DoReconnect:  doing reconnect on /Volumes/Test_Volume
AFP_VFS afpfs_DoReconnect:  soft mounted and hidden volume so do not notify KEA for /Volumes/Test_Volume
AFP_VFS afpfs_DoReconnect:  Max reconnect time: 30 secs, Connect timeout: 15 secs for /Volumes/Test_Volume
AFP_VFS afpfs_DoReconnect:  connect to the server /Volumes/Test_Volume
Ignoring source 0x7fdf9641ca40, snapshot is from kernel kernel_task TCP4 flow id 44472 (close) so pkts rx 6 tx 9, bytes 456 845 cell 0 0 wifi 0 0 wired 696 1205 deltas 456 845 0 0 0 0 696 1205
AFP_VFS afpfs_DoReconnect:  Logging in with uam 8 /Volumes/Test_Volume
AFP_VFS afpfs_DoReconnect:  Restoring session /Volumes/Test_Volume

 

These are generated by 'kernel', and repeats several times every second - it's a flood of reconnect attempts.  Funny thing is that everyone here uses AFP, most are running Catalina, and we have no issues with availability except in this instance.

 

And here's what I'm seeing in my Retrospect test script log when it actually makes it through (rare):

 

+  Normal backup using NAS TEST at 2/21/20, 4:13:48 PM (Activity Thread 1)

    To Backup Set TestSet...

    -  2/21/20 4:13:48 PM: Copying Marketing_Sales

    [*] 2/21/20, 4:17:53 PM: Not done after 30s:

    [*]     0:/Volumes/Marketing_Sales/Production for Proposals/2018/JOB 18-1997/01 Accounting/

    [*]     1:/Volumes/Marketing_Sales/Production for Proposals/2018/JOB 18-1997/Proposal/

    [*]     2:/Volumes/Marketing_Sales/Production for Proposals/2018/JOB 18-1997/00 Project Management/

    [*]   > 3:/Volumes/Marketing_Sales/Production for Proposals/2018/JOB 18-2101/00 Project Management/

    [*] 2/21/20, 4:20:55 PM: mvlcProcessScanNodes resumed after 212s

    *File "/Volumes/Marketing_Sales/Production for Proposals/Other Folder”: can't read, error -1,101 (file/directory not found)

    *File "/Volumes/Marketing_Sales/Production for Proposals/Yet Another Folder“: can't read, error -1,101 (file/directory not found)

    2/21/20 4:20:58 PM: Found: 19188 files, 8200 folders, 198.2 GB

    2/21/20 4:20:59 PM: Finished matching

    *File "/Volumes/Marketing_Sales/Production for Proposals/2018/JOB 18-1997/01 Accounting/Invoices/": can't read, error -1101 (file/directory not found)

    and 8,297 others

    2/21/20 4:20:59 PM: Selector "No Files" was used to select 0 files out of 19,188.

    2/21/20 4:20:59 PM: Copying: 0 files (0 bytes) and 0 hard links

    2/21/20 4:21:02 PM: Building Snapshot...

    !Can't read state information, error -1,101 (file/directory not found)

    2/21/20 4:21:02 PM: Execution incomplete

    Duration: 00:07:14 (00:07:12 idle/loading/preparing)

Link to comment
Share on other sites

Doug_M,

You haven't said whether you have a brand-new installation of Retrospect Mac, or whether you've been using it for years and something's changed—including an upgrade to Retrospect Mac 16.

If you recently installed/upgraded to Retrospect Mac 16, I believe you're entitled to 30/45 days of free personalized support.  Here's why and how to file a Support Case for a bug, in which you'll have to disclose those trivial details.  In case you don't realize it, everyone on these Forums is a volunteer; no Retrospect "Inc." employees post here.

 

Link to comment
Share on other sites

Hey David!

It's an old installation, and had been working reliably.  Things got weird when the Retrospect Server machine was updated to Catalina / Retrospect Mac 16 (neither of those updates were my decision, in case you're wondering ;) ).  All of this happened over the New Year break, and I stepped in to sort it out last week.

 

I'm fully aware that there aren't any employees on these boards, but there are some knowledgeable folks lurking about... and since the 30/45 day support thing has expired, I thought I'd try my luck.

Thanks for the links!

Link to comment
Share on other sites

Doug_M,

Upgraded to Retrospect Mac 16 from what version?  As I've said in the third paragraph of this post in another thread, "my impression from several recent Forums threads is that the 'backup server' Engine—which is the same for both variants of Retrospect aside from GUI—lost its ability in version 15 to handle certain old versions of network protocols used by NASes."

Pleas look again at the fifth paragraph in the boilerplate Support Case post I linked to earlier in this thread.  Retrospect Tech Support may no longer give you personalized support, since your installation upgraded more than 45 days ago, but they've probably got a generalized response they can send you if you file a Support Case about the bug—which as I said in that fifth paragraph you're entitled to do. 

Besides, they may take pity on you because the updating was done over the New Year break—probably by ignoramuses from IT management.🤣  To enhance that pity I suggest you contact Retrospect Sales by phone, and threaten to procure another backup application to backup your live production server.  In the meantime, and to enhance the credibility of your threat to Sales, I suggest you put another Mac or Windows machine—running the older version of Retrospect that used to work—into service backing up your production NAS.

 

Link to comment
Share on other sites

On 2/26/2020 at 6:29 PM, Doug_M said:

I've not been using SMB due to name mangling, which is partly due to an smb.conf setting on the NAS, and years of Mac users not following our file/folder naming protocols.

Yep -- that's a pressing need (I'm dealing with similar at the moment, and have had to use the same AFP work-round).

On 2/26/2020 at 6:29 PM, Doug_M said:

Additionally, the volumes in question are on our live production server, so testing some of these changes would require more down-time than I can get.

Shouldn't need much, if any. Volumes definitions, direct mounting, etc, can be be done on the RS server without impacting the Synology.

I'd hoped your aggregated link would be resilient enough that you could simply re-route one cable to the trashcan instead of the switch and then change the network settings, but I've just done a quick test and it looks like you can't just remove a NIC from the LACP bond -- you'll have to delete the Bond then re-create it with three NICs. Maybe 5 minutes of downtime.

Did you start new media sets when you upgraded? If so, it could be that the "old" system was skipping already backed-up files which the new system is now trying to first-time access -- only there's a problem on the Synology. Have you run Data Scrubbing (Storage Manager->Storage Pool->Data Scrubbing) lately? I'd be happier doing that after a complete, successful, backup, but needs must when the devil drives...

Link to comment
Share on other sites

  • 4 months later...

Resurrecting this old thread - our office was hit by a tornado just after midnight March 2.  Many temporary workarounds have been in use and we have moved.  Just getting the hardware back online - new switch, new Synology box, same data and shares - and (perhaps unsurprisingly) the problem persists.

I've been running a test set for troubleshooting.  Retrospect is latest v16 (haven't looked at upgrading to 17 yet), trashcan is on Catalina 10.15.6, Synology DSM is current, data scrubbing will occur quarterly and was last run after the transfer to the new NAS.

This really feels like a network stack bug or similar, but only happens with Retrospect.  Back to the logs!

Link to comment
Share on other sites

On 7/28/2020 at 8:01 PM, Doug_M said:

Resurrecting this old thread - our office was hit by a tornado just after midnight March 2.

As excuses go, that's definitely in the top three! I hope all the important things -- the people! -- were undamaged.

On 7/28/2020 at 8:01 PM, Doug_M said:

This really feels like a network stack bug or similar, but only happens with Retrospect.

Have you tried mounting the share on the Mac (still using AFP) and treating it as a local volume, rather than letting RS do it?

Also this thread throws up an interesting suggestion -- use the IP address rather than relying on Bonjour resolution (static IP on the server an advantage, obv).

Link to comment
Share on other sites

Hey Nigel,

Fortunately, it was after midnight when the tornado went through and was only property damage.  We fared much better than most of our neighbors, though.

I did try mounting the share first, but it still showed as such in the Sources pane.  I've been using the IP address for these connections, which doesn't seem to improve my situation.  I tried using SMB again, and that seemed to get further into the process, but there are still major issues with name mangling and files not being found after the initial scan.  I get the impression that SMB is functioning more robustly than AFP, but until I can resolve the mangling issue it's a non-starter.

Link to comment
Share on other sites

44 minutes ago, Doug_M said:

I did try mounting the share first, but it still showed as such in the Sources pane.

It'll still show as a share (because it is!) but you should see a difference in the "Name" and "Machine" columns of the Shares list -- a share mounted via the server's OS will show the volume name in "Name" and the server's name in "Machine", a share mounted by RS will show "user@IP_or_FQDN/share" and either IP, FQDN or name.local depending on how you addressed it.

We're trying to make sure that if you are using SMB you are doing so via OS X's smbd -- I don't know if RS uses that or its own routines, but I do know that OS X's smbd does lots of trickery to encode/decode funky Mac filenames for use on less supportive SMB shares.

Have you got any examples of filenames that are failing? I can try and replicate your problem, albeit using RS17 rather than 16, if you think it might help.

Link to comment
Share on other sites

I gotcha.  Okay, so I'm getting "Share_Name' for Name, and the server's IP address under Machine.  So it looks like OS-level mount.
Side note:  I get 'Unknown' for the shares in the OS column.

Here's an example folder name that gets mangled:
AFP - /Animations (Do we need to rerender smaller?)/
SMB - /AVFA3C~5/

I'm not entirely certain that the mangling is the reason for the -1101 errors though.

EDIT:  In the logs, the only files that were tagged as missing were in some folder with a mangled name (we have many).  FWIW.

Link to comment
Share on other sites

10 hours ago, Doug_M said:

Here's an example folder name that gets mangled:
AFP - /Animations (Do we need to rerender smaller?)/
SMB - /AVFA3C~5/

SMB is using the Windows short name because the name isn't being encoded for compatibility by the OS X SMB daemon -- and that's because the data is being put on the server by AFP clients. The permanent fix to this is:

  1. Lock out all your users
  2. On Mac Pro, mount the share using "afp://IP_address/sharename"
  3. Still on the Pro, mount the share again using "smb://IP_address/sharename"
  4. Move the troublesome data from the AFP-share window to the SMB-share window
  5. Turn off AFP on the Synology and force your users to SMB from now on, and maybe look at how to turn off signing in SMB config

Step 4's the tricky bit -- set the windows to different views so you can keep track of which is which and decide how you are going to manage the transfer wrt clashing names, deletion of "AFP versions" when complete, etc. I'd probably do one folder at a time, copy that into a folder at the same level named "Transfer", delete the original, move the contents of "Transfer" to where the original was.

(All the above just tested and confirmed with Catalina, an up-to-date Synolgy, a folder on the Mac named as yours with contents created by 

echo "Here's some text in a file" | tee "Animation (smaller)?"{0..9}.txt > /dev/null

...with an additional "testFile.txt" file to make sure "normal" files were also OK, and then copied to the Syn using AFP.)

Or you could insist that everyone stays using, and only using, AFP, and pray it remains supported for as long as you need it...

11 hours ago, Doug_M said:

Okay, so I'm getting "Share_Name' for Name, and the server's IP address under Machine.  So it looks like OS-level mount.

I may have given you a bum steer -- it looks like (for AFP and SMB at least) RS is using the system APIs to mount shares. If I turn the RS engine off and restart, my server shows no network mounts in /Volumes. If I turn on the RS engine the shares connected to via RS's Sources then appear in /Volumes, and if I add another share there it appears in /Volumes too. RS adds them as root/wheel while mounting via the Finder (non-root account) adds them as admin/staff -- entirely reasonable.

Both work as expected from above -- SMB-mounted shows AFP-added files by short name and SMB-added files by full name complete with non-standard characters. AFP-mounted shows both versions with their full names. Both mount methods back up all files without error, so it doesn't look like the filenames are causing your errors in and of themselves.

I still think you are losing connection with the server for some (unrelated to filenames) reason. As said before, RS backup-in-progress is much more sensitive to temporary disconnects than "normal" file sharing is, so it may be that your other users simply don't notice when it happens. It sounds like you have some time for testing -- if so I'd suggest you go back to scratch. Static IP on just one Syn interface, static on Mac Pro, direct cable connection between the two, does the problem still occur? If not, slowly add complexity until it does.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
×
×
  • Create New...