Regression #14181

closed

``mmcsd0`` controller timeout/system hang on 1100

Added by Craig Leres over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: Operating System
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Release Notes: Default
Affected Plus Version: 23.01
Affected Architecture: SG-1100

Description

Several times since upgrading to 23.05 and later reinstalling to switch to a ZFS root, my SG-1100 has glitched and lost the ability to talk to the filesystem. I found similar reports, but they were about SD cards, where the recommendation was to disable them. That obviously doesn't help in this case, since we're talking about the boot device.

I don't remember seeing this prior to 23.05/zfs. Anybody else experiencing this? Am I looking at a hardware or OS issue?

I've attached two serial console stack traces.


Files

1.log (1.52 KB) 1.log Craig Leres, 03/25/2023 12:20 PM
2.log (1.31 KB) 2.log Craig Leres, 03/25/2023 12:20 PM
3.log (1.18 KB) 3.log Craig Leres, 03/25/2023 02:25 PM

Related issues

Has duplicate Regression #14300: Re: ``mmcsd0`` controller timeout/system hang on 1100 (Duplicate)

Actions #1

Updated by Craig Leres over 1 year ago

Craig Leres wrote:

I've attached two serial console stack traces.

Here's one more crash from a few minutes ago; that's the second in about 12 hours. I guess I'll swap to my spare SG-1100...

Actions #2

Updated by Craig Leres over 1 year ago

Oops, I'm actually running 23.01.

Actions #3

Updated by Kris Phillips over 1 year ago

I haven't seen this with any other firewalls or on my personal Netgate 1100. I suspect you might have a faulty eMMC that is starting to go. Do you have the same issue on UFS?

Actions #4

Updated by Craig Leres over 1 year ago

Well I'm running on a completely different SG-1100 now so I'll wait and see if the problem reoccurs before the next version.

Actions #5

Updated by Jim Pingle over 1 year ago

  • Tracker changed from Bug to Regression
  • Subject changed from mmcsd0 controller timeout/system hang to ``mmcsd0`` controller timeout/system hang on 1100, possibly hardware related
  • Affected Plus Version changed from 23.05 to 23.01

I have seen the same thing on my 1100 but given the timing (could be hours, days, or even weeks between timeouts) it feels more like hardware to me. I'm also using ZFS here. I have mine on a PDU I can control remotely so I just power it off and back on and it will run again for another period of time. I can't 100% say it's hardware but that seems most likely given that some people never see this and a handful of other people see it at seemingly random intervals.

This may be a coincidence, but I haven't seen a timeout in the last two weeks or so. The main recent things I did were a complete wipe+reload and running zpool upgrade -a. Both were done as part of testing unrelated things internally, so it's possible it's a complete coincidence. As long as you have backups and a recovery image handy, that should be safe to try to see if it makes a difference. I don't see why it would, but there isn't anything to lose by trying it.

Even though it's potentially hardware-related I'll leave this open for a bit in case the other devs notice a pattern that might indicate it's driver related instead.

Actions #6

Updated by Craig Leres over 1 year ago

I found mmc-utils but I'm not sure if it can tell me about the health of the flash. What else can I do to test it? I remember from the days before SMART when the best you could do was run dd if=/dev/da0 of=/dev/null and watch the kernel messages for errors.

I see the chip is a SanDisk SDIN7DP2-8G and that it's no longer available. I have zero experience with BGA but I've got a nice hot air rework station and it looks like it would only cost me $10 to try swapping in a modern version...
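
For anyone who wants to try the dd-style read test mentioned above, here is a minimal sketch of the same idea in Python. It only reads the device sequentially and records the offsets where the kernel returns an I/O error; it does not report flash wear or health. The function name read_scan and its parameters are hypothetical, and the device path in the usage note (/dev/mmcsd0) is assumed from the issue title. A controller timeout that hangs the system, as described in this issue, may never surface as an error here.

```python
import errno
import os


def read_scan(path, chunk_size=1024 * 1024, max_errors=16):
    """Read `path` sequentially and return (bytes_read, failed_offsets).

    Hypothetical helper: a read-only pass similar in spirit to
    `dd if=<device> of=/dev/null`. It never writes to the device.
    """
    bytes_read = 0
    failed_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while len(failed_offsets) < max_errors:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise  # permission problems etc. are not media errors
                # Record the failing offset and skip ahead one chunk.
                failed_offsets.append(offset)
                offset += chunk_size
                continue
            if not data:
                break  # end of device or file
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, failed_offsets
```

Run as root, something like print(read_scan("/dev/mmcsd0")) would print the byte count and any failing offsets. The max_errors cap is there so a device that fails every read does not loop forever. Watching the serial console or kernel messages during the pass is still worthwhile.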

Actions #7

Updated by Jim Pingle over 1 year ago

For what it's worth I still have not seen a timeout again on mine, but I've been running 23.05 snapshots. It's been up for 9 days straight now. If this is a driver issue it's possible that it's been fixed upstream since 23.05 snapshots contain much more recent code from FreeBSD 14 than 23.01 has.

Actions #8

Updated by Jim Pingle over 1 year ago

  • Subject changed from ``mmcsd0`` controller timeout/system hang on 1100, possibly hardware related to ``mmcsd0`` controller timeout/system hang on 1100
  • Status changed from New to Closed
  • Assignee set to Jim Pingle

Another update after another 2 weeks on 23.05 with my 1100, still have yet to see another timeout. It was happening fairly regularly on 23.01 on this same device so I'm hoping that does mean it was a driver issue addressed upstream.

At this point, if it is a driver issue it's been solved, and if it's hardware it's not something we can fix with an update, so either way there is nothing actionable to keep this issue open and waiting on.

We'll have public 23.05 snapshots coming soon in advance of the release. When they are available, you can upgrade and see if things stabilize for you as well.

Actions #9

Updated by Jim Pingle over 1 year ago

  • Has duplicate Regression #14300: Re: ``mmcsd0`` controller timeout/system hang on 1100 added