Project

General

Profile

Bug #6340

fsck hangs boot in background, fails to produce any action, resulting in broken firewall

Added by Kill Bill over 1 year ago. Updated 2 months ago.

Status:
Resolved
Priority:
High
Category:
Operating System
Target version:
Start date:
05/10/2016
Due date:
% Done:

0%

Affected Version:
2.2.x
Affected Architecture:
All

Description

Hi, guys! (Yeah, still alive, eventually might get back to fixing something in pfSense. :-P)

Meanwhile - this fsck thing completely fails yet again... After a power failure (AMD64 full install, SSD HDD, 2.3.1 development branch, though that's very much irrelevant):

Starting syslog...done.
Starting Secure Shell Services...done.
Setting up polling defaults...done.
Setting up interfaces microcode...done.
Configuring loopback interface...done.
Creating wireless clone interfaces...done.
Configuring LAGG interfaces...done.
Configuring VLAN interfaces...done.
Configuring QinQ interfaces...done.
Configuring WAN interface...done.
Configuring LAN interface...done.
Configuring MODEMACCESS interface...done.
Configuring HEIPV6 interface...done.
Configuring CARP settings...done.
Syncing OpenVPN settings...done.

Here it hangs. Forever, until you press CTRL+C and type exit. After that:

^CEnter full pathname of shell or RETURN for /bin/sh:
# exit
Configuring crash dumps...
Using /dev/label/swap0 for dump device.
/dev/ufsid/567c0b01685c651a: NO WRITE ACCESS
/dev/ufsid/567c0b01685c651a: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Preen mode recommended running a check that will be performed now.
WARNING: Trying to recover filesystem from inconsistency...
** /dev/ufsid/567c0b01685c651a (NO WRITE)

USE JOURNAL? no

** Skipping journal, falling through to full fsck

SETTING DIRTY FLAG IN READ_ONLY MODE

UNEXPECTED SOFT UPDATE INCONSISTENCY
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE I=5377154  OWNER=root MODE=100666
SIZE=0 MTIME=May 10 00:17 2016
CLEAR? no

** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? no

SUMMARY INFORMATION BAD
SALVAGE? no

BLK(S) MISSING IN BIT MAPS
SALVAGE? no

324518 files, 886602 used, 12219169 free (8033 frags, 1526392 blocks, 0.1% fragmentation)

This will go on and on and on after every reboot. Broken connectivity, services not running, generally a huge mess. Unless you make the thing actually do something:

# touch /root/force_fsck
# reboot

After that:

Configuring crash dumps...
Using /dev/label/swap0 for dump device.
Forcing filesystem(s) check...
** /dev/ufsid/567c0b01685c651a

USE JOURNAL? yes

** SU+J Recovering /dev/ufsid/567c0b01685c651a
** Reading 33554432 byte journal from inode 4.

RECOVER? yes

** Building recovery table.
** Resolving unreferenced inode list.
** Processing journal entries.

WRITE CHANGES? yes

** 560 journal records in 50688 bytes for 35.35% utilization
** Freed 0 inodes (0 dirs) 89 blocks, and 0 frags.

***** FILE SYSTEM MARKED CLEAN *****
/dev/ufsid/567c0b01685c651a: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/ufsid/567c0b01685c651a: clean, 12229386 free (8034 frags, 1527669 blocks, 0.1% fragmentation)
Filesystems are clean, continuing...
Mounting filesystems...

Normal boot follows; firewall working, routing working, packages started, sanity restored.

     ___
 ___/ f \
/ p \___/ Sense
\___/   \
    \___/

Welcome to pfSense 2.3.1-DEVELOPMENT on the 'pfSense' platform...

No core dumps found.
Creating symlinks......ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/local/lib /usr/local/lib/ipsec /usr/local/lib/mysql /usr/local/lib/nss
32-bit compatibility ldconfig path: /usr/lib32
done.
External config loader 1.0 is now starting... ada0s1 ada0s1aw starting... ada0s1bw starting...
Launching the init system....... done.
Initializing.................. done.
Starting device manager (devd)...done.
Loading configuration......done.
Updating configuration...done.
Cleaning backup cache.................................done.
Setting up extended sysctls...done.
amdtemp0: <AMD CPU On-Die Thermal Sensors> on hostb3
Setting timezone...done.
Configuring loopback interface...done.
Starting syslog...done.
Starting Secure Shell Services...done.
Setting up polling defaults...done.
Setting up interfaces microcode...done.
Configuring loopback interface...done.
Creating wireless clone interfaces...done.
Configuring LAGG interfaces...done.
Configuring VLAN interfaces...done.
Configuring QinQ interfaces...done.
Configuring WAN interface...done.
Configuring LAN interface...done.
Configuring MODEMACCESS interface...done.
Configuring HEIPV6 interface...done.
Configuring CARP settings...done.
Syncing OpenVPN settings...done.
Configuring firewall..DUMMYNET 0 with IPv6 initialized (100409)
load_dn_sched dn_sched FIFO loaded
load_dn_sched dn_sched QFQ loaded
load_dn_sched dn_sched RR loaded
load_dn_sched dn_sched WF2Q+ loaded
load_dn_sched dn_sched PRIO loaded
...0 addresses deleted.
.done.
Starting PFLOG...done.
Setting up gateway monitors...done.
Synchronizing user settings...
done.
Starting webConfigurator...done.
Configuring CRON...done.
Starting DNS Resolver...done.
Starting NTP time client...done.
Configuring firewall.....0 addresses deleted.
0 addresses deleted.
.done.
Configuring IPsec VPN... done
Generating RRD graphs...done.
Starting syslog...done.
Starting CRON... done.
...
pfSense (pfSense) 2.3.1-DEVELOPMENT amd64 Mon May 09 12:21:57 CDT 2016
Bootup complete

Executive summary: UFS + its wannabe fsck -> utter POS.

fsck_test.diff Magnifier (1.43 KB) Renato Botelho, 05/10/2016 02:46 PM

pfsenseStability.csv Magnifier (5.49 KB) brennen smith, 06/05/2017 11:38 PM

Associated revisions

Revision cc82c328
Added by Renato Botelho 10 months ago

Ticket #6340:

- Stop misusing fsck F parameter, it's supposed to be used when you
plan to run background fsck after filesystems are mounted, what is
not the case on pfSense
Increase attempts to mount all filesystems as read-write to 10
- If we cannot mount filesystems as read-write, start a recovery shell
and after it finishes, reboot system

Revision cbafa84f
Added by Renato Botelho 10 months ago

Ticket #6340:

- Stop misusing fsck F parameter, it's supposed to be used when you
plan to run background fsck after filesystems are mounted, what is
not the case on pfSense
Increase attempts to mount all filesystems as read-write to 10
- If we cannot mount filesystems as read-write, start a recovery shell
and after it finishes, reboot system

History

#1 Updated by Renato Botelho over 1 year ago

  • Status changed from New to Confirmed
  • Assignee set to Renato Botelho
  • Target version set to 2.3.1
  • Affected Version changed from 2.3.1 to 2.3

#2 Updated by Jim Pingle over 1 year ago

He's alive! :-)

This sounds like a different failure mode with potentially the same root cause as #5592, which appears to be happening much more often on recent releases than in the past.

This is the first time I've heard of it hanging, though. The usual failure mode is that something panics/crashes (or in some cases, a loss of power), then fsck checks on the next boot and marks it clean despite the filesystem still having errors. After it passes that point, it panics when encountering a bad directory on the disk, which reboots it again, fsck checks, marks it clean but still doesn't find the real problem, so it panics again, lather, rinse, repeat. Though we know it happens, we have not yet found a means to reproduce it on demand.

In those cases the only way around it is to boot single user mode and run fsck repeatedly until it no longer finds any problems. That's hard to do programmatically since fsck appears to exit without an error code when it finds and fixes problems, so it's hard to know it needs a repeat. And running fsck multiple times on each boot would unnecessarily slow down large disks / slower media installs.

#3 Updated by Kill Bill over 1 year ago

Jim Pingle wrote:

He's alive! :-)

Yeah, indeed... :-)

This sounds like a different failure mode with potentially the same root cause as #5592, which appears to be happening much more often on recent releases than in the past.

This is the first time I've heard of it hanging, though. The usual failure mode is that something panics/crashes (or in some cases, a loss of power), then fsck checks on the next boot and marks it clean despite the filesystem still having errors. After it passes that point, it panics when encountering a bad directory on the disk, which reboots it again, fsck checks, marks it clean but still doesn't find the real problem, so it panics again, lather, rinse, repeat. Though we know it happens, we have not yet found a means to reproduce it on demand.

Yeah. It hangs. Now, as I have discovered after whole lot of messing, it hangs reliably on every second reboot, no matter the troubleshooting steps above. It's repeatable with journaling turned off, it's repeatable with SU+J turned off.

In those cases the only way around it is to boot single user mode and run fsck repeatedly until it no longer finds any problems. That's hard to do programmatically since fsck appears to exit without an error code when it finds and fixes problems, so it's hard to know it needs a repeat. And running fsck multiple times on each boot would unnecessarily slow down large disks / slower media installs.

When I boot to single user mode, the fsck finds nothing. No breakage. No matter how many times I try. Then, I reboot and end up with

Using /dev/label/swap0 for dump device.
/dev/ufsid/567c0b01685c651a: NO WRITE ACCESS
/dev/ufsid/567c0b01685c651a: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Preen mode recommended running a check that will be performed now.
WARNING: Trying to recover filesystem from inconsistency...

once again.

For now, I'd appreciate some advise how to make the absolutely broken fsck thing totally no-op. Without it, I'm very much stuck with reinstalling the entire system, because the filesystem check is just completely FUBARed. :-( It fixes nothing, it just causes damage.

#4 Updated by Kill Bill over 1 year ago

Kill Bill wrote:

it hangs reliably on every second reboot, no matter the troubleshooting steps above. It's repeatable with journaling turned off, it's repeatable with SU+J turned off.

So I thought to add

/usr/bin/touch /root/force_fsck

as shellcmd. Does not help. The only way to hack around this seems to be the CTRL+C and exit command. With that, the fsck produces

^CEnter full pathname of shell or RETURN for /bin/sh:
# exit
Configuring crash dumps...
Using /dev/label/swap0 for dump device.
/dev/ufsid/567c0b01685c651a: NO WRITE ACCESS
/dev/ufsid/567c0b01685c651a: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Preen mode recommended running a check that will be performed now.
WARNING: Trying to recover filesystem from inconsistency...
** /dev/ufsid/567c0b01685c651a (NO WRITE)
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE I=5377154  OWNER=root MODE=100666
SIZE=0 MTIME=May 10 18:00 2016
CLEAR? no

** Phase 5 - Check Cyl groups
324509 files, 867999 used, 12237772 free (6844 frags, 1528866 blocks, 0.1% fragmentation)

and on next reboot it finds no breakage and it works. Now, when I do

# touch /root/force_fsck
# reboot

instead, on next boot I get

Forcing filesystem(s) check...
** /dev/ufsid/567c0b01685c651a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
324522 files, 868015 used, 12237756 free (6828 frags, 1528866 blocks, 0.1% fragmentation)

***** FILE SYSTEM IS CLEAN *****
/dev/ufsid/567c0b01685c651a: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/ufsid/567c0b01685c651a: clean, 12237756 free (6828 frags, 1528866 blocks, 0.1% fragmentation)
Filesystems are clean, continuing...
Mounting filesystems...

No problem is found, yet it hangs in exactly the same place, producing exactly the same fsck error as posted above after the CTRL+C; exit drill.

UNREF FILE I=5377154  OWNER=root MODE=100666

Just the MTIME of that file gets updated across reboots. This is just absurd...

- fsck like this is not usable at all.
- how do I find which file is using that inode?
- what's the difference between the initial fsck that fails to find anything and the latter one that produces this whole fsck-up!?

Sigh. :-(

#5 Updated by Renato Botelho over 1 year ago

So you have SUJ off, but you have Soft Updates enabled, right? Did you disabled SUJ for any reason or is it an old installation from the time pfSense didn't set it yet?

I'm running a script to attempt to reproduce it on local VM, as soon as I can fully reproduce it I will be able to test a solution. I did a patch to improve a bit the way it's done today, can you try it out and let me know the result of DEBUG lines?

#6 Updated by Jim Thompson over 1 year ago

Renato Botelho wrote:

So you have SUJ off, but you have Soft Updates enabled, right? Did you disabled SUJ for any reason or is it an old installation from the time pfSense didn't set it yet?

If Jakub turned off SUJ, this is a mistake.

#7 Updated by Kill Bill over 1 year ago

Renato Botelho wrote:

So you have SUJ off, but you have Soft Updates enabled, right? Did you disabled SUJ for any reason or is it an old installation from the time pfSense didn't set it yet?

No, this was not any old upgraded install. Happened on a fresh 2.3 full install with SU+J being turned on from the very start. Was done just for testing purposes after lots of attempts to make the damned fsck thing do what's it supposed to do to no avail. Turned off just the journal, made no difference, turned both off, made no difference, turned both back on, made again no difference.

Here's the debug output with the patch applied:

^CEnter full pathname of shell or RETURN for /bin/sh:
# exit
Configuring crash dumps...
Using /dev/label/swap0 for dump device.
/dev/ufsid/567c0b01685c651a: NO WRITE ACCESS
/dev/ufsid/567c0b01685c651a: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Preen mode recommended running a check that will be performed now.
WARNING: Trying to recover filesystem from inconsistency...
** /dev/ufsid/567c0b01685c651a (NO WRITE)

USE JOURNAL? no

** Skipping journal, falling through to full fsck

SETTING DIRTY FLAG IN READ_ONLY MODE

UNEXPECTED SOFT UPDATE INCONSISTENCY
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
UNREF DIR  I=5136450  OWNER=root MODE=40755
SIZE=512 MTIME=May 10 22:27 2016
RECONNECT? no

** Phase 4 - Check Reference Counts
LINK COUNT DIR I=5136450  OWNER=root MODE=40755
SIZE=512 MTIME=May 10 22:27 2016  COUNT 2 SHOULD BE 1
ADJUST? no

UNREF FILE I=5377164  OWNER=root MODE=100666
SIZE=0 MTIME=May 10 22:27 2016
CLEAR? no

** Phase 5 - Check Cyl groups
324510 files, 876310 used, 12229461 free (6853 frags, 1527826 blocks, 0.1% fragmentation)
DEBUG: Run #1 - rc = 0
/dev/ufsid/567c0b01685c651a: NO WRITE ACCESS
/dev/ufsid/567c0b01685c651a: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
DEBUG: (-p) #1 - rc = 8
** /dev/ufsid/567c0b01685c651a (NO WRITE)

USE JOURNAL? no

** Skipping journal, falling through to full fsck

SETTING DIRTY FLAG IN READ_ONLY MODE

UNEXPECTED SOFT UPDATE INCONSISTENCY
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE I=5377164  OWNER=root MODE=100666
SIZE=0 MTIME=May 10 22:27 2016
CLEAR? no

** Phase 5 - Check Cyl groups
324510 files, 876310 used, 12229461 free (6853 frags, 1527826 blocks, 0.1% fragmentation)
DEBUG: Run #2 - rc = 0
/dev/ufsid/567c0b01685c651a: NO WRITE ACCESS
/dev/ufsid/567c0b01685c651a: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
DEBUG: (-p) #2 - rc = 8
** /dev/ufsid/567c0b01685c651a (NO WRITE)

USE JOURNAL? no

** Skipping journal, falling through to full fsck

SETTING DIRTY FLAG IN READ_ONLY MODE

UNEXPECTED SOFT UPDATE INCONSISTENCY
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE I=5377164  OWNER=root MODE=100666
SIZE=0 MTIME=May 10 22:27 2016
CLEAR? no

** Phase 5 - Check Cyl groups
324510 files, 876310 used, 12229461 free (6853 frags, 1527826 blocks, 0.1% fragmentation)
DEBUG: Run #3 - rc = 0
/dev/ufsid/567c0b01685c651a: NO WRITE ACCESS
/dev/ufsid/567c0b01685c651a: UN.
DEBUG: (-p) #3 - rc = 8
Automatic filesystem recovery failed. Starting recovery shell!

Don't think the fsck will ever produce any useful fixes, keeping the b0rked system just for debugging now, will reinstall soon.

#8 Updated by Jim Thompson over 1 year ago

Kill Bill wrote:

No, this was not any old upgraded install. Happened on a fresh 2.3 full install with SU+J being turned on from the very start.

We did an enormous amount of testing and came to a stable solution.

Running outside that solution is not recommended.

#9 Updated by Kill Bill over 1 year ago

Yeah, I understand it's not recommended or supported or whatever, however the stable solution (which was in place when the damage happened) does not work here apparently. The only stable thing for me was ZFS - which, sadly, ain't doable any more with 2.3

Anyway, after a couple more fsck runs in single user mode, it looks like the thing is actually producing just additional damage instead of fixing things; this fsck is just absurdly broken and completely unreliable. How's this possible with such mature systems like FreeBSD goes beyond me. :/

#10 Updated by Jan Jurkus over 1 year ago

Kill Bill wrote:

The only stable thing for me was ZFS - which, sadly, ain't doable any more with 2.3

Why wouldn't this be possible anymore? I have pfSense running on a ZFS pool with ashift=12. The trick is to create a ZFS pool in FreeBSD, and import that while manually installing pfSense. Or, also possible, create a ZFS pool in FreeBSD, and copy all pfSense files from the pfSense iso mounted on a different virtual cd drive (ipmi can be very handy). It's probably not ideal, but at least I kinda knew what I was doing. (with emphasis on kinda)

#11 Updated by Chris Buechler over 1 year ago

doktornotor: welcome back. What's your wireless config like on there? What type of card and config. Not that it should matter, but piecing together some of the reports of this, it seems like the systems that can most readily replicate have wireless cards.

Jan Jurkus wrote:

Kill Bill wrote:

The only stable thing for me was ZFS - which, sadly, ain't doable any more with 2.3

Why wouldn't this be possible anymore?

He's referring to the web installer, which was removed in one of the 2.2.x versions.

#12 Updated by Chris Buechler over 1 year ago

  • Affected Version changed from 2.3 to 2.2.x

I struck out on attempting with various wireless configs in a few dozen power cycles. Also tried writing out a slew of files and directories and pulling power in the middle.

#!/bin/sh

rm -rf /var/dirs

for i in `seq 1 1000`
do
        DIR=`cat /dev/urandom | tr -dc 'a-zA-Z0-9' | head -c 10`
        echo mkdir -p /var/dirs/$DIR
        mkdir -p /var/dirs/$DIR
        echo `cat /dev/urandom | tr -dc 'a-zA-Z0-9' | head -c 1000` > /var/dirs/$DIR/$DIR
        sync
done

without the sync in there you end up with none of the files or directories on bootup post-power cycle.

Tried same script as that, with a forced kernel panic at the end (sysctl debug.kdb.panic=1) as it's seemed at times that makes it unrecoverable eventually. No luck yet at least.

Two systems running in the power cycle test rig again. One with the above shell script in startup with power loss somewhere in the middle of the 1000 loops, and a second system running same script with 100 tries and a forced kernel panic with the power cycle coming shortly after the panic.

I've been through upwards of a couple hundred power cycles in the past 12 hours across a range of hardware with 0 failures. 2.3 went through several thousand power cycles in the middle of write_config in a loop with 0 failures.

There are definitely issues here, somewhat along the lines of #5922. But I don't think this is anything worse than we've had on any FreeBSD 10.x base version.

A few dozen power cycles an hour in progress.

affected: 2.2.x+

#13 Updated by Jan Jurkus over 1 year ago

Chris Buechler wrote:

He's referring to the web installer, which was removed in one of the 2.2.x versions.

Ah, that's why.

When talking about power failures it might be an issue of the used hardware. Read for instance this test about SSD and power losses:
http://lkcl.net/reports/ssd_analysis.html
It might be a contributing factor here.

#14 Updated by Kill Bill over 1 year ago

Chris Buechler wrote:

doktornotor: welcome back. What's your wireless config like on there? What type of card and config. Not that it should matter, but piecing together some of the reports of this, it seems like the systems that can most readily replicate have wireless cards.

There is no wireless card in there. Onboard WAN (PPPoE plus IPv6 tunnel) and Gbit Intel LAN, that's all.

Jan Jurkus wrote:

Read for instance this test about SSD and power losses:
http://lkcl.net/reports/ssd_analysis.html
It might be a contributing factor here.

The SSD is Transcend SSD370S. Seriously, the main issues I can see here:

- the system works just fine with absolutely no errors when it actually boots. So, a minor filesystem issue is breaking the entire system due to the fsck stupidity.
- UFS's fsck is just utterly useless; its just random result generator, at best. Fails to find errors, fixes nothing, causes additional breakage. When you have a garbage fsck like this, you either need to skip it altogether, or switch to a different filesystem.

#15 Updated by Chris Buechler over 1 year ago

  • Target version changed from 2.3.1 to 2.3.2

1000+ power cycles later in above described case, no issues. This isn't any worse than it's been in any 10.x base version and we're short on time to make 2.3.1 for any changes here.

Kill Bill: since you seem to be able to look at it crosseyed and it does what I can't get it to do in many thousands of power cycles. If you dd an affected disk to img (boot a live USB of some sort, or plug disk into a diff box), and boot that img in a VM, does the resulting VM have the same issue?

I got a slow CF card into a bad situation like this a few months back, dd'ed it to an img, and booted it in KVM and bhyve. Both of the VMs ran fsck normally at boot and fixed the disk automatically, no problem. The actual CF, when booted, would kernel panic after fsck ran and marked the disk as clean. But boot it to single user, run fsck 3-4 times, and it works. So in that case at least, somehow an img of the CF behaved differently from the actual CF, yet it was a problem fsck could fix if run enough times.

What we really need is an image of a disk or snapshot of a VM that's in a bad state. It's way too difficult to replicate to be able to test changes otherwise.

#16 Updated by Kill Bill over 1 year ago

All this took to cause apparently fatal damage to the "robust" UFS was a single power failure. Since the SSD is just 64GiB and most of it is empty, I might be able to upload the image somewhere, eventually.

Kindly provide a choice of filesystem on install - every single Linux distro can do that, why's that mission impossible here? My trust in UFS has reached a freezing point. And the trust in the fsck is at absolute zero. Seriously noone upstream cares about fixing this piece of utter garbage? Bugs filed and rotting there for years and years.

#17 Updated by Chris Buechler over 1 year ago

  • Target version changed from 2.3.2 to 2.3.1

Kill Bill wrote:

Kindly provide a choice of filesystem on install - every single Linux distro can do that, why's that mission impossible here?

You've never looked at the bsdinstaller code I take it, or you wouldn't ask. ;) Replacing the installer is on the road map, and ZFS will be available at that time.

Seriously noone upstream cares about fixing this piece of utter garbage? Bugs filed and rotting there for years and years.

Have to be able to replicate it to fix it, and it's not easily replicable at all. I have no doubt once we get a replicable circumstance or can provide an affected disk image, we'll be able to get someone's attention.

#18 Updated by Chris Buechler over 1 year ago

  • Target version changed from 2.3.1 to 2.3.2

#19 Updated by Chris Buechler over 1 year ago

  • Target version changed from 2.3.2 to 2.4.0

#20 Updated by Kill Bill about 1 year ago

And here we go again: I'm seeing this on 3 boxes already:

Crash report details:

PHP Errors:
[02-Oct-2016 20:23:40 Europe/Prague] PHP Warning:  Invalid argument supplied for foreach() in /etc/inc/auth.inc on line 341
[02-Oct-2016 20:23:40 Europe/Prague] PHP Stack trace:
[02-Oct-2016 20:23:40 Europe/Prague] PHP   1. {main}() /etc/rc.bootup:0
[02-Oct-2016 20:23:40 Europe/Prague] PHP   2. local_sync_accounts() /etc/rc.bootup:307
[02-Oct-2016 20:23:40 Europe/Prague] PHP   3. local_user_set() /etc/inc/auth.inc:468
[02-Oct-2016 20:23:40 Europe/Prague] PHP   4. userHasPrivilege() /etc/inc/auth.inc:508
[02-Oct-2016 20:23:40 Europe/Prague] PHP   5. get_user_privileges() /etc/inc/auth.inc:357

Every time I see this, there's a hidden, totally undetected filesystem corruption.

/dev/ufsid/571632113ba7d395: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/ufsid/571632113ba7d395: clean, 5224247 free (4607 frags, 652455 blocks, 0.1% fragmentation)
Filesystems are clean, continuing...
Mounting filesystems...
Disabling APM on /dev/ad4

Sigh. Is there some usable filesystem planned for 2.4?

#21 Updated by Kill Bill about 1 year ago

In case you planned on asking whether I tried to run fsck manually, then yeah, I did on one local box - and the outcome was totally identical to that in https://redmine.pfsense.org/issues/6340#note-7 - it resulted in complete filesystem destruction. So no, I do not plan to ever test this on other boxes - they are remote ones and I have better things to do than travelling there to replace the SSDs with freshly installed ones.

#22 Updated by Derek Jackson about 1 year ago

Kill Bill,

We have been effected by this issue as we have boxes all around America running that are bricking at random.

Can you tell me if there is a fix in any of the new versions of updates out or planned soon?

#23 Updated by Kill Bill about 1 year ago

Derek, I cannot see any fix until the installer brings back the possibility to install on ZFS, and/or use a time-proven Linux filesystem (such as ext2). UFS, as far as I am concerned, is a completely broken thing I don't want to ever see anywhere in production again. I would advise to use an UPS but at least 2/3 of the affected boxes here are running on UPS already which lasts for a couple of hours. The longest blackout seen was well below one hour, so there's basically never been an unclean shutdown, ever. Yet, the filesystem is silently rotting, rotting and rotting. (Don't ever attempt to fix it with fsck, the only thing that this "utility" is capable of is a reliable filesystem destruction.)

On another note - seeing that the corruption yet again revolves around the /etc/{master.passwd,group,...} files, we basically haven't moved anywhere since Bug #4523 (IOW, blaming the /usr/sbin/pw utility was just, well... meh.)

Getting ZFS back at least should be a top priority for 2.4, IMNSHO. Afraid that's not something that's gonna happen, though. :-(((

#24 Updated by Derek Jackson about 1 year ago

Thank you for your quick reply. I'll just need to wait to see what comes of all of this.

#25 Updated by Kill Bill about 1 year ago

@Derek: If you are willing to go back to <=2.2.2 for initial full install, you could try this: https://forum.pfsense.org/index.php?topic=94656.0 - after that, it should survive upgrade to 2.3.x normally. (Of course, none of this is supported in any way.)

#26 Updated by Derek Jackson about 1 year ago

Kill Bill wrote:

@Derek: If you are willing to go back to <=2.2.2 for initial full install, you could try this: https://forum.pfsense.org/index.php?topic=94656.0 - after that, it should survive upgrade to 2.3.x normally. (Of course, none of this is supported in any way.)

I attempted the instructions you sent to me, but I keep getting this error.

kern.geom.debugflags: 0 -> 16
Deleting all gparts
Running: gpart delete -i 1 ada0
ada0s1 deleted
Running: dd if=/dev/zero of=/dev/ada0 count=3000
3000+0 records in
3000+0 records out
1536000 bytes transferred in 0.246504 secs (6231134 bytes/sec)
Cleaning up ada0
Running: dd if=/dev/zero of=/dev/ada0 count=2048
2048+0 records in
2048+0 records out
1048576 bytes transferred in 0.176085 secs (5954942 bytes/sec)
Running gpart on ada0
Running: gpart create -s mbr ada0
ada0 created
Running gpart add on ada0
Running: gpart add -b 63 -s 62533233 -t freebsd -i 1 ada0
ada0s1 added
Cleaning up ada0s1
Running: dd if=/dev/zero of=/dev/ada0s1 count=1024
1024+0 records in
1024+0 records out
524288 bytes transferred in 0.104270 secs (5028178 bytes/sec)
Stamping boot1 on ada0
Running: gpart bootcode -b /boot/boot1 ada0
bootcode written to ada0
NEWFS: /dev/ada0s1a - ZFS
Running: zpool create -m none -f tank0 ada0s1a
Running: zfs set atime=off tank0
Setting up ZFS boot loader support
Running: zpool set bootfs=tank0 tank0
Running: zpool export tank0
Running: dd if=/boot/zfsboot of=/dev/ada0s1 count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.016450 secs (31125 bytes/sec)
Running: dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024
dd: /dev/ada0s1a: Operation not supported
Error 1: dd if=/boot/zfsboot of=/dev/ada0s1a skip=1 seek=1024

#27 Updated by Kill Bill about 1 year ago

Managed to completely destruct entire system by a single power cycle. Unbootable, kernel panic, endless reboot cycle. Impressive - NOT! Production ready? Absolutely NOT. Guys, is anyone actually working on ZFS? We cannot continue deploying this UFS crap on remote locations, noone is going to pay the travel expenses for these self-destructing things.

#28 Updated by Vladimir Putin about 1 year ago

So far I am happy with 2.4 running on ZFS, even it highly experimental, I use on one non so critical production firewall about month. I have had issues with UFS after unclean shutdown at least three times during last 3 years on two production firewalls.

#29 Updated by ky41083 - 11 months ago

Turn off sync mode.

On a clean system, before an issue crops up. It will let UFS do what it does, and not generate errors fsck doesn't see.

Sync mode only "works" on systems where the entire storage stack supports that part of the ATA standard through every single layer. This is why you don't see most VM's effected. Most VM's don't support the ATA stack, guest OS kernel all the way down to physical host storage. Neither do most USB bus connected SD cards. Neither do most USB flash drives.

I have one old system with a nice CF card in it, that is actually ATA compliant, plugged directly into an onboard Intel IDE controller. All forced sync mode does, is make the CF card seem horribly slow. Which in turn results in dramatically increasing the possibility of filesystem errors, because file writes take ~5x longer to complete, so ~5x more of the time failures will cause errors. Disable all mounting in sync mode on this system, and not a single filesystem error fsck doesn't correct.

Metadata on UFS is always written in sync mode, unless explicitly disabled by mounting in async mode. The file contents are either written to disk, or not, regardless of mode, so why make it less likely to happen on better hardware stacks? At the least...

At the most, weird issues with compliant storage devices / stacks, that, shocker, can't be replicated in a VM, or other hardware, that doesn't support what sync mode does, through the entire storage stack.

#30 Updated by Renato Botelho 10 months ago

  • Status changed from Confirmed to Feedback

We were misusing fsck -F option. It's supposed to be used when you plan to run background fsck after filesystems are mounted, and we don't do it.

It could lead fsck (-F) to return 0 and rc script would consider filesystems as clean, then no bg fsck was called and filesystem will operate on a bad state. I guess after multiple time happening this could end up having filesystems on a really bad shape.

I've pushed changes to stop using -F and also added more checks to start a recovery shell when we cannot mount filesystems as read-write, so firewall will not finish booting and user will be able to run fsck manually and try to recover filesystems.

#31 Updated by Graham Collinson 9 months ago

I saw an issue like this on the first power cycle of a new SG-2440.
manually running fsck didn't find an issue but every boot caused a panic and reboot.
I followed this:
http://phaq.phunsites.net/2007/07/01/ufs_dirbad-panic-with-mangled-entries-in-ufs/comment-page-1/
and found it was the temp directory that had got corrupted. Removed that directory through the file system debugger, then recreated it and everything seems fine now.
Only had the one power cycle since this incident though.

#32 Updated by Jim Pingle 9 months ago

  • Target version changed from 2.4.0 to 2.3.3

#33 Updated by Renato Botelho 9 months ago

Graham Collinson wrote:

I saw an issue like this on the first power cycle of a new SG-2440.
manually running fsck didn't find an issue but every boot caused a panic and reboot.
I followed this:
http://phaq.phunsites.net/2007/07/01/ufs_dirbad-panic-with-mangled-entries-in-ufs/comment-page-1/
and found it was the temp directory that had got corrupted. Removed that directory through the file system debugger, then recreated it and everything seems fine now.
Only had the one power cycle since this incident though.

I suppose this box was running 2.3.2, right?

#34 Updated by Renato Botelho 9 months ago

  • Target version changed from 2.3.3 to 2.4.0

I tried to break it again and couldn't after last changes. Instead of closing it without hearing from more people I'll move it back to 2.4.0 since it's not a regression and it will not hold 2.3.3

#35 Updated by Graham Collinson 9 months ago

yes it was on 2.3.2

#36 Updated by Renato Botelho 9 months ago

Graham Collinson wrote:

yes it was on 2.3.2

It was changed since 2.3.2. New code is on 2.3.3 and 2.4.0 snapshots.

#37 Updated by James Dekker 8 months ago

After power cycling an appliance running 2.4.0.b.20170401.1306 with UFS FS for over an hour, dozens of fsck's were triggered but in every case the system booted successfully.

#38 Updated by brennen smith 5 months ago

I just wanted to add the following, as this was one of the #1 reasons I was nervous about using PFSense.

Before the patch, I could (within 10 or so tries) corrupt a PFSense build by manually hitting reset in VirtualBox. Same as the issues above, but it would consistently and easily happen.

I noticed the patch was committed, so I wanted to test if it had actually fixed the issue - so I automated the following test. This took a vanilla PFSense 2.3.3 install with the basic wizard complete and two VM's behind it performing network transfers (repeated small curls). There were no packages installed, so I can't speak to the stability of something that hits the filesystem hard, such as squid.

I then ran this script on the hypervisor - essentially it would randomly sleep for 1-200 seconds and then issue a forced reset on the PFSense box (recording the output to CSV). I chose 200 seconds as it would typically complete a boot in 80-90 seconds, so there was decent probability of hitting a reset mid boot or after full boot.

#!/bin/bash

var=0

echo "Run,SleepTime,Time" 

while :
do
    let "var++" 
    vboxmanage controlvm pfsense reset
    sleepTime=$(( ( RANDOM / 200 )  + 1 ))
    echo "$var,$sleepTime,$(date +%s)" 
    sleep $sleepTime
done

I ran this for 8 hours straight - and lo and behold, it was still booting! This would be good to run for a longer period of time, but I needed to shut my laptop to bring it to work, so that wasn't an option. I attached the CSV output of the run for anyone curious.

I'm not saying the core issue is fixed (I still lament that nanobsd and the RO based filesystem is going the way of the dodo), but it's a massive improvement.

#39 Updated by Renato Botelho 2 months ago

  • Status changed from Feedback to Resolved

Also available in: Atom PDF