Bug #4523: master.passwd/group file corruption may occur after kernel panic or unclean shut down - pfSense - pfSense bugtracker

Bug #4523

closed

master.passwd/group file corruption may occur after kernel panic or unclean shut down

Added by Chris Buechler about 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Very High
Assignee: Chris Buechler
Category: Operating System
Target version: 2.2.4
Start date: 03/14/2015
Affected Version: 2.2.x

Description

After a kernel panic, the passwd and/or group files may be corrupt. This seems to be a problem common to FreeBSD 10.1, and potentially fsck-related.

Files

2015-07-05_pfsense_2.2.3_corrupted_seriallog.txt (15.1 KB)

Thomas X, 07/05/2015 09:10 AM

Updated by Phillip Davis about 9 years ago

Jeremy already filed a similar bug report, #4519, some hours ago.

Updated by Chris Buechler about 9 years ago

We'll keep this one, since it's more specific to the root problem at hand. Closed the other as a duplicate.

Updated by Chris Buechler about 9 years ago

Target version changed from 2.2.2 to 2.2.3

Updated by Jim Pingle almost 9 years ago

I thought I added this here a while back but apparently not.

I have tried combinations of:

Soft updates
SU+J
Sync vs Async
Disabling atime (mostly to see if fewer writes helped)

Each time it only took a handful of power pulls (usually 1-3) or manually initiated panics (sysctl debug.kdb.panic=1) before /etc was corrupted.

Sometimes whole files are swapped, other times portions of them are overlapping.

Updated by Chris Buechler almost 9 years ago

Subject changed from /etc file corruption may occur after kernel panic to /etc file corruption may occur after kernel panic or unclean shut down

This is replicable with just an unclean shut down.

Updated by Kill Bill almost 9 years ago

Reading things like this:
- https://forums.freebsd.org/threads/freebsd-on-ufs-preventing-data-loss-on-crash.30683/
- https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=183042

I am rather surprised this has not been observed earlier and wondering which was the last fsck version that was not completely braindead. Now, pulling the plug is one thing, however screwing the FS on kernel panic or unclean shutdown is really just WTF.

Updated by Ermal Luçi almost 9 years ago

The installer and nano have been switched to SU+J, the same as the FreeBSD default.

Updated by Ermal Luçi almost 9 years ago

Status changed from Confirmed to Feedback

Improvements to how the filesystem check/correction is done have been merged, which should make the corruption harder to reproduce.

Updated by Chris Buechler almost 9 years ago

Status changed from Feedback to Confirmed

still an issue

#10

Updated by ky41083 - almost 9 years ago

Nano using SU+J = bad. Either go back to plain sync or just SU. All journaling does is double all meta-data writes with the goal of making fsck faster.

Doubling meta-data writes is very bad for the flash memory nano is almost always installed on. We know this. The entire goal of SU+J vs SU is essentially a faster fsck run, and is designed with spinny drives and minimizing head thrashing in mind. This should all be gladly sacrificed for extended flash storage life.

Going out on a limb, has anyone tried running fsck multiple (2+) times back to back? I've dealt with this issue on Linux running from flash, fsck had to be called at LEAST twice in a row to get a properly clean filesystem. Some parts had to be fixed to further fix additional parts, etc. Eventually my fix was to write a script that ran it 3 times in a row, haven't had an issue for years since.
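A minimal sketch of the "run it several times in a row" approach described above. The helper name run_n_times and the device path in the usage note are hypothetical, chosen only for illustration; this is not the script the poster used.

```sh
#!/bin/sh
# Run a command a fixed number of times back to back and return the exit
# status of the final run. run_n_times is a hypothetical helper name.
run_n_times() {
    count=$1
    shift
    status=0
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@"            # run the command exactly as given
        status=$?       # remember the most recent exit status
        i=$((i + 1))
    done
    return "$status"
}
```

It could be invoked as `run_n_times 3 fsck -y /dev/ada0s1a`, where the device name is an assumption and would need to match the actual root slice.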

Seems to be an issue on FreeBSD as well:
https://forums.freebsd.org/threads/softupdate-with-journaling-decrease-reliability.39125/

Hope this helps.

#11

Updated by Chris Buechler almost 9 years ago

Subject changed from /etc file corruption may occur after kernel panic or unclean shut down to master.passwd/group file corruption may occur after kernel panic or unclean shut down
Priority changed from High to Very High

Updated the subject to reflect the narrowed-down problem.

With SU, with or without J, you end up with 0 byte master.passwd, passwd, group, pwd.db, spwd.db. Or some subset of those. Without SU, you end up with master.passwd and/or group corrupted, containing parts of other files in /etc/.

It's replicable on stock FreeBSD 9.0 through 11-CURRENT by running the following:

#!/bin/sh

# Remove any existing admin user and create both groups with fixed GIDs.
/usr/sbin/pw userdel -n 'admin'
/usr/sbin/pw groupadd all -g 1998
/usr/sbin/pw groupadd admins -g 1999
# Empty the member list of group "all".
/usr/sbin/pw groupmod all -g 1998 -M ''
# Set root's shell and password; -H 0 reads the already-encrypted hash from stdin.
echo \$6\$O/T6GYkcgYvOBTGm\$KvOh3zhFKiA6HMEPktuImAI8/cetwEFsgj7obXdeTcQvn6mhs50HgkWt6nfnxNhTIb2w4Je6dqdKtARavxThP1 | /usr/sbin/pw usermod -q -n root -s /bin/tcsh -H 0
# Re-create admin as a second UID 0 account (-o allows the duplicate UID).
echo \$6\$O/T6GYkcgYvOBTGm\$KvOh3zhFKiA6HMEPktuImAI8/cetwEFsgj7obXdeTcQvn6mhs50HgkWt6nfnxNhTIb2w4Je6dqdKtARavxThP1 | /usr/sbin/pw useradd -m -k /etc/skel -o -q -u 0 -n admin -g wheel -s /bin/sh -d /root -c 'System Administrator' -H 0
# Unlock admin and put UID 0 into both groups.
/usr/sbin/pw unlock admin -q
/usr/sbin/pw groupmod all -g 1998 -M '0'
/usr/sbin/pw groupmod admins -g 1999 -M '0'

then power cycling the system. If using SU, you'll end up with 0 byte files. Without SU, you'll have corrupted files containing parts of some other file(s) in /etc.
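After the power cycle, the 0 byte case can be checked with something like the sketch below. The function name empty_pw_files is hypothetical, and the directory argument exists only so the check can be pointed at a copy of /etc. It detects missing or empty files only, not the non-SU case where the files contain parts of other files.

```sh
#!/bin/sh
# Print the name of every account database file that is missing or has a
# size of zero bytes in the given directory (defaults to /etc).
# empty_pw_files is a hypothetical helper name for this sketch.
empty_pw_files() {
    dir=${1:-/etc}
    for f in master.passwd passwd group pwd.db spwd.db; do
        # -s is true only when the file exists and is larger than zero bytes
        if [ ! -s "$dir/$f" ]; then
            echo "$f"
        fi
    done
}
```

Running `empty_pw_files /etc` on a healthy system should print nothing.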

Still investigating, we'll be reporting specifics upstream soon.

#12

Updated by Kill Bill almost 9 years ago

Chris Buechler wrote:

If using SU, you'll end up with 0 byte files. Without SU, you'll have corrupted files containing parts of some other file(s) in /etc.

Is this before or after running fsck? IOW, is UFS just unusable, or is fsck the one mangling the files?

Meanwhile, is this ZFS howto still valid for 2.2.x? https://forum.pfsense.org/index.php?topic=71953.0

Frankly, I've had enough of this. I haven't created a user/group on any of the boxes that get randomly screwed for ages. WTH are these files being constantly damaged?

#13

Updated by Chris Buechler almost 9 years ago

That's after fsck (including after multiple runs). They aren't "constantly damaged", only after unclean shut downs, and only a minority of the time at that. The above shell script replicates the problem 100% reliably on stock FreeBSD, but in real-world usage you're much less likely to hit it. It seems to affect slower drives more often than faster ones: anywhere from one in a handful of plug pulls to one in 2-3 dozen.

No idea if those ZFS instructions still work. They might. We haven't done anything to intentionally disable that, but we don't test it either.

#14

Updated by Kill Bill almost 9 years ago

Chris Buechler wrote:

That's after fsck (including after multiple runs).

Well what I meant is actually whether it's fsck screwing those files or whether they were truncated/mangled already before running fsck. (IOW, power off the system and stick the drive into some other box and mount it, instead of trying to boot from it and letting fsck do its sloppy job.)

#15

Updated by Jim Thompson almost 9 years ago

It's not fsck.

It's likely a bug in SU (with or without journaling).

The fix (for now) is to mount / "sync" on all pfSense installs (nano or full).

WTH are these files being constantly damaged?

Because of the internals of what the "pw" command does, combined with some bug in UFS that has yet to be found.

#16

Updated by Phillip Davis almost 9 years ago

"sync" seems like "a good thing" on root file system "/" for pfSense use cases anyway. pfSense uses would not modify stuff in "/" very often at run time, and thus having that root file-system activity be synchronous would have almost imperceptible performance impact. Might as well use "sync" whether the underlying bug here is fixed or not.
Real-time file system activity on pfSense is mostly to /var and /tmp for updating DHCP lease files, writing log entries and such like, plus packages that cache stuff (Squid...).

#17

Updated by Kill Bill almost 9 years ago

Updated ZFS howto for people who are on full install and are simply tired of this... https://forum.pfsense.org/index.php?topic=94656

(One would assume UFS to be somewhat mature after all those years... ugh.)

#18

Updated by Denny Page almost 9 years ago

Does sync actually avoid the issue? Update 4 suggested that this was not the case...

Sync for the root fs generally seems like a good idea, but only if it is updated infrequently. Given that the default install has /var in the root fs, this would not be a good choice if there are packages that update /var frequently (ntopng, squid, etc.).

#19

Updated by Jim Pingle almost 9 years ago

It was apparently an error in my notes... I looked back at a forum post I made when I first tested that in mid-April, and I had noted that although fsck still ran and found issues with sync, the files remained intact: https://forum.pfsense.org/index.php?topic=88439.msg511477#msg511477

#20

Updated by Chris Buechler almost 9 years ago

sync definitely avoids the root issue. I have a system that's now upwards of 1000 power cycles with 0 issues with sync.

The root problem seems to be within pw rather than anything to do with UFS. We'll pursue a proper fix there. In the meantime, setting sync does fix the problem and shouldn't have a negative impact for our use cases.

I updated the installer to set sync. We'll need to add code to add that to fstab on upgraded systems.
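As a rough illustration of what adding sync to fstab on an upgraded system could look like, here is a sketch. The function name fstab_add_sync is hypothetical and this is not the actual pfSense upgrade code. It prints the modified fstab to stdout rather than editing the file in place.

```sh
#!/bin/sh
# Print the fstab named in $1 with "sync" appended to the options of the
# root (/) entry. fstab_add_sync is a hypothetical name for this sketch.
fstab_add_sync() {
    awk '
        # Pass comments and blank lines through untouched.
        /^[ \t]*(#|$)/ { print; next }
        # Field 2 is the mount point, field 4 the comma-separated options.
        # Skip the entry if sync is already present.
        $2 == "/" && $4 !~ /(^|,)sync(,|$)/ {
            # Assigning a field makes awk rebuild the line with single
            # spaces between fields, which fstab accepts.
            $4 = $4 ",sync"
        }
        { print }
    ' "$1"
}
```

A caller could write the result to a temporary file and move it over /etc/fstab once it has been checked. Entries that already carry sync are printed unchanged, so running it twice gives the same result.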

#21

Updated by Denny Page almost 9 years ago

Wow, there's a name I haven't heard in 20+ years.

#22

Updated by Jim Thompson almost 9 years ago

Kill Bill wrote:

Updated ZFS howto for people who are on full install and are simply tired of this... https://forum.pfsense.org/index.php?topic=94656

(One would assume UFS to be somewhat mature after all those years... ugh.)

Do let me know when you have sufficient experience with filesystems to decide if something is "mature" or not.

#23

Updated by Jim Thompson almost 9 years ago

Denny Page wrote:

Wow, there's a name I haven't heard in 20+ years.

Yes, and cmb shouldn't have quoted a private communication without permission. I've edited his post.

#24

Updated by Ermal Luçi almost 9 years ago

Status changed from Confirmed to Feedback

#25

Updated by Kill Bill almost 9 years ago

There's something badly broken on nanobsd with this...

https://forum.pfsense.org/index.php?topic=94900.0

#26

Updated by Jim Pingle almost 9 years ago

Status changed from Feedback to Confirmed

Moving this back to Confirmed since the upgrade code is still missing for existing installations, and it appears as though on the 2.2.3 snapshots the sync flag is not being added to the root slice during install. I see the code in the bsdinstaller repo but it's not in the snapshots.

#27

Updated by Ermal Luçi almost 9 years ago

Status changed from Confirmed to Feedback

The installer has been updated for new snaps and the upgrade code has been put in place.

#28

Updated by Chris Buechler almost 9 years ago

Status changed from Feedback to Resolved

fixed. We'll again verify as part of the release test matrix on each install type.

#29

Updated by Chris Buechler almost 9 years ago

Status changed from Resolved to Feedback
Target version changed from 2.2.3 to 2.2.4
Affected Version changed from 2.2 to 2.2.x

This is adequately worked around in 2.2.3 through the use of sync. Now that we have a proper fix for pw in 2.2.4, sync has been removed from the installer, and the upgrade code has been changed to remove sync where it's enabled, I'm moving this back to Feedback to confirm those sync changes.

#30

Updated by Thomas X almost 9 years ago

File 2015-07-05_pfsense_2.2.3_corrupted_seriallog.txt added

Today I had a power loss with pfSense 2.2.3 AMD64 NanoBSD, which seems to have corrupted the installation. The system was upgraded from pfSense 2.2.1 AMD64 NanoBSD 7 days ago.

Afterwards, when power was available again, the system didn't come up correctly: the web frontend showed an internal server error, and login was not possible even on the serial console.

See the attached log which was recorded when doing another hard reset. Switching the bootup slice made my day, now running 2.2.1 just fine.

I'm not sure if this corruption is related to this issue; please ignore if it's not. I was just wondering why this could happen although sync was added in 2.2.3.

Best regards
Thomas

#31

Updated by Thomas X almost 9 years ago

One addition: the filesystem was in standard NanoBSD mode (read-only) when the loss of power occurred.

#32

Updated by Kill Bill almost 9 years ago

Thomas X wrote:

I was just wondering why this could happen although sync was added in 2.2.3.

Probably because the sync mount option was never a proper fix in the first place; plus, it performs absolutely horribly even on full installs with fast SATA HDDs. Try the latest 2.2.4 snapshots.

#33

Updated by Jim Thompson almost 9 years ago

The sync option was not an optimal fix, but it was a proper fix, as it does fix the corruption issue, and was what we could get done (with testing) prior to the correct fix (which is in 2.2.4 and in FreeBSD.)

#34

Updated by Jim Thompson almost 9 years ago

Assignee set to Chris Buechler

#35

Updated by Chris Buechler almost 9 years ago

Status changed from Feedback to Resolved

sync no longer added to new installs, and confirmed the upgrade code removes it where it's set and doesn't change anything where it isn't.
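The behavior confirmed above (remove sync where it's set, change nothing where it isn't) can be sketched as follows. The function name fstab_remove_sync is hypothetical, the fallback to "rw" when sync was the only option is an assumption, and this is not the actual pfSense upgrade code.

```sh
#!/bin/sh
# Print the fstab named in $1 with "sync" dropped from the options of the
# root (/) entry. fstab_remove_sync is a hypothetical name for this sketch.
fstab_remove_sync() {
    awk '
        # Pass comments and blank lines through untouched.
        /^[ \t]*(#|$)/ { print; next }
        $2 == "/" {
            # Rebuild the option list without any "sync" element.
            n = split($4, opts, ",")
            kept = ""
            for (i = 1; i <= n; i++) {
                if (opts[i] != "sync") {
                    kept = (kept == "" ? opts[i] : kept "," opts[i])
                }
            }
            # Only touch the line when an option was actually dropped, so
            # entries without sync come out byte for byte unchanged.
            if (kept != $4) {
                # Assumption: fall back to "rw" if sync was the only option.
                $4 = (kept == "" ? "rw" : kept)
            }
        }
        { print }
    ' "$1"
}
```

As with any fstab edit, the output is best written to a temporary file and inspected before it replaces /etc/fstab.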
