Project

General

Profile

Actions

Bug #6494

closed

Hang during bootup on lock('filter.lock')

Added by Harry Coin about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Unknown
Target version:
Start date:
06/16/2016
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Release Notes:
Affected Version:
All
Affected Architecture:

Description

Hang during bootup 3 of 5 tries on lock('filter.lock')

Why does filterlog hold /tmp/filter.lock open for writing?

To see this, I added these debug writes to /etc/inc/filter.inc:

...
echo "1\n";
/* Use filter lock to not allow concurrent filter reloads during this run. */
$filterlck = lock('filter', LOCK_EX);
echo "2\n";

filter_pflog_start();
echo "3\n";
update_filter_reload_status(gettext("Initializing"));
echo "4\n";
/* invalidate interface cache */
get_interface_arr(true);
echo "5\n";
if (isset($config['system']['developerspew'])) {
$mt = microtime();
echo "filter_configure_sync() being called $mt\n";
}
/* Get interface list to work with. */
filter_generate_optcfg_array();
echo "6\n";
if (platform_booting() true) {
echo gettext("Configuring firewall");
}
/* generate aliases */
if (platform_booting() true) {
echo ".";
}
echo "7\n";
update_filter_reload_status(gettext("Creating aliases"));
...

and /etc/inc/util.inc:
...
function lock($lock, $op = LOCK_SH) {
global $g, $cfglckkeyconsumers;
if (!$lock) {
die(gettext("WARNING: A name must be given as parameter to lock() function."));
}
if (!file_exists("{$g['tmp_path']}/{$lock}.lock")) {
echo "making {$g['tmp_path']}/{$lock}.lock\n";
@touch("{$g['tmp_path']}/{$lock}.lock");
@chmod("{$g['tmp_path']}/{$lock}.lock", 0666);
}
$cfglckkeyconsumers++;
echo "locking {$g['tmp_path']}/{$lock}.lock, count {$cfglckkeyconsumers}\n";
...

Heres's the boot trace on the serial console:

Welcome to pfSense 2.3.1-RELEASE (Patch 1) on the 'pfSense' platform...

No core dumps found.
Creating symlinks......ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/local/lib /usr/local/lib/ipsec
32-bit compatibility ldconfig path: /usr/lib32
done.
making /tmp/config.lock
locking /tmp/config.lock, count 1
External config loader 1.0 is now starting... vtbd0s1w starting... vtbd0s1a
Launching the init system.....locking /tmp/config.lock, count 1
.. done.
Initializing.................. done.
Starting device manager (devd)...done.
Loading configuration....locking /tmp/config.lock, count 2
locking /tmp/config.lock, count 3
..done.
Updating configuration...done.
Cleaning backup cache.................................done.
Setting up extended sysctls...done.
Setting timezone...done.
Configuring loopback interface...done.
Starting syslog...done.
Starting Secure Shell Services...done.
Setting up polling defaults...done.
Setting up interfaces microcode...done.
Configuring loopback interface...done.
Creating wireless clone interfaces...done.
Configuring LAGG interfaces...done.
Configuring VLAN interfaces...done.
Configuring QinQ interfaces...done.
Configuring ISP1 interface...done.
Configuring LAN interface...done.
Configuring ISP2 interface...done.
Configuring CARP settings...done.
Configuring PFSYNC interface...done.
Configuring CARP settings...done.
Syncing OpenVPN settings...tap1: Ethernet address: xxxxx
done. << NOTE: Maybe a 4 sec delay here >>
locking /tmp/resolvconf.lock, count 4
resolv.conf generated.
1
locking /tmp/filter.lock, count 5

... N.B. Magically, the file /tmp/filter.lock must already exist as the 'making' debug write never runs.
... ssh works at this point, bootup hangs here forever. Logging in by ssh gets:

fuser /tmp/filter.lock
/tmp/filter.lock: 2792w 90560w 36404w 299w

Notice: there is no report of any lock being held.
[2.3.1-RELEASE][]/root: ps axu | grep 2792
root 2792 0.0 1.6 268248 33408 - I 11:39AM 0:00.11 php-fpm: pool nginx (php-fpm)
[2.3.1-RELEASE][]/root: ps axu | grep 90560
root 90560 0.0 1.6 268248 33760 - I 11:39AM 0:02.26 php-fpm: pool nginx (php-fpm)
[2.3.1-RELEASE][]/root: ps axu | grep 36404
root 36404 0.0 0.1 16676 2088 - Ss 11:39AM 0:00.01 /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
[2.3.1-RELEASE][]/root: ps axu | grep 299
root 299 0.0 1.9 272244 39260 u0 I+ 11:38AM 0:41.37 /usr/local/bin/php-cgi -f /etc/rc.bootup

Killing off filterlog by
$kill 36404
allows boot to continue, whereafter we get normal operations (including a restart of filterlog).
.. bootup log continues:

2
3
4
5
6
Configuring firewall.7
.....done.
Starting PFLOG...done.
Setting up gateway monitors...done.
Synchronizing user settings...done.
Starting webConfigurator...done.
Configuring CRON...done.
Starting DNS Resolver...done.
Starting NTP time client...done.
Starting DHCP service...done.
Starting DHCPv6 service...done.
1
locking /tmp/filter.lock, count 6
2
3
4
5
6
Configuring firewall.7
.....done.
making /tmp/ipsec.lock
locking /tmp/ipsec.lock, count 7
Generating RRD graphs...done.
Starting syslog...done.
Starting CRON... done.
locking /tmp/config.lock, count 1
Starting package arping...done.
Starting package bind...locking /tmp/config.lock, count 2
done.
Starting package iperf...done.
Starting package Notes...done.
Starting package nut...done.
Starting package OpenVPN Client Export Utility...done.
Starting package squid3...done.
Starting /usr/local/etc/rc.d/named.sh...done.
Starting /usr/local/etc/rc.d/sqp_monitor.sh...done.
locking /tmp/config.lock, count 1
locking /tmp/config.lock, count 1
pfSense (pfSense) 2.3.1-RELEASE (Patch 1) amd64 Wed May 25 14:53:06 CDT 2016
Bootup complete

FreeBSD/amd64 (gate1.quietfountain.com) (ttyu0)

locking /tmp/config.lock, count 1
  • Welcome to pfSense 2.3.1-RELEASE-p1 (amd64 full-install) on ....
Actions #1

Updated by Chris Buechler about 5 years ago

  • Status changed from New to Feedback

there's no magic about whether the lock file exists, its presence isn't a lock on its own. It's never deleted, outside of /tmp being wiped early during boot.

filterlog doesn't hold the lock file either, that's probably some passthrough from the fact it was launched by PHP, which has the lock. Seems like it ends up stuck in filter_pflog_start given where the boot hangs. Add some more debug logging there in filter_configure_sync and filter_pflog_start to see.

given this happens to no one else, I suspect you have some kind of modification on the system?

Actions #2

Updated by Harry Coin about 5 years ago

Just happened again on a clean install of 2.3.1-p5

It deadlocks HARD on lock('filter.lock'). 100% of the time, to get out of it, all that's necessary is to ssh in, fuser /tmp/filter.lock, find out which process is filterlog, "kill <filterlog pid>" and it resumes perfectly.

Actions #3

Updated by Harry Coin about 5 years ago

I haven't figured out yet why two php-fpm processes hold filter.lock open during the bootup. I wonder if it has something to do with a GUI window(s) being left open in the browser, and that activity happening at a time before the bootup process is ready for it. Seems very odd two php-fpm processes and filterlog hold /tmp/filter.lock open.

Actions #4

Updated by Harry Coin about 5 years ago

Notice too that filterlog re-spawns on its own after having been killed.

Actions #5

Updated by Harry Coin about 5 years ago

The only thing I can think of that's unusual about this install is that it starts a site-site vpn client, which takes a bit of time just before the hard lock fails. I bet it's just that special correct amount of time some other process can't continue without also having the filter lock. It's a deadlock or as the brits call it 'deadly embrace'.

Actions #6

Updated by Chris Buechler about 5 years ago

you're doing something far more atypical than having a site to site VPN. GUI windows in the browser being left open can't do anything during bootup unless you logged back in already, their sessions are invalidated by the time it's booting.

you're going to have to add some more debug logging and track it down further given this certainly isn't a replicable problem and not something anyone else has seen.

Actions #7

Updated by Harry Coin about 5 years ago

I've got a case that hangs every type as noted above. And I have a patch.

/cf/hc_filterlog:

#!/bin/tcsh
3>&-
4>&-
5>&-
6>&-
7>&-
8>&-
/cf/filterlog $*

Then

cp /usr/local/sbin/filterlog /cf/
ln -sf /usr/local/sbin/filterlog /cf/hc_filterlog

And it works every time. What's happening is the exec of filterlog gets a copy of all the file descriptors open at the time. That includes a locked write to /tmp/filter.lock. In my case the boot timing is just right so a second call to lock the filter.lock happens while the prior instance of filterlog still exists.

Notice in the php for the code that launches filterlog there's an unfinished thought to kill the previous instance. I think this has happened before.

So, I think a real fix is to add

closeall(3);

near the top of filterlog.c

Actions #8

Updated by Chris Buechler about 5 years ago

  • Status changed from Feedback to Confirmed
  • Assignee set to Luiz Souza
  • Target version set to 2.3.1-p6
  • Affected Version changed from 2.3.1 to All

Nice catch, Harry.

Looks like this should be easy to fix, Luiz please review.

Actions #9

Updated by Chris Buechler about 5 years ago

  • Target version changed from 2.3.1-p6 to 2.3.2
Actions #10

Updated by Harry Coin about 5 years ago

Corrected patch (ln -s terms swapped)

cat > /cf/hc_filterlog
#!/bin/tcsh
3>&-
4>&-
5>&-
6>&-
7>&-
8>&-
/cf/filterlog $*

cp /usr/local/sbin/filterlog /cf/
ln -sf /cf/hc_filterlog /usr/local/sbin/filterlog
Actions #11

Updated by Luiz Souza about 5 years ago

  • Status changed from Confirmed to Feedback

Yes, this looks correct. In addition to this fix I would like to open the files with FD_CLOEXEC set, but I could not find a way to do that on PHP.

The closefrom(3) call was committed to filterlog and should be available on next snapshots.

Please let us know if that works for you.

Actions #12

Updated by Chris Buechler about 5 years ago

That seems to work.

Harry: could you please try the latest 2.3.2 snapshot ASAP (we're looking to build release on Monday) and let us know? You're the only one I'm aware of that can hit the issue in a real-world scenario.

Actions #13

Updated by Harry Coin about 5 years ago

I've never tried a 'snapshot' before this.

So... here goes.... While I'm waiting for 2.3.2.a.20160714.0044:

The bug previously appears to happen in the moments after 'Syncing OpenVPN Settings' appears in the boot sequence, even though in this case that checkbox in the HA list is not checked. The note 'configuring firewall' which appears just after does not appear until either my fix was applied or an ssh client killed off the filterlog process.

Some thoughts while still waiting...

1) Please retain the ability to plug private code files in /usr/local/pkg items to customize the filter rules. If it wasn't there I'd have to create it.

2) Only if this release doesn't already do it: It would be nice for the ipv6 DHCP package to sync the static ip list to the failover system config.xml like the v4 version does. Hand hacking the config.xml for the failover gate is where we are now. I don't care about the leases, splitting the range is fine for that.

3) Latest version: Why does every debug message in resolver.log appear twice for Bind?
Jul 14 11:16:59 gate1 named42081: found 2 CPUs, using 2 worker threads
Jul 14 11:16:59 gate1 named42201: found 2 CPUs, using 2 worker threads
Jul 14 11:17:00 gate1 named42201: using 1 UDP listener per interface
Jul 14 11:17:00 gate1 named42081: using 1 UDP listener per interface
Jul 14 11:17:00 gate1 named42201: using up to 4096 sockets
Jul 14 11:17:00 gate1 named42081: using up to 4096 sockets
Jul 14 11:17:01 gate1 named42201: loading configuration from '/etc/namedb/named.conf'
Jul 14 11:17:01 gate1 named42081: loading configuration from '/etc/namedb/named.conf'
..
Jul 14 11:17:01 gate1 named42201: listening on IPv4 interface bridge0, 198<yadda>#53
Jul 14 11:17:01 gate1 named42201: listening on IPv6 interface bridge0, 2605:<yadda>:2#53
Jul 14 11:17:01 gate1 named42081: could not listen on UDP socket: address in use
Jul 14 11:17:01 gate1 named42081: creating IPv4 interface bridge0 failed; interface ignored
Jul 14 11:17:01 gate1 named42081: listening on IPv6 interface bridge0, 2605:<yadda>:2#53
Jul 14 11:17:01 gate1 named42081: could not listen on UDP socket: address in use
Jul 14 11:17:01 gate1 named42081: creating IPv6 interface bridge0 failed; interface ignored
..

The new version booted far enough to give me ssh access (it did this before the changes as well). And:
The update process did not replace the link to my hack/patch at /usr/sbin/filterlog. Rolling back...
Rebooting old version with patch...
deleting my link to hack, replacing with old filterlog ( I copied it back from /cf where my script eventually loaded it):

-r-xr-xr-x 1 root wheel 24512 Jul 14 10:15 /usr/sbin/filterlog

Re-update, I noticed the filterlog package was among those in the update list.
re-boot:

Well, I'm not sure what I have running. On the one hand I see:

pfSense (pfSense) 2.3.1-RELEASE (Patch 5) amd64 Thu Jul 14 00:44:02 CDT 2016

and then I see:

  • Welcome to pfSense 2.3.2-DEVELOPMENT (amd64 full-install) on

4) Would it foil some key long term goal to NOT change existing ssh host keys on update? I'm so tired of

ssh-keygen -f "/root/.ssh/known_hosts" -R <dns name>
ssh-keygen -f "/root/.ssh/known_hosts" -R <ip address>

on all the necessary systems. Maybe a new package called 'Stuff you should do every year, or every month if you know you're a high risk target' (or a shorter name, like 'Odo'. It checksum checks all the system files, lists all the files missing/added to the std distrib, changes all the host keys, regenerates all the certs that have private keys stored..., nags the admin re passwords that haven't changed.

Ok, it booted far enough to give me a shell access.

new filterlog:
.... unchanged in date, and size. Are you sure you've updated the filterlog package??

Wait a sec... I think the "in the movies" phrase is 'my spidey sense is tingling....'

..root: find / -name filterlog -print
/usr/local/sbin/filterlog
/usr/sbin/filterlog

Oh dear. The new one leaves the old intact. I wonder how many other old binary executables are laying about? Two instances of bind maybe?

I deleted the old binary, removed my hack, deleted the filterlog that was left by the previous install in /usr/sbin, rebooted with only the new filterlog in place and...

... it works.

Actions #14

Updated by Chris Buechler about 5 years ago

  • Status changed from Feedback to Resolved

thanks for the feedback

Actions

Also available in: Atom PDF