Bug #6311

closed

pfSense 2.3 locking up

Added by Markus Strangl over 8 years ago. Updated over 8 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: Unknown
Target version: -
Start date: 05/04/2016
Due date:
% Done: 0%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version:
Affected Architecture:

Description

Hi pfSense team,
since upgrading from 2.2.6 to 2.3 we have had a series of weird lock-ups on our pfSense clusters.
For no apparent reason, all services suddenly seem to die, i.e. no traffic is routed anymore, the WebGUI isn't reachable, and in all but one case even SSH didn't reply anymore.
The machine doesn't panic or reboot, though, so we had to pull the plug to reset it. As a consequence, there are no crash reports available.
The system seems to be "just alive enough" that no watchdog timer is triggering, and CARP does not initiate a failover.
The only thing shown in the system logs at the time of the lockup is this:
system.log:
Apr 28 22:37:56 kws-fw01 check_reload_status: updating dyndns GW_WAN
Apr 28 22:37:56 kws-fw01 check_reload_status: Restarting ipsec tunnels
Apr 28 22:37:56 kws-fw01 check_reload_status: Restarting OpenVPN tunnels/interfaces
Apr 28 22:37:56 kws-fw01 check_reload_status: Reloading filter
Apr 28 22:37:58 kws-fw01 php-fpm: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use GW_WAN.
Apr 28 22:37:58 kws-fw01 xinetd26075: Starting reconfiguration
Apr 28 22:37:58 kws-fw01 xinetd26075: Swapping defaults
Apr 28 22:37:58 kws-fw01 xinetd26075: readjusting service 6969-udp
Apr 28 22:37:58 kws-fw01 xinetd26075: Reconfigured: new=0 old=1 dropped=0 (services)
- this is the time when our monitoring systems start reporting the infrastructure behind the pfSense going dead -
Apr 28 22:48:13 kws-fw01 sshd7986: Accepted keyboard-interactive/pam for admin from 80.120.61.1 port 29853 ssh2
Apr 28 22:49:21 kws-fw01 reboot: rebooted by admin

There are no further entries in any of the log files between the dropout time and the reboot, so I'm not sure whether this is due to no traffic being handled anymore, or to the syslog daemon locking up as well.
The machines worked fine with 2.2.6 and have been thoroughly stress tested, so I'm pretty sure no hardware issue is involved.
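
For anyone trying to pin down a similar dropout window, here is a minimal Python sketch that scans syslog-style lines for silent periods. It assumes the classic `Mon DD HH:MM:SS` prefix seen in the system.log excerpt above. The names `parse_ts` and `find_gaps`, the 5-minute threshold, and the default year are assumptions made for illustration, not anything pfSense ships.

```python
from datetime import datetime, timedelta


def parse_ts(line, year=2016):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; return None if the line has none."""
    try:
        # Syslog stamps carry no year, so one is supplied explicitly (assumption).
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs where consecutive stamped
    lines are further apart than the threshold."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip notes or continuation lines without a timestamp
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps
```

Fed the excerpt above, this reports a single gap from 22:37:58 to 22:48:13. A gap alone cannot tell whether syslogd stalled or simply had nothing to log.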

System info:
SuperMicro Intel Westmere rack boxes, 2 each in HA Cluster with CARP and pfSync, Intel x540 10G network cards (ix driver)

Actions #1

Updated by Jim Pingle over 8 years ago

  • Status changed from New to Feedback

Are you using IPsec? Look at #6296 and see if that might be the same condition you are hitting.

Actions #2

Updated by Markus Strangl over 8 years ago

Jim Pingle wrote:

Are you using IPsec? Look at #6296 and see if that might be the same condition you are hitting.

We're using IPsec tunnels to 5 other locations, but I don't see anything unusual in the CPU usage. 'top' stays at less than 5%, far from any fully loaded core.

I have now reverted the affected remote clusters to 2.2.6 to ensure availability of our production locations, but one of our identical local clusters is starting to exhibit similar misbehavior. I've sniffed around a bit, and it seems to start with the IPsec tunnels no longer passing traffic. Next the WebGUI dies, with the syslog stating that nginx is no longer receiving data from the backend, which seems to be php-fpm. Restarting php-fpm via the SSH menu brings the WebGUI back for about 2-3 minutes, after which php-fpm seems to die again. Ultimately, only a reboot brings the machine back to a workable state. Trying to CARP-failover to the secondary cluster node leads to php-fpm throwing a crash dump and CARP subsequently showing up as inactive (instead of the backup state it's really in). (The php-fpm crash dump is in the crash submission mailbox, if you need it.)
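
The failure order described here (tunnels first, then the WebGUI, and in the worst cases SSH) could be tracked from a monitoring host with a small TCP probe. The following is a sketch under stated assumptions: the `probe` and `check_services` helpers, the example address, and the port choices (22 for SSH, 443 for the WebGUI) are hypothetical. A successful TCP connect only shows that the listener accepts connections, not that php-fpm behind nginx is still answering.

```python
import socket


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host, ports):
    """Map each service name to its reachability; ports is a dict of name -> port."""
    return {name: probe(host, port) for name, port in ports.items()}


# Example call (hypothetical address; adjust ports to your setup):
# check_services("192.0.2.1", {"ssh": 22, "webgui": 443})
```

Logging the result with a timestamp every few seconds would show which service stops answering first during a lock-up.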

Actions #3

Updated by Marco Manenti over 8 years ago

Markus Strangl wrote:

System info:
SuperMicro Intel Westmere rack boxes, 2 each in HA Cluster with CARP and pfSync, Intel x540 10G network cards (ix driver)

Me too!

The NIC hangs, while the system itself stays stable.

Do you see any errors from the Intel drivers in dmesg?

If yes, try adding legal.intel_iwi.license_ack="1" to /boot/loader.conf.
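
To check dmesg for driver complaints, a filter along these lines may help. It is only a sketch: it assumes the NICs show up as `ix0`, `ix1`, and so on (the ix driver named in the report), and the keyword list is a guess at what an error line might contain, not an official list of ix driver messages. The helper names `driver_lines` and `read_dmesg` are hypothetical.

```python
import re
import subprocess

# Assumed keywords; not an authoritative list of driver messages.
KEYWORDS = ("error", "timeout", "watchdog", "reset", "fail")


def driver_lines(text, driver="ix"):
    """Return lines that start with '<driver><unit>:' and contain a keyword."""
    prefix = re.compile(rf"^{re.escape(driver)}\d+:")
    return [
        line
        for line in text.splitlines()
        if prefix.match(line) and any(k in line.lower() for k in KEYWORDS)
    ]


def read_dmesg():
    """Run dmesg and return its output as text."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

Python may not be installed on the firewall itself; `driver_lines` works just as well on dmesg output saved to a file and copied elsewhere.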

Actions #4

Updated by Jan Jurkus over 8 years ago

Marco Manenti wrote:

if yes, try to add in /boot/loader.conf legal.intel_iwi.license_ack="1"

Uhm, I think that's only suitable for the iwi driver, which only supports wireless cards: https://www.freebsd.org/cgi/man.cgi?iwi

Actions #5

Updated by Chris Buechler over 8 years ago

  • Status changed from Feedback to Duplicate
  • Affected Version deleted (2.3)
  • Affected Architecture added
  • Affected Architecture deleted (amd64)

This is likely a duplicate of #6296. Some other things noted, like the GUI dying, are probably duplicates of other issues fixed in 2.3.1 or 2.3.1_1. It's certainly not an actionable bug report, and needs to start over on 2.3.1_1.

Markus: we're glad to look at one of your systems if you're still seeing issues with 2.3.1_1; let us know.
