Bug #8335: System hang with LACP downlink to UniFi switch - pfSense - pfSense bugtracker

Added by Mike Pastore over 7 years ago. Updated almost 6 years ago.

Status: New
Priority: Normal
Assignee:
Category: LAGG Interfaces
Target version:
Start date: 02/16/2018
Due date:
% Done:
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: 2.4.2_1
Affected Architecture: amd64

Description

I have an RCC-VE 2440 (2015) with igb1 and igb2 aggregated into lagg0 and connected to a UniFi switch. UniFi supports aggregated links but only via LACP, and it is not configurable. Per Ubiquiti support, it uses L2 data only in the hashing computation, strict mode should be enabled, and fast timeout should be disabled. I am using this aggregated link as a trunk, with a number of VLANs defined on lagg0 and assigned to different interfaces. lagg0 itself is assigned to LAN to catch untagged traffic.
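For reference, the LACP settings described above correspond roughly to the following FreeBSD `ifconfig` invocations. This is a minimal sketch, not the commands pfSense itself issues: the VLAN ID `10` is a hypothetical placeholder, and the `run` wrapper is an illustrative helper that only prints each command unless `APPLY=1` is set.

```sh
#!/bin/sh
# Sketch of the lagg0 setup described above (FreeBSD ifconfig syntax).
# Assumptions: VLAN ID 10 is a placeholder; pfSense normally applies
# these settings from its own configuration, not from a script.

# Print each command; execute it only when APPLY=1 (dry run by default).
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

configure_lagg() {
    run ifconfig igb1 up
    run ifconfig igb2 up
    run ifconfig lagg0 create
    # Aggregate igb1 and igb2 using LACP, the only mode UniFi supports.
    run ifconfig lagg0 laggproto lacp laggport igb1 laggport igb2
    # Hash on L2 data only, per Ubiquiti support.
    run ifconfig lagg0 lagghash l2
    # Strict mode enabled, fast timeout disabled.
    run ifconfig lagg0 lacp_strict
    run ifconfig lagg0 -lacp_fast_timeout
    # One tagged VLAN on top of the trunk (the ID is hypothetical).
    run ifconfig vlan10 create vlan 10 vlandev lagg0
    run ifconfig lagg0 up
}

configure_lagg
```

Running the script as-is only lists the commands, which makes it easy to compare them against the flags shown by `ifconfig lagg0` on the affected unit.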

In this configuration, pfSense hangs at least once every 48 hours. There doesn't seem to be a pattern to the hangs. I see nothing in the system logs when it happens. I'll typically have a `screen` attached to the serial console from another system and I see nothing there, either. A hang is defined as no network traffic passing (and the unit becoming unpingable from the LAN) and the serial console becoming unresponsive. Cycling the power brings it back up.
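Since nothing is logged on the router itself, one way to at least timestamp each hang is to watch the unit from another host on the LAN. The following is a minimal sketch under stated assumptions: the address `192.168.1.1`, the log path, and the 10-second interval are all hypothetical, and the loop only starts when `MONITOR=1` is set.

```sh
#!/bin/sh
# Sketch: record up/down transitions of the router as seen from a LAN host.
# ROUTER and LOG are hypothetical placeholders; adjust for the real network.
ROUTER="${ROUTER:-192.168.1.1}"
LOG="${LOG:-./router-hang.log}"

# Emit a timestamped line only when the state changes.
# $1 = previous state, $2 = current state
log_transition() {
    if [ "$1" != "$2" ]; then
        echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') router $2"
    fi
}

# One ICMP probe; prints "up" or "down".
# (Timeout flags differ between ping implementations, so none are used.)
probe() {
    if ping -c 1 "$ROUTER" >/dev/null 2>&1; then
        echo up
    else
        echo down
    fi
}

monitor() {
    prev=unknown
    while :; do
        cur=$(probe)
        log_transition "$prev" "$cur" >>"$LOG"
        prev=$cur
        sleep 10
    done
}

# Start the loop only when explicitly requested.
if [ "${MONITOR:-0}" = "1" ]; then
    monitor
fi
```

The resulting log gives hang times to within the polling interval, which can then be compared against traffic patterns or scheduled jobs.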

The only "solution" that I've found so far is to disaggregate the links and go back to a single downlink. I'm running pfSense CE 2.4.2_1 with CoreBoot flashed to V17. Here's a (hopefully comprehensive) list of everything else I tried to solve the problem (one at a time with reboots to apply as necessary):

- Reinstall pfSense
  - UFS on eMMC
  - ZFS on eMMC
  - ZFS on mSATA
- Force `ifconfig lagg0 lagghash l2` per Ubiquiti support
- Disable hardware checksum offloading
  - In GUI only
  - In GUI and `ifconfig <device> -vlanhwcsum -txcsum6`
- Disable TCP segmentation offloading
  - In GUI and set `net.inet.tcp.tso=0` tunable
  - In GUI, set tunable, and `ifconfig <device> -vlanhwtso`
- Set `kern.ipc.nmbclusters=1000000` tunable (this is set across all attempts)
- Add `hw.igb.num_queues=1` to loader.conf.local
- Add `hw.pci.enable_msix=0` to loader.conf.local
- Disable crypto (set "Cryptographic Hardware" to "none")
- Use RAM disk for /var and /tmp
- Put the router and the switch on a UPS
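The command-line parts of the list above can be summarized as a sketch showing where each setting lives. Two things here are assumptions: `<device>` is taken to mean the lagg members `igb1` and `igb2`, and the settings are grouped together only for illustration (they were actually tried one at a time). The `run` wrapper is an illustrative helper that prints commands unless `APPLY=1` is set.

```sh
#!/bin/sh
# Sketch: runtime vs. boot-time settings from the list above.
# Assumption: <device> refers to the lagg member ports igb1 and igb2.

# Print each command; execute it only when APPLY=1 (dry run by default).
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

# Settings that can be changed on a running system.
apply_runtime_settings() {
    for dev in igb1 igb2; do
        # Checksum offload flags named in the report.
        run ifconfig "$dev" -vlanhwcsum -txcsum6
        # TSO flag named in the report.
        run ifconfig "$dev" -vlanhwtso
    done
    run ifconfig lagg0 lagghash l2
    run sysctl net.inet.tcp.tso=0
    run sysctl kern.ipc.nmbclusters=1000000
}

# Boot-time tunables: these lines belong in /boot/loader.conf.local
# and take effect only after a reboot.
loader_conf_local() {
    cat <<'EOF'
hw.igb.num_queues=1
hw.pci.enable_msix=0
EOF
}

apply_runtime_settings
echo "# lines for /boot/loader.conf.local:"
loader_conf_local
```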

The following packages are installed:

Avahi
Netgate_Coreboot_Upgrade
Notes
nut
pfBlockerNG
Service_Watchdog
sudo
System_Patches

The following services are running:

avahi
dhcpd
dnsbl
dpinger
igmpproxy
miniupnpd
ntpd
radvd
sshd
syslogd
unbound

Updated by Jeff Wischkaemper about 7 years ago

I'm experiencing similar symptoms (pfSense hanging frequently), though with different hardware. My configuration hangs less frequently than yours (generally every 3-4 days, though it happened twice yesterday), but with identical symptoms: no WAN or LAN traffic, and no response to keyboard on the console. The entire system is unresponsive.

I'm running a Supermicro E300 setup with a Xeon D1518, but using only a single WAN/LAN setup. I have an HP unmanaged switch on the LAN side of the network, and have multiple (9) public IP aliases assigned to the WAN interface. Thinking the Supermicro might be the problem, I swapped the config to a RCC-VE 4860, which proceeded to also hang every couple of days. I've subsequently swapped back to the E300, which at least I can power cycle remotely.

Unfortunately, I don't see a lot of overlap between our services and packages - I don't have any of the packages you're using installed, and the only services we have in common are dhcpd, dpinger, ntpd, sshd, syslogd, and unbound (which I assume are fairly stable). I have both IPSec and OpenVPN running (about 100 clients connected in, full time), as well as FreeRADIUS 3. No other packages are installed.

I am running on a ZFS partition, 2.4.2_1. I also have igb interfaces. My configuration file was ported over from an APU2 originally installed as a 2.3.X NanoBSD install, if that matters. Originally I thought it might be something about going from a Nano->Full install, or possibly going from AMD->Intel hardware / NICs. The APU setup was reliable to a fault - the only time we ever had downtime was when installing an update. The new firewall was simply a full Backup / Restore from the old 2.3 setup. It has been unreliable from the moment we put it in (2 months now).

One other thing I have noticed, though I haven't been able to definitively confirm it: the hang seems to happen more frequently (or sooner) if I leave a connection to the pfSense WebGUI open.

All of that to say - I'm seeing something very similar in terms of behavior, but I have a very different hardware / pfSense configuration.

Are there any log files that persist after boot that we could look in to try to debug this problem?

Updated by Mike Pastore about 7 years ago

Jeff Wischkaemper wrote:

> I have an HP unmanaged switch on the LAN side of the network

Can you try a different switch?

Updated by Jeff Wischkaemper about 7 years ago

Mike Pastore wrote:

> Jeff Wischkaemper wrote:
>
> > I have an HP unmanaged switch on the LAN side of the network
>
> Can you try a different switch?

Unfortunately not. 1) I don't have one, 2) it's a production server and 3) the installation is actually in a different state from where I am, making any swap like that quite challenging.
