System hang with LACP downlink to UniFi switch
I have an RCC-VE 2440 (2015) with igb1 and igb2 aggregated into lagg0 and connected to a UniFi switch. UniFi supports aggregated links but only via LACP, and it is not configurable. Per Ubiquiti support, it uses L2 data only in the hashing computation, strict mode should be enabled, and fast timeout should be disabled. I am using this aggregated link as a trunk, with a number of VLANs defined on lagg0 and assigned to different interfaces. lagg0 itself is assigned to LAN to catch untagged traffic.
In this configuration, pfSense hangs at least once every 48 hours. There doesn't seem to be a pattern to the hangs. I see nothing in the system logs when it happens. I'll typically have a `screen` attached to the serial console from another system and I see nothing there, either. A hang is defined as no network traffic passing (and the unit becoming unpingable from the LAN) and the serial console becoming unresponsive. Cycling the power brings it back up.
The only "solution" that I've found so far is to disaggregate the links and go back to a single downlink. I'm running pfSense CE 2.4.2_1 with coreboot flashed to V17. Here's a (hopefully comprehensive) list of everything else I tried to solve the problem (one at a time, with reboots to apply as necessary):
- Reinstall pfSense
  - UFS on eMMC
  - ZFS on eMMC
  - ZFS on mSATA
- Force `ifconfig lagg0 lagghash l2` per Ubiquiti support
- Disable hardware checksum offloading
  - In GUI only
  - In GUI and `ifconfig <device> -vlanhwcsum -txcsum6`
- Disable TCP segmentation offloading
  - In GUI and set `net.inet.tcp.tso=0` tunable
  - In GUI, set tunable, and `ifconfig <device> -vlanhwtso`
- Set `kern.ipc.nmbclusters=1000000` tunable (this is set across all attempts)
- Add `hw.igb.num_queues=1` to loader.conf.local
- Add `hw.pci.enable_msix=0` to loader.conf.local
- Disable crypto (set "Cryptographic Hardware" to "none")
- Use RAM disk for /var and /tmp
- Put the router and the switch on a UPS
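For anyone trying to reproduce this, the command-line parts of the list above can be collected into a rough dry-run sketch. It only prints the commands. It assumes `<device>` means the lagg member ports (igb1 and igb2), which is my reading and not something confirmed above. The names `LAGG`, `MEMBERS`, and `emit_cmds` are made up for illustration, and the script applies everything together, whereas the attempts listed were made one at a time.

```shell
#!/bin/sh
# Dry run: print the commands instead of executing them, so this is safe
# to run anywhere. Review the output before running any of it on the firewall.

LAGG="lagg0"            # aggregate interface from the report
MEMBERS="igb1 igb2"     # assumption: <device> refers to the lagg member ports

emit_cmds() {
    # Hash on L2 headers only, per Ubiquiti support.
    echo "ifconfig ${LAGG} lagghash l2"
    for dev in ${MEMBERS}; do
        # Extra checksum offload flags cleared in addition to the GUI setting.
        echo "ifconfig ${dev} -vlanhwcsum -txcsum6"
        # Disable TSO for VLAN-tagged frames.
        echo "ifconfig ${dev} -vlanhwtso"
    done
    # Runtime sysctl for TCP segmentation offload.
    echo "sysctl net.inet.tcp.tso=0"
}

emit_cmds
```

The `loader.conf.local` entries (`hw.igb.num_queues=1`, `hw.pci.enable_msix=0`) are boot-time settings and are left out of the sketch because they cannot be applied with a runtime command.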
The following packages are installed:
The following services are running:
#1 Updated by Jeff Wischkaemper about 2 years ago
I'm experiencing similar symptoms (pfSense hanging frequently), though with different hardware. My configuration hangs less frequently than yours (generally every 3-4 days, though it happened twice yesterday), but with identical symptoms: no WAN or LAN traffic, and no response to keyboard on the console. The entire system is unresponsive.
I'm running a Supermicro E300 setup with a Xeon D1518, but using only a single WAN/LAN setup. I have an HP unmanaged switch on the LAN side of the network, and have multiple (9) public IP aliases assigned to the WAN interface. Thinking the Supermicro might be the problem, I swapped the config to an RCC-VE 4860, which also proceeded to hang every couple of days. I've since swapped back to the E300, which at least I can power cycle remotely.
Unfortunately, I don't see a lot of overlap in our services or packages - I don't have any of the packages you're using installed, and the only services we have in common are dhcpd, dpinger, ntp, sshd, syslogd, and unbound (which I assume are fairly stable). I have both IPsec and OpenVPN running (about 100 clients connected in, full time), as well as FreeRADIUS 3. No other packages are installed.
I am running on a ZFS partition, 2.4.2_1. I also have igb interfaces. My configuration file was ported over from an APU2 originally installed as a 2.3.X NanoBSD install, if that matters. Originally I thought it might be something about going from a Nano->Full install, or possibly going from AMD->Intel hardware / NICs. The APU setup was reliable to a fault - the only time we ever had downtime was when installing an update. The new firewall was simply a full Backup / Restore from the old 2.3 setup. It has been unreliable from the moment we put it in (2 months now).
One other thing I have noticed, though I haven't been able to definitively confirm it - I feel like the hang may happen more frequently (more quickly) if I leave a connection to the pfSense WebGUI open.
All of that to say - I'm seeing something very similar in terms of behavior, but I have a very different hardware / pfSense configuration.
Are there any log files that persist after boot that we could look in to try to debug this problem?
#3 Updated by Jeff Wischkaemper about 2 years ago
Mike Pastore wrote:
> Jeff Wischkaemper wrote:
>> I have an HP unmanaged switch on the LAN side of the network
>
> Can you try a different switch?
Unfortunately not. 1) I don't have one, 2) it's a production server and 3) the installation is actually in a different state from where I am, making any swap like that quite challenging.