Regression #11444
closed
SG-3100 doesn't pass traffic after upgrade to 21.02
Description
After upgrading an SG-3100 to pfSense Plus 21.02, NAT stopped working.
Test:
LAN PC (192.168.10.132):
mypc# ping sf.net
PING sf.net (216.105.38.13) 56(84) bytes of data.
pfSense states:
# pfctl -ss | grep 216.105.38.13
mvneta1 icmp 216.105.38.13:20459 <- 192.168.10.132:20459 0:0
mvneta2 icmp 192.168.21.100:24313 (192.168.10.132:20459) -> 216.105.38.13:24313 0:0
LAN side:
# tcpdump -qn -i mvneta1 host 216.105.38.13
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mvneta1, link-type EN10MB (Ethernet), capture size 262144 bytes
18:13:34.513867 IP 192.168.10.132 > 216.105.38.13: ICMP echo request, id 20459, seq 92, length 64
18:13:35.513810 IP 192.168.10.132 > 216.105.38.13: ICMP echo request, id 20459, seq 93, length 64
18:13:36.513679 IP 192.168.10.132 > 216.105.38.13: ICMP echo request, id 20459, seq 94, length 64
WAN side (192.168.21.100 - ISP gateway):
# tcpdump -qn -i mvneta2 host 216.105.38.13
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mvneta2, link-type EN10MB (Ethernet), capture size 262144 bytes
18:13:53.513009 IP 192.168.21.100 > 216.105.38.13: ICMP echo request, id 24313, seq 111, length 64
18:13:53.735893 IP 216.105.38.13 > 192.168.21.100: ICMP echo reply, id 24313, seq 111, length 64
18:13:54.513111 IP 192.168.21.100 > 216.105.38.13: ICMP echo request, id 24313, seq 112, length 64
18:13:54.713096 IP 216.105.38.13 > 192.168.21.100: ICMP echo reply, id 24313, seq 112, length 64
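The two captures can be compared mechanically: on the WAN side both echo requests and echo replies appear, while on the LAN side only requests are seen. A minimal sketch, assuming the `tcpdump -qn` output is piped in or saved as text; `count_icmp` is a hypothetical helper name, not part of tcpdump or pfSense.

```shell
# Hypothetical helper: read `tcpdump -qn` text output on stdin and
# print "<echo requests> <echo replies>" on one line.
count_icmp() {
    awk '
        /ICMP echo request/ { req++ }
        /ICMP echo reply/   { rep++ }
        END { printf "%d %d\n", req, rep }
    '
}
```

Possible usage, one run per interface: `tcpdump -qn -c 20 -i mvneta1 host 216.105.38.13 | count_icmp`. A reply count of zero on the LAN interface combined with a non-zero count on the WAN interface matches the symptom described here.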
from /tmp/rules.debug:
nat on $WAN1 inet from any to any -> 192.168.21.100/32 port 1024:65535
# pfctl -sn | grep 21.100
nat on mvneta2 inet all -> 192.168.21.100 port 1024:65535
strange issues on boot:
pid 401 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
e6000sw0port4: link state changed to UP
ovpnc1: link state changed to UP
e6000sw0port2: link state changed to UP
pid 358 (php-fpm), jid 0, uid 0: exited on signal 11 (core dumped)
LAN PC connected to e6000sw0port2 (untagged)
Updated by Jim Pingle almost 4 years ago
- Tracker changed from Bug to Regression
- Target version set to CE-Next
Updated by Viktor Gurov almost 4 years ago
- Subject changed from NAT not working on SG-3100 to SG-3100 doesn't pass traffic after upgrade to 21.02
After uninstalling the Snort and Suricata packages, everything works fine.
pfSense Plus 21.02 + pfBlockerNG-devel 3.0.0_10
Updated by Jim Pingle almost 4 years ago
If you can re-enable those and test again, monitor the CPU usage, CPU temp, and so on to see if they are unusually high before/during the crashes.
Also, don't attach them here, but look on the disk for core files and keep copies of them (e.g. find / -name "*.core"). They may or may not be present depending on the state of the system at the time of the crash.
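The search-and-keep step can be sketched as a small shell function. This is an illustration only: `collect_cores` and the destination directory are hypothetical names, and the function assumes a POSIX shell with `find` and `cp`.

```shell
# collect_cores SRC DEST: copy every *.core file found under SRC into DEST.
# Hypothetical helper; files with the same name overwrite each other in DEST.
collect_cores() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -xdev stays on one filesystem; DEST itself is pruned so that
    # already-copied files are not found and copied onto themselves.
    find "$src" -xdev -path "$dest" -prune -o \
        -type f -name '*.core' -exec cp -p {} "$dest"/ \;
}
```

Possible usage: `collect_cores / /root/cores`. Because the copies are flattened into one directory, two core files with the same name from different directories would collide; rename or copy them by hand if both are needed.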
Updated by Scott Long over 3 years ago
There is a fix that passes my testing here:
https://reviews.freebsd.org/D28821
The above patch is for FreeBSD HEAD but applies to 12.2-STABLE for pfSense. We're working on a merge and patch release now.
Updated by Scott Long over 3 years ago
- Project changed from pfSense to pfSense Plus
- Category changed from Rules / NAT to Rules / NAT
- Status changed from New to In Progress
- Target version changed from CE-Next to Plus-Next
- Affected Version deleted (2.5.0)
- Affected Plus Version set to 21.02
Updated by Daniel Gordon over 3 years ago
Scott Long, that tracks along the same lines with the issues I was having back in Sep 2020: https://forum.netgate.com/topic/155919/debug-kernel-build-with-witness
I was noticing a kernel memory leak that was overwriting the priority tracker for the pf_rules_lock, which caused the kernel to enter a deadlock, since it can't release the mutex (rmlock).
Updated by Jim Pingle over 3 years ago
- Status changed from In Progress to Resolved
- Target version changed from Plus-Next to 21.02-p1
Updated by Marco Goetze over 3 years ago
After the problem first occurred, I applied the quick fix of limiting the system to one CPU in loader.conf: hw.ncpu=1
I applied 21.02-p1 to the SG-3100 yesterday evening. After the post-update reboot, I removed the manually added hw.ncpu=1 from loader.conf. It took less than 30 minutes for the problem to come back on 21.02-p1: the box stopped passing traffic again.
Question: Was 21.02-p1 just a quick fix that adds a CPU limit to loader.conf, or was the membar fix already applied? The former would explain why the problem came back after removing hw.ncpu=1.
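For anyone toggling the hw.ncpu=1 workaround repeatedly while testing, the edit can be scripted. A minimal sketch, assuming a POSIX shell; `set_tunable` and `unset_tunable` are hypothetical helper names, and the file path in the usage note is an assumption (the comment above only says "loader.conf").

```shell
# set_tunable FILE KEY VALUE: ensure FILE has exactly one "KEY=VALUE" line.
set_tunable() {
    file=$1 key=$2 value=$3
    touch "$file" || return 1
    tmp=$(mktemp) || return 1
    # Drop any existing line that starts with the literal text "KEY=".
    awk -v k="$key=" 'index($0, k) != 1' "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# unset_tunable FILE KEY: remove every "KEY=..." line from FILE.
unset_tunable() {
    file=$1 key=$2
    [ -f "$file" ] || return 0
    tmp=$(mktemp) || return 1
    awk -v k="$key=" 'index($0, k) != 1' "$file" > "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Possible usage, assuming local overrides are kept in /boot/loader.conf.local: `set_tunable /boot/loader.conf.local hw.ncpu 1` to apply the workaround and `unset_tunable /boot/loader.conf.local hw.ncpu` to remove it. Loader tunables are only read at boot, so either change needs a reboot to take effect.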
Updated by Viktor Gurov over 3 years ago
same issue after upgrading to 21.02-p1:
pid 833 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
and pf doesn't pass traffic
but if I disable pfBlockerNG (3.0.0_10) and reboot it works again,
If I enable pfBlockerNG (python mode) it stops passing traffic and I see again:
pid 357 (php-fpm), jid 0, uid 0: exited on signal 11 (core dumped)
Updated by Marco Goetze over 3 years ago
What Viktor mentioned could be a reason. My tested and still failing SG-3100 also uses the pfBlockerNG-devel package.
I will look into it and test different scenarios to see whether this is the reason for the pf lockup.
Updated by Jim Pingle over 3 years ago
Marco Goetze wrote:
Question: Was 21.02-p1 just a quick fix that adds a CPU limit to loader.conf, or was the membar fix already applied? The former would explain why the problem came back after removing hw.ncpu=1.
No. It was a real fix in the kernel:
- https://reviews.freebsd.org/D28821
- https://www.netgate.com/blog/pfsense-obscure-bugs-and-code-wizards.html
You may be seeing a different issue or a variation.
Viktor Gurov wrote:
but if I disable pfBlockerNG (3.0.0_10) and reboot it works again,
If I enable pfBlockerNG (python mode) it stops passing traffic and I see again:
[...]
I don't think that's the same issue, but it bears investigating. I'd start a new issue with the details of that so it doesn't get mixed up with this. If research shows it's the same or similar, then the redundant one can be closed.
Updated by Marco Goetze over 3 years ago
Let me share some of my observations from the last 3 days.
- hw.ncpu unset, all non-default packages disabled = stable, running 16h without problems
- hw.ncpu unset, pfBlocker-dev and avahi enabled = crash after 1-6h, most frequently after a pfBlocker update run
- hw.ncpu=1, pfBlocker-dev and avahi enabled = stable for ~15h so far
When the lockup happens, I can only reach the firewall over the serial console; if I then issue "pfctl -d", access is restored.
Interestingly, my syslog seems messed up: there are a few entries with totally wrong timestamps, maybe because NTP was not synced. But there is no signal 11 in the system log, even when the lockup occurs.
For now I am keeping all external packages disabled.