Regression #11444

SG-3100 doesn't pass traffic after upgrade to 21.02

Added by Viktor Gurov 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Rules / NAT
Target version:
Start date: 02/18/2021
Due date:
% Done: 0%
Estimated time:
Release Notes: Default
Affected Plus Version: 21.02
Affected Architecture: SG-3100

Description

After upgrading an SG-3100 to pfSense Plus 21.02, NAT stopped working.

Test:

LAN PC (192.168.10.132):

mypc# ping sf.net
PING sf.net (216.105.38.13) 56(84) bytes of data.

pfSense states:

# pfctl -ss | grep 216.105.38.13
mvneta1 icmp 216.105.38.13:20459 <- 192.168.10.132:20459       0:0
mvneta2 icmp 192.168.21.100:24313 (192.168.10.132:20459) -> 216.105.38.13:24313       0:0

LAN side:

# tcpdump -qn -i mvneta1 host 216.105.38.13
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mvneta1, link-type EN10MB (Ethernet), capture size 262144 bytes
18:13:34.513867 IP 192.168.10.132 > 216.105.38.13: ICMP echo request, id 20459, seq 92, length 64
18:13:35.513810 IP 192.168.10.132 > 216.105.38.13: ICMP echo request, id 20459, seq 93, length 64
18:13:36.513679 IP 192.168.10.132 > 216.105.38.13: ICMP echo request, id 20459, seq 94, length 64

WAN side (192.168.21.100 - ISP gateway):

# tcpdump -qn -i mvneta2 host 216.105.38.13
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mvneta2, link-type EN10MB (Ethernet), capture size 262144 bytes
18:13:53.513009 IP 192.168.21.100 > 216.105.38.13: ICMP echo request, id 24313, seq 111, length 64
18:13:53.735893 IP 216.105.38.13 > 192.168.21.100: ICMP echo reply, id 24313, seq 111, length 64
18:13:54.513111 IP 192.168.21.100 > 216.105.38.13: ICMP echo request, id 24313, seq 112, length 64
18:13:54.713096 IP 216.105.38.13 > 192.168.21.100: ICMP echo reply, id 24313, seq 112, length 64
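
A quick way to compare the two captures is to count echo requests and replies per interface. The helper below is a minimal sketch and not part of the original report: `count_icmp` is a hypothetical name, and it assumes tcpdump's default one-line text output as shown above.

```shell
# count_icmp (hypothetical helper): read tcpdump text output on stdin and
# count ICMP echo requests and ICMP echo replies.
count_icmp() {
    awk '
        /ICMP echo request/ { req++ }
        /ICMP echo reply/   { rep++ }
        END { printf "requests=%d replies=%d\n", req, rep }
    '
}

# Example use on the firewall (-c 10 makes tcpdump stop after 10 packets):
#   tcpdump -qn -c 10 -i mvneta1 host 216.105.38.13 | count_icmp
#   tcpdump -qn -c 10 -i mvneta2 host 216.105.38.13 | count_icmp
```

Fed the capture lines above, this gives requests=3 replies=0 for mvneta1 and requests=2 replies=2 for mvneta2: the replies reach the WAN interface but never appear on the LAN side.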

From /tmp/rules.debug:

nat on $WAN1 inet from any to any -> 192.168.21.100/32 port 1024:65535

# pfctl -sn | grep 21.100
nat on mvneta2 inet all -> 192.168.21.100 port 1024:65535

Strange issues on boot:

pid 401 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
e6000sw0port4: link state changed to UP
ovpnc1: link state changed to UP
e6000sw0port2: link state changed to UP
pid 358 (php-fpm), jid 0, uid 0: exited on signal 11 (core dumped)

LAN PC connected to e6000sw0port2 (untagged)

OS-Message Buffer.txt (15.5 KB) Viktor Gurov, 02/18/2021 09:28 AM
Network-Switch Configuration.txt (701 Bytes) Viktor Gurov, 02/18/2021 09:29 AM

History

#1 Updated by Jim Pingle 3 months ago

  • Tracker changed from Bug to Regression
  • Target version set to CE-Next

#2 Updated by Viktor Gurov 3 months ago

Could be related: #11436, #11418

#3 Updated by Viktor Gurov 3 months ago

  • Subject changed from NAT not working on SG-3100 to SG-3100 doesn't pass traffic after upgrade to 21.02

After uninstalling the Snort and Suricata packages, everything works fine.
pfSense Plus 21.02 + pfBlockerNG-devel 3.0.0_10

#4 Updated by Jim Pingle 3 months ago

If you can re-enable those and test again, monitor the CPU usage, CPU temp, and so on to see if they are unusually high before/during the crashes.

Also, look on the disk for core files and keep copies of them (e.g. find / -name "*.core"), but don't attach them here. They may or may not be present, depending on the state of the system at the time of the crash.
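
One way to keep copies of any core files is sketched below. This is an illustration only: `collect_cores` is a hypothetical helper name, and the destination directory in the usage comment is an assumption.

```shell
# collect_cores (hypothetical helper): copy every *.core file found under
# SRC into DEST, preserving timestamps. Files already inside DEST are skipped.
# Note: core files that share a name overwrite each other in DEST.
collect_cores() {
    src="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    find "$src" -type f -name '*.core' ! -path "$dest/*" \
        -exec cp -p {} "$dest"/ \;
}

# Example use on the firewall, as root (destination path is an assumption):
#   collect_cores / /root/cores
```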

#5 Updated by Scott Long 3 months ago

There is a fix that passes my testing here:

https://reviews.freebsd.org/D28821

The above patch is for FreeBSD HEAD but applies to 12.2-STABLE for pfSense. We're working on a merge and patch release now.

#6 Updated by Scott Long 3 months ago

  • Project changed from pfSense to pfSense Plus
  • Category changed from Rules / NAT to Rules / NAT
  • Status changed from New to In Progress
  • Target version changed from CE-Next to Plus-Next
  • Affected Version deleted (2.5.0)
  • Affected Plus Version set to 21.02

#7 Updated by Daniel Gordon 3 months ago

Scott Long, that tracks along the same lines as the issues I was having back in Sep 2020: https://forum.netgate.com/topic/155919/debug-kernel-build-with-witness

I was noticing a kernel memory leak that was overwriting the priority tracker for the pf_rules_lock, which caused the kernel to deadlock because it couldn't release the mutex (rmlock).

#8 Updated by Jim Pingle 3 months ago

  • Status changed from In Progress to Resolved
  • Target version changed from Plus-Next to 21.02-p1

#9 Updated by Marco Goetze 3 months ago

After the problem first occurred, I applied the quick fix of limiting the system to one CPU in loader.conf: hw.ncpu=1

I applied 21.02-p1 to the SG-3100 yesterday evening. After the post-update reboot, I removed the manually added hw.ncpu=1 from loader.conf. It took less than 30 minutes for the problem to come back: 21.02-p1 was no longer passing traffic.

Question: Was 21.02-p1 just a quick fix adding a CPU limit to loader.conf, or was the membar fix already applied? The former would explain why the system ran into the problem again after hw.ncpu=1 was removed.
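
For anyone repeating this test, adding and removing the hw.ncpu=1 line can be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the file path in the usage comment assumes the tunable is kept in /boot/loader.conf.local rather than edited into loader.conf directly. A reboot is needed for either change to take effect.

```shell
# set_loader_tunable (hypothetical helper): append LINE to FILE unless an
# identical line is already present.
set_loader_tunable() {
    file="$1"
    line="$2"
    touch "$file" || return 1
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# remove_loader_tunable (hypothetical helper): delete every line in FILE
# that exactly matches LINE.
remove_loader_tunable() {
    file="$1"
    line="$2"
    [ -f "$file" ] || return 0
    # grep -v exits non-zero when no lines remain; that is not an error here.
    grep -vxF -- "$line" "$file" > "$file.tmp" || true
    mv "$file.tmp" "$file"
}

# Example use on the firewall, as root (file path is an assumption):
#   set_loader_tunable    /boot/loader.conf.local 'hw.ncpu=1'
#   remove_loader_tunable /boot/loader.conf.local 'hw.ncpu=1'
```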

#10 Updated by Viktor Gurov 3 months ago

Same issue after upgrading to 21.02-p1:

pid 833 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)

and pf doesn't pass traffic.

But if I disable pfBlockerNG (3.0.0_10) and reboot, it works again.

If I enable pfBlockerNG (python mode), it stops passing traffic and I again see:

pid 357 (php-fpm), jid 0, uid 0: exited on signal 11 (core dumped)

#11 Updated by Marco Goetze 3 months ago

What Viktor mentioned could be the cause. My tested and still-failing SG-3100 also uses the pfBlockerNG-devel package.
I will look into it and test different scenarios to see whether this is the reason for the pf lockup.

#12 Updated by Jim Pingle 3 months ago

Marco Goetze wrote:

Question: Was 21.02-p1 just a quick fix adding a CPU limit to loader.conf, or was the membar fix already applied? The former would explain why the system ran into the problem again after hw.ncpu=1 was removed.

No. It was a real fix in the kernel:

You may be seeing a different issue or a variation.

Viktor Gurov wrote:

But if I disable pfBlockerNG (3.0.0_10) and reboot, it works again.

If I enable pfBlockerNG (python mode), it stops passing traffic and I again see:
[...]

I don't think that's the same issue, but it bears investigating. I'd start a new issue with the details of that so it doesn't get mixed up with this. If research shows it's the same or similar, then the redundant one can be closed.

#13 Updated by Marco Goetze 3 months ago

Let me share some of my observations from the last 3 days.

  • hw.ncpu=unset, all non-default packages disabled = stable, running 16h without problems
  • hw.ncpu=unset, pfBlocker-dev and avahi enabled = crash after 1-6h, most frequently after a pfBlocker update run
  • hw.ncpu=1, pfBlocker-dev and avahi enabled = stable for ~15h so far

When the lockup happens, I can only reach the firewall through the serial console; once I issue "pfctl -d", access is restored.
Interestingly, my syslog seems messed up: there are a few entries with totally wrong timestamps, maybe because NTP was not synced. There is no signal 11 in the system log, though, even when the lockup occurs.

For now I am keeping all external packages disabled.
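
The "pfctl -d" check mentioned above can be paired with a status query from the serial console. The snippet is a minimal sketch: `pf_status` is a hypothetical helper that assumes the first line of `pfctl -si` output starts with "Status:". Disabling pf turns off all filtering and NAT, so this is only a diagnostic step.

```shell
# pf_status (hypothetical helper): read `pfctl -si` output on stdin and
# print the word after "Status:" (Enabled or Disabled).
pf_status() {
    awk '/^Status:/ { print $2; exit }'
}

# Example use on the firewall, as root:
#   pfctl -si | pf_status   # check whether pf is currently enabled
#   pfctl -d                # disable pf; access returned after this step
#   pfctl -e                # re-enable pf once the test is done
```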
