Regression #11444

closed

SG-3100 doesn't pass traffic after upgrade to 21.02

Added by Viktor Gurov almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Rules / NAT
Target version:
Start date: 02/18/2021
Due date:
% Done: 0%
Estimated time:
Release Notes:
Affected Plus Version: 21.02
Affected Architecture: SG-3100

Description

After upgrading an SG-3100 to pfSense Plus 21.02, NAT stopped working.

Test:

LAN PC (192.168.10.132):

mypc# ping sf.net
PING sf.net (216.105.38.13) 56(84) bytes of data.

pfSense states:

# pfctl -ss | grep 216.105.38.13
mvneta1 icmp 216.105.38.13:20459 <- 192.168.10.132:20459       0:0
mvneta2 icmp 192.168.21.100:24313 (192.168.10.132:20459) -> 216.105.38.13:24313       0:0
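The second state line shows the translation: the LAN source (with ICMP id 20459) is rewritten to the WAN address with id 24313. As a minimal sketch, translated states can be pulled out of `pfctl -ss` output with awk. The helper name `nat_states` is hypothetical, and the field positions are assumed from the two state lines shown above, not from a format specification.

```shell
# Hypothetical helper: list states that carry a NAT translation.
# Reads `pfctl -ss` style text on stdin. Assumes the layout seen above:
#   <iface> <proto> <translated> (<original>) -> <destination> <state>
nat_states() {
    awk '$4 ~ /^\(.*\)$/ {
        gsub(/[()]/, "", $4)   # strip the parentheses around the original source
        print $1 ": " $4 " translated to " $3 " for " $6
    }'
}

# Example usage on the firewall:
#   pfctl -ss | nat_states
```

Untranslated states, such as the `mvneta1` line above, have no parenthesized field and are skipped.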

LAN side:

# tcpdump -qn -i mvneta1 host 216.105.38.13
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mvneta1, link-type EN10MB (Ethernet), capture size 262144 bytes
18:13:34.513867 IP 192.168.10.132 > 216.105.38.13: ICMP echo request, id 20459, seq 92, length 64
18:13:35.513810 IP 192.168.10.132 > 216.105.38.13: ICMP echo request, id 20459, seq 93, length 64
18:13:36.513679 IP 192.168.10.132 > 216.105.38.13: ICMP echo request, id 20459, seq 94, length 64

WAN side (192.168.21.100 - ISP gateway):

# tcpdump -qn -i mvneta2 host 216.105.38.13
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mvneta2, link-type EN10MB (Ethernet), capture size 262144 bytes
18:13:53.513009 IP 192.168.21.100 > 216.105.38.13: ICMP echo request, id 24313, seq 111, length 64
18:13:53.735893 IP 216.105.38.13 > 192.168.21.100: ICMP echo reply, id 24313, seq 111, length 64
18:13:54.513111 IP 192.168.21.100 > 216.105.38.13: ICMP echo request, id 24313, seq 112, length 64
18:13:54.713096 IP 216.105.38.13 > 192.168.21.100: ICMP echo reply, id 24313, seq 112, length 64
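Taken together, the two captures show the asymmetry: echo replies arrive on mvneta2 but none are forwarded out mvneta1. A minimal sketch for summarizing this from saved capture text follows. It assumes each `tcpdump -qn` run was redirected to a file; the function name `icmp_summary` and the file names are hypothetical.

```shell
# Hypothetical helper: count ICMP echo requests and replies in saved
# `tcpdump -qn` text output. The argument is the path of the saved capture.
icmp_summary() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    requests=$(grep -c 'ICMP echo request' "$1" || true)
    replies=$(grep -c 'ICMP echo reply' "$1" || true)
    echo "requests=$requests replies=$replies"
}

# Example usage (file names are placeholders):
#   icmp_summary lan.txt
#   icmp_summary wan.txt
```

For the excerpts above this would report three requests and no replies on the LAN side, against two requests and two replies on the WAN side.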

from /tmp/rules.debug:

nat on $WAN1 inet from any to any -> 192.168.21.100/32 port 1024:65535

# pfctl -sn | grep 21.100
nat on mvneta2 inet all -> 192.168.21.100 port 1024:65535

strange issues on boot:

pid 401 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)
e6000sw0port4: link state changed to UP
ovpnc1: link state changed to UP
e6000sw0port2: link state changed to UP
pid 358 (php-fpm), jid 0, uid 0: exited on signal 11 (core dumped)

LAN PC connected to e6000sw0port2 (untagged)


Files

OS-Message Buffer.txt (15.5 KB) Viktor Gurov, 02/18/2021 09:28 AM
Network-Switch Configuration.txt (701 Bytes) Viktor Gurov, 02/18/2021 09:29 AM
Actions #1

Updated by Jim Pingle almost 4 years ago

  • Tracker changed from Bug to Regression
  • Target version set to CE-Next
Actions #2

Updated by Viktor Gurov almost 4 years ago

could be related: #11436 #11418

Actions #3

Updated by Viktor Gurov almost 4 years ago

  • Subject changed from NAT not working on SG-3100 to SG-3100 doesn't pass traffic after upgrade to 21.02

after uninstalling Snort and Suricata packages everything works fine
pfSense Plus 21.02 + pfBlockerNG-devel 3.0.0_10

Actions #4

Updated by Jim Pingle almost 4 years ago

If you can re-enable those and test again, monitor the CPU usage, CPU temp, and so on to see if they are unusually high before/during the crashes.

Also, don't attach them here but look on the disk for core files and keep copies of them (e.g. find / -name "*.core"), they may or may not be present depending on the state of the system at the time of the crash.
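One possible way to keep copies of any core files, as a minimal sketch: the helper name `collect_cores` and the destination directory are hypothetical, and only `find`, `mkdir`, and `cp` are assumed to be available.

```shell
# Hypothetical helper: copy every "*.core" file found under a search root
# into a destination directory, preserving timestamps.
# Usage: collect_cores <search-root> <destination-dir>
collect_cores() {
    root=$1
    dest=$2
    mkdir -p "$dest"
    # Skip the destination itself so earlier copies are not copied again.
    find "$root" -path "$dest" -prune -o \
        -type f -name '*.core' -exec cp -p {} "$dest"/ \;
}

# Example usage (destination path is a placeholder):
#   collect_cores / /root/cores
```

Core files that share a name, for example two `php-fpm.core` files in different directories, would overwrite each other in the destination, so it may be worth checking the `find` output first.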

Actions #5

Updated by Scott Long almost 4 years ago

There is a fix that passes my testing here:

https://reviews.freebsd.org/D28821

The above patch is for FreeBSD HEAD but applies to 12.2-STABLE for pfSense. We're working on a merge and patch release now.

Actions #6

Updated by Scott Long almost 4 years ago

  • Project changed from pfSense to pfSense Plus
  • Category changed from Rules / NAT to Rules / NAT
  • Status changed from New to In Progress
  • Target version changed from CE-Next to Plus-Next
  • Affected Version deleted (2.5.0)
  • Affected Plus Version set to 21.02
Actions #7

Updated by Daniel Gordon almost 4 years ago

Scott Long, that tracks along the same lines as the issues I was having back in Sep 2020: https://forum.netgate.com/topic/155919/debug-kernel-build-with-witness

I was seeing a kernel memory leak that overwrote the priority tracker for the pf_rules_lock, which caused the kernel to deadlock because it could not release the mutex (rmlock).

Actions #8

Updated by Jim Pingle almost 4 years ago

  • Status changed from In Progress to Resolved
  • Target version changed from Plus-Next to 21.02-p1
Actions #9

Updated by Marco Goetze almost 4 years ago

After the problem first occurred, I applied the quick fix of limiting the system to one CPU by setting hw.ncpu=1 in loader.conf.

Yesterday evening I applied 21.02-p1 to the SG-3100, and after the update's reboot I removed the manually added hw.ncpu=1 from loader.conf. It took less than 30 minutes for the problem to return on 21.02-p1: the firewall stopped passing traffic again.
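When comparing runs with and without the workaround, it helps to confirm whether the CPU limit is still configured. A minimal sketch, assuming the tunable was written as a plain `hw.ncpu=...` line; the helper name `show_ncpu_limit` is hypothetical.

```shell
# Hypothetical helper: print any hw.ncpu assignment in a loader
# configuration file. Returns non-zero when no such line is present.
show_ncpu_limit() {
    grep -E '^[[:space:]]*hw\.ncpu[[:space:]]*=' "$1"
}

# Example usage:
#   show_ncpu_limit /boot/loader.conf || echo "no CPU limit configured"
```

Loader tunables are read at boot, so removing the line only takes effect after the next reboot.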

Question: Was 21.02-p1 just a quick fix adding a CPU limit to loader.conf, or was the membar fix already applied? The former would explain why the problem came back after I removed hw.ncpu=1.

Actions #10

Updated by Viktor Gurov almost 4 years ago

same issue after upgrading to 21.02-p1:

pid 833 (php-cgi), jid 0, uid 0: exited on signal 11 (core dumped)

and pf doesn't pass traffic

but if I disable pfBlockerNG (3.0.0_10) and reboot, it works again.

If I enable pfBlockerNG (python mode), it stops passing traffic and I again see:

pid 357 (php-fpm), jid 0, uid 0: exited on signal 11 (core dumped)

Actions #11

Updated by Marco Goetze almost 4 years ago

What Viktor mentioned could be a reason. My tested and still failing SG-3100 also uses the pfBlockerNG-devel package.
I will look into it and test different scenarios to see whether this is the reason for the pf lockup.

Actions #12

Updated by Jim Pingle almost 4 years ago

Marco Goetze wrote:

Question: Was 21.02-p1 just a quick fix adding a CPU limit to loader.conf, or was the membar fix already applied? The former would explain why the problem came back after I removed hw.ncpu=1.

No. It was a real fix in the kernel:

You may be seeing a different issue or a variation.

Viktor Gurov wrote:

but if I disable pfBlockerNG (3.0.0_10) and reboot, it works again.

If I enable pfBlockerNG (python mode), it stops passing traffic and I again see:
[...]

I don't think that's the same issue, but it bears investigating. I'd start a new issue with the details of that so it doesn't get mixed up with this. If research shows it's the same or similar, then the redundant one can be closed.

Actions #13

Updated by Marco Goetze almost 4 years ago

Let me share some of my observations from the last 3 days.

  • hw.ncpu unset, all non-default packages disabled = stable, running 16h without problems
  • hw.ncpu unset, pfBlockerNG-devel and avahi enabled = crash after 1-6h, most frequently after a pfBlockerNG update run
  • hw.ncpu=1, pfBlockerNG-devel and avahi enabled = stable so far for ~15h

When the lockup happens, I can only reach the firewall through the serial console; if I then issue "pfctl -d", access is restored.
An interesting thing is that my syslog seems messed up: there are a few entries with totally wrong timestamps, maybe because NTP was not synced. But there is no signal 11 in the system log, even when the lockup occurs.

For now I am keeping all external packages disabled.
