Regression #15391
Status: open
Prior bug #9459 in 2.7.2: pfr_update_stats: assertion failed.
Description
'pfr_update_stats: assertion failed" errors are surfacing in a voluminous quantity in the system log. NAT type = "Pure NAT". Logs messages are coming in throughout the day in batches, average per day is around 20-25K lines.
Apr 9 04:00:17 pfsense kernel: pfr_update_stats: assertion failed.
Apr 9 04:00:17 pfsense kernel: pfr_update_stats: assertion failed.
Apr 9 04:00:17 pfsense kernel: pfr_update_stats: assertion failed.
Apr 9 04:00:17 pfsense kernel: pfr_update_stats: assertion failed.
[... the same line repeats throughout the batch, all within the same second ...]
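For a rough way to gauge the daily volume from the firewall shell (assuming the stock pfSense system log location):

    grep -c 'pfr_update_stats: assertion failed' /var/log/system.log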
Given that each occurrence arrives in a "batch", it is not clear where the messages originate (the batch size does not seem to correlate with anything readily visible).
It may be useful to have an Advanced -> Firewall & NAT option to change pf's "set debug" level from "urgent" to "none" (perhaps even "misc"/"loud" for testing) to reduce the sheer volume of log messages generated while this condition exists. Suggest that this flag be reset to "urgent" whenever an OS upgrade or any system patch (via System -> Patches) is applied, so the setting does not become permanent; the notion being that "urgent" is the default for a reason.
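For reference, a minimal sketch of changing pf's debug level at runtime with pfctl (the classic level names none/urgent/misc/loud are assumed to still be accepted by the installed pfctl; this is a runtime change, not a persistent GUI setting):

    pfctl -x none                   # silence pf debug messages such as pfr_update_stats
    pfctl -s info | grep -i debug   # confirm the currently active debug level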
Updated by A S 8 months ago
This was on a direct install (2.7.2), with no backup restored. I recall this being present in an earlier version. Nothing comes to mind that would increase verbosity; is there anything specific you would like checked for validation? (Most "default" logging, such as "Log firewall default blocks", is disabled to minimize log volume.)
Updated by A S 7 months ago
The configuration in which the issue occurs (a sketch of the tunable follows this list):
- set net.bpf.zerocopy_enable=1
- on the external interface configure traffic shaping (HFSC in this case) for outbound
- configure inline IPS on "some" (not all) of the inside interfaces
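For reference, a sketch of that tunable as it is typically set on pfSense; whether it was set as a System Tunable or in /boot/loader.conf.local here is an assumption:

    sysctl net.bpf.zerocopy_enable     # check the current value
    # persist via System -> Advanced -> System Tunables, or in /boot/loader.conf.local:
    net.bpf.zerocopy_enable=1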
Inside interfaces with an active [inline] IPS (Suricata in this case) operate correctly and without issue. Inside interfaces without an [inline] IPS exhibit what appears to be random, irregular corruption that prevents access to the destination website [on the Internet]. Only a handful of websites (<5) are affected, but for those the failure is consistent; all other website access works without issue.

Whatever "corruption" is happening results in the payload being larger, not truncated, which suggests the issue originated in an earlier packet and caused the client to respond differently than it should have (commonly triggering the client to attempt PSK). The extensions part of the payload was ~500-600 bytes when accessing working websites on the Internet vs. >800 bytes when trying to access the websites that fail. Oddly enough, all other UDP/TCP/ICMP (including other SSL/TLS) traffic was not impacted in any fashion. The volume of "drops" on the queue status page was not out of line given the overall traffic volume, which helped mask the issue.
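A couple of hedged examples of the checks described above (the interface name and exact filters are illustrative, not the commands actually used):

    pfctl -v -s queue                 # per-queue statistics, including dropped packets, for the HFSC queues
    tshark -i igb1 -Y "tls.handshake.type == 1" -T fields -e ip.dst -e frame.len
                                      # compare ClientHello sizes toward working vs. failing sites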
To re-create (see the route-switching sketch after this list):
- Test device connected to the following VLANs
- VLANA: goes to a firewall interface with IPS in inline mode (NAT'd for external)
- VLANB: goes to a firewall interface without an IPS running on the interface (NAT'd for external)
- VLANC: goes to a firewall interface with IPS in inline mode (no NAT, public IPs)
- when default route is via VLANA - all websites work without issue
- when default route is switched to VLANB - a very small percentage of websites will fail (when failure occurs, it's specific to that website - other websites continue to work without issue)
- when default route is switched to VLANC - all websites work without issue
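A minimal sketch of the route switching on the test device (gateway addresses are placeholders; commands assume a FreeBSD-style host, while on Linux the equivalent is "ip route replace default via <gw>"):

    route change default 192.0.2.1      # VLANA gateway - all sites work
    route change default 192.0.2.65     # VLANB gateway - a handful of sites consistently fail
    route change default 192.0.2.129    # VLANC gateway - all sites work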
One just needs to try to access one of the websites or services where this issue occurs; I can provide a couple of examples via PM, not publicly. Since no other network/IPs already existed to test the "without an IPS, no NAT, public IPs" scenario, I cannot comment on whether it would also fail; the expectation is that it would, short of actual testing.
Solution: enable IPS on the inside interface(s) that currently lack an inline IPS.
Suspicion is that the pfr_update_stats message(s) were acting like the proverbial canary in the coal mine: something was being corrupted in transit. Once an IPS was enabled, it resolved both the pfr_update_stats messages and the seemingly irregular breakage of access to what appeared to be an extremely small percentage of (irregularly used) sites/services. The issue appears to stem from BPF zerocopy and traffic shaping being in play where an inline IPS is not present.
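If one wanted to test that suspicion directly, a hedged sketch (assumes the tunable can simply be turned back off; existing BPF consumers may need a restart or reboot to pick up the change):

    sysctl net.bpf.zerocopy_enable=0    # then re-test the failing sites from a host behind a non-IPS interface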
For clarity, I have seen the 'pfr_update_stats' issue in other environments where traffic shaping is not in play, so this is not the only scenario that triggers that syslog message.