Bug #11466

closed

PHP exits with signal 11 on SG-3100 when calling PCRE functions

Added by Marcos Mendoza 8 months ago. Updated about 1 month ago.

Status: Closed
Priority: Normal
Assignee:
Category: PHP Interpreter
Target version:
Start date: 02/19/2021
Due date:
% Done: 0%
Estimated time:
Release Notes: Default
Affected Plus Version: 21.02
Affected Architecture: SG-3100

Description

After installing Snort and starting the service on an interface, Snort fails to start and the following is reported in the system logs:

Feb 18 19:21:06 hostname snort[5625]: Commencing packet processing (pid=5625)
Feb 18 19:21:13 hostname kernel: pid 5625 (snort), jid 0, uid 0: exited on signal 10

On another SG-3100, when trying to replicate the issue, it shows:

Feb 20 00:30:30     php     64883     [Snort] Updating rules configuration for: WAN ...
Feb 20 00:30:30     php     64883     [Snort] Enabling any flowbit-required rules for: WAN...
Feb 20 00:30:30     php     64883     [Snort] WARNING: Flowbit resolution not done - no rules in /usr/local/etc/snort/rules/ ...
Feb 20 00:30:30     php     64883     [Snort] Building new sid-msg.map file for WAN...
Feb 20 00:30:32     kernel         pid 64883 (php), jid 0, uid 0: exited on signal 11 (core dumped)

Further discussion here:
https://forum.netgate.com/topic/161050/snort-won-t-start-after-upgrade-to-21-02-on-sg-3100/5

This is possibly related to #11444
https://redmine.pfsense.org/issues/11444


Files

pfsense_bug_example.xml (27.3 KB), Arthur Wiebe, 05/19/2021 02:58 PM
patch-dynamic-pcre-recursion-limit.diff (932 Bytes): Lower PCRE recursion limit based on stack size (doesn't prevent the crash -- but may aid in debugging), Jim Pingle, 06/03/2021 09:15 AM
patch-disable-pcrejit-arm.diff (493 Bytes): Disable PCRE JIT compiler on 32-bit ARM platforms (works around the PHP crash), Jim Pingle, 06/03/2021 09:34 AM

Related issues

Related to Todo #12004: Disable PCRE JIT to work around PHP PCRE crashes on multi-core 32-bit ARM systems (Resolved, Jim Pingle, 06/07/2021)

Actions #1

Updated by Michael Spears 8 months ago

Marcos Mendoza wrote:

After installing Snort and starting the service on an interface, Snort fails to start and the following is reported in the system logs:
[...]

On another SG-3100, when trying to replicate the issue, it shows:
[...]

Further discussion here:
https://forum.netgate.com/topic/161050/snort-won-t-start-after-upgrade-to-21-02-on-sg-3100/5

This is possibly related to #11444
https://redmine.pfsense.org/issues/11444

I was able to reproduce on a 3100.

Actions #2

Updated by Bill Meeks 8 months ago

The Signal 10 error occurs when an executable attempts to access a memory address on a non-word aligned boundary in ARM hardware. This issue has occurred previously in Netgate's ARM-based appliances (specifically the SG-3100 and SG-1000). The previous fix was to patch the configure script for the Snort binary in the ARM ports tree to always enable a debug build. This suppresses optimization by the llvm compiler. The problem lies in the compiler's optimization choices: it picks "register load from memory" instructions that do not provide auto-fixup of non-aligned access. Thus the Signal 10 error is thrown and the running process (Snort) is aborted.

Check the patch files included for the Snort binary in the ARM Ports tree and verify this file is present: patch-pfSense-ARM317.diff

Actions #3

Updated by Scott Long 8 months ago

I don't think that this is related to https://redmine.pfsense.org/issues/11444.

Actions #4

Updated by Scott Long 8 months ago

  • Target version set to 2.6.0
Actions #5

Updated by Bill Meeks 8 months ago

Scott Long wrote:

I don't think that this is related to https://redmine.pfsense.org/issues/11444.

I agree. The Signal 10 problem is not related to Issue #11444.

The Snort binary's C source code is littered with pointer casts, some of which lead to non-aligned memory access. Finding and fixing all of these is a tall order. As I mentioned in a reply to a thread on the pfSense Forum, there is not much appetite upstream to invest the necessary time and energy to find and fix the wayward pointer casts. This is mostly due to the fact that Snort use on ARM hardware is rare. As I also mentioned in the forum thread, Intel CPUs will auto-fixup the non-aligned access performed by the wayward pointer casting.

I debugged this issue rather extensively a couple of years back with an SG-3100 and identified the exact two ARM binary instructions that cause the problem. They are LDM and STM. Their auto-fixup equivalents (albeit a little slower to execute because of the fixup) are LDR and STR. When the llvm compiler is in optimizing mode, it will choose to use the LDM/STM instructions for speed. However, in some places in the Snort binary C source code, that leads to non-aligned access violations. Where this happens can appear to be random because it takes a particular sequence of data to trigger the action that produces the non-aligned access. When optimization is turned off, the compiler chooses the LDR/STR instructions and thus the same errant C source code will now work. It still produces non-aligned accesses, but instead of triggering the interrupt to halt execution, the CPU will fix up the access by converting it into a series of single-byte accesses that work on non-word boundary addresses.

It's worth noting that Snort is not the only legacy C source code binary that suffers from this problem on ARM hardware. Because Intel hardware has historically always just "fixed it for you", the bad pointer casting has continued to exist. It really would be nice if the llvm compiler had a switch to make sure the LDM/STM instructions could be excluded even when optimizing ARM CPU binary code. From my research into ARM architecture, these are the only two instructions that will never auto-fixup a non-aligned memory access.

Finally, if memory serves me correctly, there is a control register in the ARM CPU that has a bit flag for turning auto-fixup of non-aligned access "on" and "off". At the time I was troubleshooting, the state was toggled to "on", so auto-fixup was enabled. But even with that flag set, those two LDM/STM instructions still will NOT perform auto-fixup by design. But all of the other instructions will perform auto-fixup with that flag set.
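
To make the "series of single-byte accesses" idea concrete, here is a minimal sketch in PHP. It is only a model of the fixup described above, not ARM code and not anything taken from Snort; the helper names and the sample buffer are made up for illustration, and little-endian byte order is assumed.

```php
<?php
// Illustration only: model an unaligned-access "fixup" in PHP.
// A 32-bit little-endian load from an offset that is not a multiple of 4
// is carried out as four single-byte loads that are then recombined.
// Both helper names below are hypothetical.

function is_word_aligned(int $offset): bool
{
    // A 32-bit word access is aligned when the offset is a multiple of 4.
    return ($offset % 4) === 0;
}

function load_u32_bytewise(string $buf, int $offset): int
{
    // Four independent byte reads, which work at any offset.
    return ord($buf[$offset])
        | (ord($buf[$offset + 1]) << 8)
        | (ord($buf[$offset + 2]) << 16)
        | (ord($buf[$offset + 3]) << 24);
}

$buf = "\x00\x11\x22\x33\x44\x55\x66\x77";
$offset = 1; // deliberately not word aligned

var_dump(is_word_aligned($offset));
printf("0x%08x\n", load_u32_bytewise($buf, $offset));
```

The byte-wise path gives the same value that `unpack('V', substr($buf, $offset, 4))` would return. An instruction without fixup has no such fallback, so the process is aborted instead.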

Actions #6

Updated by Marcos Mendoza 8 months ago

The ARM patch for snort is still there:
https://github.com/pfsense/FreeBSD-ports/blob/devel/security/snort/files/patch-pfSense-ARM317.diff

I tried it on another SG-3100 and got:

Feb 22 20:01:19     php-fpm     96482     [Snort] Updating rules configuration for: ...
Feb 22 20:01:20     php-fpm     96482     [Snort] Enabling any flowbit-required rules for: ...
Feb 22 20:01:20     php-fpm     96482     [Snort] Enabling any flowbit-required rules for: ...
Feb 22 20:01:20     php-fpm     96482     [Snort] Building new sid-msg.map file for ...
Feb 22 20:04:17     nginx         2021/02/22 20:04:17 [error] 69690#100091: *842 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 192.168.137.1, server: , request: "POST /snort/snort_rulesets.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "192.168.137.141", referrer: "https://192.168.137.141/snort/snort_rulesets.php?id=0" 
Feb 22 20:04:25     kernel         pid 96482 (php-fpm), jid 0, uid 0: exited on signal 11 (core dumped)

Given the info so far and some further reproducible testing, the signal 11 exit is a different issue and hence warrants a different bug report. I have not been able to reproduce the signal 10 exit yet. I will update the ticket accordingly.

Actions #7

Updated by Marcos Mendoza 8 months ago

  • Subject changed from Snort exit with sig 10 and sig 11 on SG-3100 to Snort exit with sig 10 on SG-3100
Actions #8

Updated by Bill Meeks 8 months ago

Marcos Mendoza wrote:

The ARM patch for snort is still there:
https://github.com/pfsense/FreeBSD-ports/blob/devel/security/snort/files/patch-pfSense-ARM317.diff

I tried it on another SG-3100 and got:
[...]

Given the info so far and some further reproducible testing, the signal 11 exit is a different issue and hence warrants a different bug report. I have not been able to reproduce the signal 10 exit yet. I will update the ticket accordingly.

The logged error messages are quite unusual. There should be an interface name shown after the colon in each line. Was the interface name sanitized before posting the log snippet? The normal log entry will say something like this (assuming it was the LAN interface):

Feb 22 20:01:19     php-fpm     96482     [Snort] Updating rules configuration for: LAN(em1) ...

If the interface name was not scrubbed from the log entry before posting, that would be something to investigate (why it's missing) and might be a clue to the underlying issue. Also, unless I am misunderstanding the log sequence, it appears that php-fpm is what died with Signal 11, not the Snort binary.

Actions #9

Updated by Marcos Mendoza 8 months ago

They were not scrubbed. Here are the steps to reproduce it (I was not able to reproduce it on an x86 system).

Only Snort installed:
  1. Enable ET Open rules
  2. Force rules update
  3. Go to "Services / Snort / Interfaces"
  4. Click "Add" then switch to "WAN Categories" without saving
  5. Tab is now named "None Categories"
  6. Select a category then click "Save"
Result:
  • GUI times out with HTTP 502 error; must refresh
  • Crash seen on system log "pid 19650 (php), jid 0, uid 0: exited on signal 11 (core dumped) "

Only Snort installed:
  1. Enable ET Open rules
  2. Force rules update
  3. Go to "Services / Snort / Interfaces"
  4. Click "Add", WAN interface is selected, click Save
  5. Go to "WAN Categories"
  6. Select a category then click "Save"
Result:
  • GUI responds correctly
  • No crash in system logs.
  • If the start button is manually clicked on the "Snort / Interfaces" tab, same signal 11 crash appears.

And here's some weird behavior I noticed that maybe you, as the dev, can explain:

Suricata AND Snort installed (same steps as above):
  1. Enable ET Open rules
  2. Force rules update
  3. Go to "Services / Snort / Interfaces"
  4. Click "Add" then switch to "WAN Categories" without saving
  5. Tab is now named "None Categories"
  6. Select a category then click "Save"
Result:
  • GUI responds correctly
  • No crash in system logs.
  • If the start button is manually clicked on the "Snort / Interfaces" tab, same signal 11 crash appears.

Lastly, I re-did the same tests (possibly with some variation in order) over a dozen times. I ran into some behavior that I have not been able to reproduce consistently:
  • Suricata experienced the same signal 11 crashes following similar steps. Someone else managed to reproduce it as well.
  • Snort started successfully
  • Snort crashed with signal 10:
    Feb 23 00:40:20     php-fpm     15106     Starting Snort on WAN(mvneta2) per user request...
    Feb 23 00:40:20     php     21226     [Snort] Updating rules configuration for: WAN ...
    Feb 23 00:40:22     php     21226     [Snort] Enabling any flowbit-required rules for: WAN...
    Feb 23 00:40:22     php     21226     [Snort] Enabling any flowbit-required rules for: WAN...
    Feb 23 00:40:22     php     21226     [Snort] Building new sid-msg.map file for WAN...
    Feb 23 00:40:22     php     21226     [Snort] Snort START for WAN(mvneta2)...
    Feb 23 00:40:23     kernel         mvneta2: promiscuous mode enabled
    Feb 23 00:40:25     kernel         pid 25778 (snort), jid 0, uid 0: exited on signal 10
    Feb 23 00:40:25     kernel         mvneta2: promiscuous mode disabled 
    

Hopefully this is enough to give a clue as to what's going on.

Actions #10

Updated by Bill Meeks 8 months ago

So to make sure I understand, this only happens on an SG-3100 and you can't reproduce on x86 hardware.

The first test case (adding an interface, NOT saving the change, and then clicking the CATEGORIES tab, or probably any other tab), which results in a GUI timeout and a Signal 11 crash in php-fpm, is likely due to an unexpected condition. All of the tabs (except INTERFACE SETTINGS, when adding a new interface) expect a valid interface to be passed to them either via query string (via $_GET) or as a $_POST parameter. So the fix there is likely just additional validation to prevent a user from doing that sequence of steps as it is an undefined scenario. There would never be a reason to attempt to view CATEGORIES or any other Snort parameters for a non-existent interface (meaning one that was not saved).
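
As a rough sketch of what such validation could look like: the `id` parameter name comes from the `snort_rulesets.php?id=0` URL in the log above, but the function name and the shape of the saved-interface list are assumptions made for illustration and are not taken from the Snort package code.

```php
<?php
// Hypothetical helper: return the validated interface index, or null when
// the request does not refer to an interface that has actually been saved.
function get_valid_interface_id(array $get, array $post, array $saved): ?int
{
    // Prefer the POST value, fall back to the query string.
    $raw = $post['id'] ?? $get['id'] ?? null;

    // Accept only a plain non-negative integer (as int or digit string).
    if (is_int($raw)) {
        $id = $raw;
    } elseif (is_string($raw) && ctype_digit($raw)) {
        $id = (int) $raw;
    } else {
        return null;
    }

    // The index must exist in the list of saved interfaces.
    return array_key_exists($id, $saved) ? $id : null;
}

// Assumed example data: one saved interface at index 0.
$saved = [0 => 'WAN'];

var_dump(get_valid_interface_id(['id' => '0'], [], $saved)); // saved interface
var_dump(get_valid_interface_id([], [], $saved));            // no id passed
var_dump(get_valid_interface_id(['id' => '5'], [], $saved)); // never saved
```

A tab that receives null could redirect back to the interface list instead of processing rules for an interface that does not exist.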

The second case, adding an interface and saving it before leaving the INTERFACE SETTINGS tab, is stranger. It only triggers the Signal 11 from php-fpm when attempting a manual start. Don't really know what's happening there. The manual start icon, when clicked, launches a background process to start Snort on the interface and then uses a jQuery postback loop to monitor the status of the startup. When it detects the correct PID file in /var/run, then it updates the icon from "starting" to "running". So apparently the above is not happening as designed on the SG-3100 for some reason.
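
The PID-file check at the heart of that status loop can be sketched roughly as follows. This is a simplified illustration, not the package's actual code: the function name and the PID file path are hypothetical.

```php
<?php
// Hypothetical status check: report "running" once a PID file containing a
// numeric PID exists, and "starting" until then.
function startup_status(string $pid_file): string
{
    if (!is_file($pid_file)) {
        return 'starting';
    }
    $pid = trim((string) file_get_contents($pid_file));
    return ctype_digit($pid) ? 'running' : 'starting';
}

// The path below is an assumed example, not the real Snort PID file name.
echo startup_status('/var/run/snort_example.pid'), "\n";
```

A check like this only reads a file, so a php-fpm crash during a manual start presumably comes from the work done in the background start process rather than from the polling itself.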

Your final case with both Snort and Suricata installed is very strange. Are you 100% positive of those steps and that result? I ask because there is zero interaction between Snort and Suricata in the GUI. All of their files are in separate sub-directories. The only thing they share is access to the snort2c pf table when Legacy Mode Blocking is used.

Can you give the version number of the Snort package you are using on the SG-3100? Is it the same as the one in the x86 CE package repository? Is the Snort binary 2.9.17?

And I thought of something else that might help. Can you grep the dmesg log (or elsewhere) to see if php-fpm is logging any additional information about its crash? Specifically I'm hoping it will log which library (or plugin) it is executing when the crash happens. The fact this does not happen on x86 hardware makes me wonder if there is actually a problem in one of the libraries for PHP, and Snort happens to tickle it just right so it triggers.

Actions #11

Updated by Marcos Mendoza 8 months ago

The behavior with both Snort and Suricata installed was definitely strange and didn't make sense to me. I did a fresh install and spent some more time testing to try narrowing it down.

The errors that can trigger are:
  • kernel         pid 56473 (*php*), jid 0, uid 0: exited on signal 11 (core dumped)
    
  • kernel         pid 48675 (*php-fpm*), jid 0, uid 0: exited on signal 11 (core dumped)
    
  • kernel         pid 4696 (*snort*), jid 0, uid 0: exited on signal 10
    
I can trigger the errors with ease, but what exactly triggers them... is something I hope this helps find:
  1. Go to "Services / Snort"
  2. Add interface; click Save
  3. Click on "Snort Interfaces" to return to interface list
  4. Edit interface, switch to categories
  5. Select rule, save (may trigger php-fpm error)
  6. Go to "Snort Interfaces" to return to interface list
  7. EITHER: Click on Start button (likely trigger php error; may start normally)
  8. OR: Go to "Status / Services" (likely trigger sig 10 snort error)
  9. Go to "Snort Interfaces", click Stop button
  10. Delete interface
  11. Repeat steps

When it starts normally, it says:

php     25214     [Snort] Snort START for WAN(mvneta2)...

Checking logs:
  • dmesg only shows the same signal 10/11 errors seen in system log
  • No luck trying to enable logging in /usr/local/etc/php-fpm.d/www.conf
  • /tmp/php_errors.txt also stayed empty

Using:
snort-2.9.17
pfSense-pkg-snort-4.1.3_1

Actions #12

Updated by Bill Meeks 8 months ago

Thanks for the additional info. I will investigate further. The Signal 10 from the Snort binary I am not really surprised about. There are many places in the Snort binary's C source code where memory pointer casts between types are done, and those are likely the source of the Signal 10 bus errors (which are actually non-aligned memory access aborts). The frequency and repeatability of the Signal 10 errors is influenced by the particular Snort configuration and traffic at the time. This is because the program execution steps are in many ways dependent on the traffic (data) being analyzed at a particular moment. Things like if-then tests can result in different execution paths where one path works and the other gives you a Signal 10.

The GUI issues are more perplexing. Really can't imagine how that is killing php-fpm.

I am rebuilding my package repository and test environment for the 2.5.0-RELEASE branch. That will take a few hours to complete, then I can look into this in earnest. I loaned my SG-3100 to our church for production use, so the only ARM appliance I currently have is an SG-1100. I will fire it up and see if the issue can be reproduced there, since it's also an ARM device (albeit a 64-bit one and not 32-bit like the SG-3100).

Bill

Actions #13

Updated by Bill Meeks 8 months ago

Marcos:

I'm running into difficulty updating my SG-1100 to the latest version. It is still on the 2.4.4 factory image. I am unable to get it to see the 2.4.5 branch, nor can I get it to see the new 21.02 branch. And it is also unable to see any packages (even on the 2.4.4 branch), so I can't install Snort.

Got an image file from Netgate Support to update the SG-1100 to 21.02-RELEASE.

Can we take this offline to an email thread? I assume you are a Netgate employee. You can contact me here "bill at themeeks dot net" (actual email obfuscated a bit to hinder web scraping).

Thanks,
Bill

Actions #14

Updated by Bill Meeks 8 months ago

Another Update

None of the conditions described in this bug report occur on an SG-1100 (64-bit ARM CPU), and neither do they occur on x86-64 hardware. My theory is the Signal 11 event for php-fpm is something related to 32-bit code. Might be some kind of integer overflow or something, and Snort's GUI PHP code happens to tickle it just right to trigger the event. We don't have any 32-bit x86 images to test this with, though. It would be interesting to see if these problems happen on 32-bit x86 platforms.

Since I can't reproduce this on the hardware I currently have to test with, I'm stalled with troubleshooting the issues any further.

I do have a fix for the essentially cosmetic issue of adding an interface, but not saving the change, and then browsing to another Snort GUI tab that expects a valid interface to be passed. The fixed code simply hides all of the Snort GUI tabs that expect a valid interface ID to be passed to them when adding a new interface. I will incorporate this fix into the next Snort package update.

I will be glad to help troubleshoot the SG-3100 issue remotely if we can work that out.

Actions #15

Updated by Bill Meeks 8 months ago

Update on this issue

The problem is somewhere within the PHP base function preg_match().

Here is a PHP code snippet that works fine. This is lifted from the actual Snort GUI code. Notice the last call to preg_match() is commented out.

<?php

$rule1 = 'alert ( msg:"DECODE_NOT_IPV4_DGRAM"; sid:1; gid:116; rev:1; metadata:rule-type decode; classtype:protocol-command-decode;)';

$matches = array();

if (preg_match('/\bmsg\s*:\s*"(.+?)"\s*;/i', $rule1, $matches))
      $msg = trim($matches[1]);
if (preg_match('/\bsid\s*:\s*(\d+)\s*;/i', $rule1, $matches))
      $sid = trim($matches[1]);
if (preg_match('/\bgid\s*:\s*(\d+)\s*;/i', $rule1, $matches))
      $gid = trim($matches[1]);
if (preg_match('/\brev\s*:\s*([^\;]+)/i', $rule1, $matches))
      $rev = trim($matches[1]);
//if (preg_match('/\bclasstype\s*:\s*([^\;]+)/i', $rule1, $matches))
//      $classtype = trim($matches[1]);

print $msg . "\n";
print $sid . "\n";
print $gid . "\n";
print $rev . "\n";
print $classtype . "\n";

?>

The code snippet below will fail and trigger a Signal 11 core dump in PHP on the last call to preg_match(). Notice the comments are now removed for that last call.

<?php

$rule1 = 'alert ( msg:"DECODE_NOT_IPV4_DGRAM"; sid:1; gid:116; rev:1; metadata:rule-type decode; classtype:protocol-command-decode;)';

$matches = array();

if (preg_match('/\bmsg\s*:\s*"(.+?)"\s*;/i', $rule1, $matches))
      $msg = trim($matches[1]);
if (preg_match('/\bsid\s*:\s*(\d+)\s*;/i', $rule1, $matches))
      $sid = trim($matches[1]);
if (preg_match('/\bgid\s*:\s*(\d+)\s*;/i', $rule1, $matches))
      $gid = trim($matches[1]);
if (preg_match('/\brev\s*:\s*([^\;]+)/i', $rule1, $matches))
      $rev = trim($matches[1]);
if (preg_match('/\bclasstype\s*:\s*([^\;]+)/i', $rule1, $matches))
      $classtype = trim($matches[1]);

print $msg . "\n";
print $sid . "\n";
print $gid . "\n";
print $rev . "\n";
print $classtype . "\n";

?>

And even more curious, this code below will also run successfully. Notice all the calls to preg_match() except the last one are commented out this time. This indicates to me there is nothing wrong with the regular expression in that last function call. Rather it is something in PHP itself.

<?php

$rule1 = 'alert ( msg:"DECODE_NOT_IPV4_DGRAM"; sid:1; gid:116; rev:1; metadata:rule-type decode; classtype:protocol-command-decode;)';

$matches = array();

//if (preg_match('/\bmsg\s*:\s*"(.+?)"\s*;/i', $rule1, $matches))
//      $msg = trim($matches[1]);
//if (preg_match('/\bsid\s*:\s*(\d+)\s*;/i', $rule1, $matches))
//      $sid = trim($matches[1]);
//if (preg_match('/\bgid\s*:\s*(\d+)\s*;/i', $rule1, $matches))
//      $gid = trim($matches[1]);
//if (preg_match('/\brev\s*:\s*([^\;]+)/i', $rule1, $matches))
//      $rev = trim($matches[1]);
if (preg_match('/\bclasstype\s*:\s*([^\;]+)/i', $rule1, $matches))
      $classtype = trim($matches[1]);

print $msg . "\n";
print $sid . "\n";
print $gid . "\n";
print $rev . "\n";
print $classtype . "\n";

?>

This code snippet only fails on 32-bit ARM hardware. There are no issues with this exact same code on 64-bit ARM hardware (SG-1100) or on x86-64 hardware.

Actions #16

Updated by Steve Yates 8 months ago

Simply out of curiosity I did a quick search and found this "not a bug" from 2008: https://bugs.php.net/bug.php?id=45735 "When running a preg_match with a capturing subpattern against large input, php crashes with a Segmentation Fault"

Actions #17

Updated by Bill Meeks 8 months ago

Steve Yates wrote:

Simply out of curiosity I did a quick search and found this "not a bug" from 2008: https://bugs.php.net/bug.php?id=45735 "When running a preg_match with a capturing subpattern against large input, php crashes with a Segmentation Fault"

Thanks Steve for the info and link. Very interesting!

I was planning on spending some time today searching the web to see if any similar issues had been reported.

In the case of the Snort and Suricata GUI code, it's like this is some kind of cumulative thing. As I said, the calls work individually, and they all work sequentially so long as you comment out that last one. But the last one will work individually with the others commented out.

I'm not a regex guru for sure, but I will see about finding another pattern match string that might work without generating the segfault.

Actions #18

Updated by Bill Meeks 8 months ago

Another day of frustrating and ultimately not very productive testing leads me to conclude this is something with 32-bit PHP somewhere. Simply rearranging the sequence of preg_match() calls will result in all of them succeeding. And in the original order of calls, commenting out some will result in success as well.

At first I was thinking the regex grouping criteria were perhaps leading to backtrack recursion problems (especially considering the link posted above by Steve Yates). But if the regex was truly the root cause, you would expect the backtrack recursion crash to happen in the same regex all the time no matter when it was called (when it is called with the same data). The fact that simply rearranging the sequence of calls stops the problem seems to eliminate the regex itself as the root cause. These preg_match() calls (6 of them) are designed to pull out specific pieces of a Snort or Suricata text rule. In the actual Snort GUI code they exist in a foreach() iteration that loops over each line in a rules file to process the single rule from that line. So in a given iteration, the preg_match() calls are all working from the same source string.
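
For readers who have not seen the GUI code, the per-rule loop described above can be sketched roughly like this. It reuses the five patterns shown in Note 15; the function name, the array layout and the explicit check for a `false` return are illustrative assumptions, not the actual Snort GUI code. It is not a workaround either: regrouping the calls like this changes the call sequence, which may or may not avoid the crash on affected hardware.

```php
<?php
// Hypothetical helper: extract selected fields from one Snort rule line.
function extract_rule_fields(string $rule): array
{
    // Patterns copied from the snippet in Note 15.
    $patterns = [
        'msg'       => '/\bmsg\s*:\s*"(.+?)"\s*;/i',
        'sid'       => '/\bsid\s*:\s*(\d+)\s*;/i',
        'gid'       => '/\bgid\s*:\s*(\d+)\s*;/i',
        'rev'       => '/\brev\s*:\s*([^\;]+)/i',
        'classtype' => '/\bclasstype\s*:\s*([^\;]+)/i',
    ];

    $fields = [];
    foreach ($patterns as $name => $pattern) {
        $matches = [];
        $result = preg_match($pattern, $rule, $matches);
        // preg_match() returns 1 on a match, 0 on no match, false on error.
        $fields[$name] = ($result === 1) ? trim($matches[1]) : null;
    }
    return $fields;
}

// Each element stands for one line read from a rules file.
$rules = [
    'alert ( msg:"DECODE_NOT_IPV4_DGRAM"; sid:1; gid:116; rev:1; metadata:rule-type decode; classtype:protocol-command-decode;)',
];

foreach ($rules as $rule) {
    print_r(extract_rule_fields($rule));
}
```

Every pattern in one iteration runs against the same rule string, which matches the situation described above.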

While I could fix the immediate issue (apparently) on the SG-3100 by simply rearranging the function calls, I am certainly not confident in that as a long-term solution. It is highly likely a slightly different set of circumstances could break it again.

Actions #19

Updated by Marcos Mendoza 8 months ago

  • Subject changed from Snort exit with sig 10 on SG-3100 to Snort exit with sig 11 on SG-3100
  • Affected Architecture SG-3100 added
Actions #20

Updated by Jim Pingle 7 months ago

Has anyone tried this on a 21.05 snapshot with PHP 7.4.16? The release notes for PHP 7.4.16 mention they fixed a segfault in SPL (Standard PHP Library), which could potentially be relevant. Worth trying at least.

Make sure you have an install image for 21.02-p1 on hand in case you have to downgrade.

Actions #21

Updated by Marcos Mendoza 7 months ago

Tested on:

21.05-DEVELOPMENT (arm)
built on Tue Mar 09 11:27:41 EST 2021
FreeBSD 12.2-STABLE

  1. Fresh install on SG-3100
  2. Create /tmp/test.php with example code from Bill above
  3. Ran "php /tmp/test.php"
    [21.05-DEVELOPMENT][root@pfSense.home.arpa]/root: php /tmp/test.php
    Segmentation fault (core dumped)
    

System log shows:

Mar 9 20:53:38     kernel         pid 35324 (php), jid 0, uid 0: exited on signal 11 (core dumped)

I did not install any package - all was done from console.

Actions #22

Updated by Marcos Mendoza 7 months ago

  • Project changed from pfSense Packages to pfSense Plus
  • Subject changed from Snort exit with sig 11 on SG-3100 to PHP exit with sig 11 on SG-3100
  • Category changed from Snort to PHP Interpreter
  • Target version deleted (2.6.0)
  • Affected Version deleted (2.5.0)
  • Affected Plus Version set to 21.02

Updating bug report to focus on PHP issue, given that the snort sig 10 issue is unlikely related, and this seems to affect more than just the snort package.

Actions #23

Updated by Marcos Mendoza 7 months ago

Likely related to #11605 and #11551

Actions #24

Updated by Bill Meeks 7 months ago

One of the issues identified in this ticket, the logging of "blank" interface names and the display of "Unknown" as the interface name in some GUI tabs when adding a new interface (or cloning an existing one), has been fixed in Pull Request 1058 here: https://github.com/pfsense/FreeBSD-ports/pull/1058.

Actions #25

Updated by Renato Botelho 6 months ago

Bill Meeks wrote:

One of the issues identified in this ticket, the logging of "blank" interface names and the display of "Unknown" as the interface name in some GUI tabs when adding a new interface (or cloning an existing one), has been fixed in Pull Request 1058 here: https://github.com/pfsense/FreeBSD-ports/pull/1058.

PR merged to devel branches. It will be cherry-picked to stable after some tests.

Actions #26

Updated by Arthur Wiebe 5 months ago

As posted to https://forum.netgate.com/topic/163854/sg-3100-crash-on-upgrade-restore-when-using-url-tables-and-openvpn, I've found that the attached config, which combines a URL IP table with an OpenVPN server, will trigger the PHP exit reliably in both pfSense 21.02.2 and the latest 21.05-RC.

Actions #27

Updated by Kris Phillips 5 months ago

Tested on the 21.05 RC from May 26th on the SG-3100. This issue is still present.

Actions #28

Updated by Bill Meeks 5 months ago

I have confirmed this PHP segmentation fault issue is an issue only on 32-bit ARM hardware such as that in the SG-3100. I tested an i386 version of FreeBSD-12.2 STABLE in a VMware virtual machine. The PHP code from my post earlier in this thread ran fine in PHP 7.4.19 on a 32-bit FreeBSD-12 STABLE image. The code also runs fine on an SG-1100 I have for testing (64-bit aarch64 platform).

From trouble reports on the Netgate pfSense forum, it seems at least three packages using PHP in their GUI code are known to be able to produce the segmentation fault: Snort, Suricata and pfBlockerNG-devel. Snort and Suricata produce the fault when calling the regex function preg_match(). I'm not sure exactly where the pfBlockerNG-devel code is causing the PHP crash, so I don't know what PHP function is being called. Note the Snort and Suricata bug appears to be triggered only after a particular critical number of successive preg_match() function calls. You can read my write-up with findings earlier in this thread.

Actions #29

Updated by Arthur Wiebe 5 months ago

That might explain why my example config triggers the problem, as preg_match() is used by the PHP code for urltable in /etc/inc/util.inc and for OpenVPN in /etc/inc/openvpn.inc.
The successive calls would be triggering the fault.

Is anyone from Netgate actually looking at this issue? The SG-3100 is a broken product as long as this bug exists in a production release.

Actions #30

Updated by Christian McDonald 5 months ago

A cursory search seems to suggest that the default PCRE recursion limit is too high out of the box (higher than what can fit in the call stack).

Has anyone tried reducing pcre.recursion_limit via ini_set() in the impacted files?

I’m looking into this too because the WireGuard package uses preg_* functions, though only on very short strings, so it is unlikely to be an issue there.
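
For anyone wanting to try this, a minimal sketch of lowering the limit at runtime follows. The value 1000 is an arbitrary example chosen for illustration, not a recommended setting, and the test subject string is made up.

```php
<?php
// Lower the PCRE recursion limit for the current script only.
// ini_set() returns the previous value as a string, or false on failure.
$previous = ini_set('pcre.recursion_limit', '1000'); // 1000 is an arbitrary example

$matches = [];
$result = preg_match('/\bsid\s*:\s*(\d+)\s*;/i', 'sid:1;', $matches);

if ($result === false) {
    // When a limit is hit, preg_match() returns false instead of crashing,
    // and preg_last_error() reports which limit was exceeded.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    echo 'matched: ', $result, "\n";
}

echo 'limit now: ', ini_get('pcre.recursion_limit'), "\n";
```

If the crash were caused by runaway recursion, a low limit should turn it into a `false` return with `PREG_RECURSION_LIMIT_ERROR` rather than a segmentation fault.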

Actions #31

Updated by Jim Pingle 5 months ago

If someone who can readily reproduce the PHP crash wants to try resizing the pcre.recursion_limit automatically based on the stack size, apply the attached patch with the system patches package and then either reboot or run console menu options 16 and 11 to ensure PHP and the GUI get fully reinitialized.

Actions #32

Updated by Jim Pingle 5 months ago

Using the sample code from Note 15, I can still crash it with a low recursion limit. I also tried lowering pcre.backtrack_limit, and even went down to a limit of 1.

Either way I'd expect PHP to consume the stack of the web server before it would endanger its own process, but it was worth a shot.

I did, however, find that disabling pcre.jit prevented the crash.

Try this patch instead.

Reboot after applying the patch to ensure it's used by the entire system properly from the start.
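
The contents of the attached patch are not reproduced here. As a rough script-level illustration of the same idea, the sketch below turns off the `pcre.jit` directive before running a pattern from Note 15. It assumes the directive is changed before the first preg_* call in the script, since patterns compiled earlier may be cached with JIT still in use.

```php
<?php
// Disable the PCRE JIT compiler for this script, if PHP was built with it.
// ini_get() returns false when the directive does not exist in this build.
$jit_before = ini_get('pcre.jit');
if ($jit_before !== false) {
    ini_set('pcre.jit', '0');
}

$rule = 'alert ( msg:"DECODE_NOT_IPV4_DGRAM"; sid:1; gid:116; rev:1; '
      . 'metadata:rule-type decode; classtype:protocol-command-decode;)';

$matches = [];
$result = preg_match('/\bclasstype\s*:\s*([^\;]+)/i', $rule, $matches);

echo 'pcre.jit before: ', var_export($jit_before, true), "\n";
echo 'pcre.jit now:    ', var_export(ini_get('pcre.jit'), true), "\n";
echo 'classtype:       ', $result === 1 ? trim($matches[1]) : '(no match)', "\n";
```

Setting `pcre.jit=0` in php.ini has the same effect for every script from startup, which is closer to a system-wide workaround than a per-script ini_set() call.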

Actions #33

Updated by Jim Pingle 5 months ago

A couple of others here have also confirmed that the JIT disable patch has worked around the crash on the SG-3100. I committed that to Plus so if our other work to solve the issue in a native way doesn't work out quite yet, we at least will have that as a workaround.

Actions #34

Updated by Arthur Wiebe 5 months ago

The PCRE JIT patch has resolved the issue on two problematic SG-3100 configs that I had sitting here.
Thanks Jim.

Actions #35

Updated by Marcos Mendoza 5 months ago

Given that this issue seems to only affect 32-bit systems, perhaps this is a case of needing to substitute pcre_ functions with pcre32_.

https://www.pcre.org/original/doc/html/pcrejit.html#TOC1

If you are using the 32-bit library, substitute the 32-bit functions and 32-bit structures (for example, pcre32_jit_stack instead of pcre_jit_stack).

Actions #36

Updated by Jim Pingle 5 months ago

We do not use pcre_jit_stack anywhere directly, so there is nothing to change/adjust in that regard. Also, from reading that page, it isn't about 32-bit platforms.

Actions #37

Updated by Kris Phillips 4 months ago

Decided to go through some performance testing and stress testing. I loaded the CPU to maximum with iPerf3 traffic across the LAN to WAN interface with Snort running. No appreciable difference in performance before and after the patch without Snort and Snort seems to run fine now.

CPU does hit a load average of 1.5 and 97% CPU utilization when running full tilt with Snort enabled in non-blocking mode and an iPerf3 load. Disabling Snort results in a 35% reduction in both load and CPU utilization, but speed remains around the same. There may be some efficiency loss in Snort with this patch, but since I can't do an A/B comparison running the same firmware version and the same Snort version with AND without the patch, I cannot say for certain whether this is an expected load difference or a noticeable efficiency loss from the JIT removal for PCRE. I'd have to run 2.4.5p1 with the older version of Snort to have an even remotely apples-to-apples comparison.

Either way this appears stable as a fix for now.

Actions #38

Updated by Darin May 4 months ago

Reporting that the patch in #32 fixed the core dumps of the firewall service that I hit after upgrading from 21.02.2 to 21.05 with pfBlockerNG-devel installed.

The patch is set NOT to auto-apply, as I presume the workaround or a permanent fix will be in the 21.09 release and the patch will no longer be required.

Actions #39

Updated by Kris Phillips 4 months ago

Tested in 21.09 Jun 5th build. This patch is present and no longer needs to be applied manually in the development channel.

Actions #40

Updated by Jim Pingle 4 months ago

  • Related to Todo #12004: Disable PCRE JIT to work around PHP PCRE crashes on multi-core 32-bit ARM systems added
Actions #41

Updated by Jim Pingle 4 months ago

I created #12004 for the temporary workaround via disabling PCRE JIT. This issue can remain open while we investigate long-term solutions to the root cause.

Actions #42

Updated by Darin May 4 months ago

Kris Phillips wrote:

Tested in 21.09 Jun 5th build. This patch is present and no longer needs to be applied manually in the development channel.

I'm not familiar with the criteria for bugs to be listed in the target fix list of open issues, but is it not reasonable for this issue to be on the list for 21.09? Otherwise each module affected by the bug may have to redundantly compensate for the PHP bug on its own. I do know the Snort, Suricata, and pfBlockerNG-devel tracking bugs have added or are considering adding the fix to their own code instead of waiting for the root cause to be addressed. I'm not sure which strategy is historically the right path.

https://redmine.pfsense.org/projects/pfsense/issues?query_id=186

Actions #43

Updated by Jim Pingle 4 months ago

  • Target version set to 21.09

Darin May wrote:

I'm not familiar with the criteria for bugs to be listed in the target fix list of open issues, but is it not reasonable for this issue to be on the list for 21.09? Otherwise each module affected by the bug may have to redundantly compensate for the PHP bug on its own. I do know the Snort, Suricata, and pfBlockerNG-devel tracking bugs have added or are considering adding the fix to their own code instead of waiting for the root cause to be addressed. I'm not sure which strategy is historically the right path.

https://redmine.pfsense.org/projects/pfsense/issues?query_id=186

It depends. In this case, while we'd like to fix this ASAP, it's partially blocked by problems upstream in FreeBSD and/or PHP on 32-bit ARM, so even though we are trying to get around it in various ways, we don't know when it will be solved for good. Rather than keep punting the target ahead if we can't get it fixed, we just didn't set a target, but we can certainly do so (it's there now).

That said, the workaround is already set on 21.09: #12004 -- It doesn't show in your link because it's already marked resolved, and that report is only for open issues.

Actions #44

Updated by Darin May 4 months ago

How is the cat-herding addressed so that the workaround isn't duplicated across packages? I've noticed chit-chat in the other per-package tracking bugs, with some advocating for a local version of the same patch. I presume the redundancy in this case is low-risk, but it creates a new one-to-many dependency: once the root cause is addressed, the eventual fix has to invalidate the individual package fixes so that they can be backed out.

In short (if that's still possible), who tracks the cascade of changes once the main release makes changes so that packages themselves don't have to compensate with a new patch dependency?

Actions #45

Updated by Jim Pingle 4 months ago

Darin May wrote:

How is the cat-herding addressed so that the work-around isn't duplicated across packages?

It isn't, but it ultimately doesn't matter. Eventually a package could do a version check and only apply the ini_set operation on affected versions, but that's up to each individual package if they choose to do it.
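A minimal sketch of the version check and ini_set call described above, for illustration only. The helper name, the list of machine identifiers, the `/etc/version` path, and the "fixed in" version are all assumptions; no release containing a root-cause fix is identified in this issue, so the value below is a placeholder.

```php
<?php
// Sketch only: the helper name, the machine list, and the "fixed in" version
// are assumptions for illustration, not values taken from pfSense.

/**
 * Decide whether the PCRE JIT workaround should be applied.
 */
function needs_pcre_jit_workaround(string $machine, string $version, string $fixedIn): bool
{
    // `uname -m` style identifiers assumed to denote 32-bit ARM.
    $is32BitArm = in_array(strtolower($machine), ['arm', 'armv6', 'armv7'], true);

    // Only versions older than the (hypothetical) fixed release are affected.
    return $is32BitArm && version_compare($version, $fixedIn, '<');
}

// Hypothetical placeholder: no release with a root-cause fix is known.
$fixedIn = '99.0';

// Assumes the running version string is stored in /etc/version.
$version = is_readable('/etc/version')
    ? trim((string) file_get_contents('/etc/version'))
    : '0';

if (needs_pcre_jit_workaround(php_uname('m'), $version, $fixedIn)) {
    // Affects only the current PHP process, not the system-wide php.ini.
    ini_set('pcre.jit', '0');
}
```

Because `ini_set` only changes the setting for the running script, a package using something like this would not need anything backed out of the base system later; the check simply stops matching once the version threshold is passed.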

I've noticed chit-chat in the other per-package tracking bugs, with some advocating for a local version of the same patch. I presume the redundancy in this case is low-risk, but it creates a new one-to-many dependency: once the root cause is addressed, the eventual fix has to invalidate the individual package fixes so that they can be backed out.

In the packages it isn't a patch that needs to be backed out, but a simple PHP function call that doesn't alter the base system in any way. It doesn't disable the option globally, but only on the affected code paths.
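One way a package could limit the change to a single code path is sketched below. This is an assumption about how such a call might be scoped, not the code any package actually uses; the wrapper name `with_pcre_jit_disabled` is hypothetical. The sketch does not verify whether patterns PHP has already compiled and cached before the call are affected by the toggle.

```php
<?php
// Sketch only: with_pcre_jit_disabled() is a hypothetical helper name.

/**
 * Run $fn with pcre.jit turned off, then restore the previous value.
 */
function with_pcre_jit_disabled(callable $fn)
{
    // ini_set() returns the old value, or false if the setting is unavailable.
    $previous = ini_set('pcre.jit', '0');
    try {
        return $fn();
    } finally {
        if ($previous !== false) {
            ini_set('pcre.jit', $previous);
        }
    }
}

// Usage: extract the sid from a rule while JIT is disabled.
$sid = with_pcre_jit_disabled(function () {
    $rule = 'alert ( msg:"DECODE_NOT_IPV4_DGRAM"; sid:1; gid:116; rev:1; )';
    return preg_match('/\bsid\s*:\s*(\d+)\s*;/i', $rule, $m) ? $m[1] : null;
});

print $sid . "\n";
```

Restoring the old value in `finally` keeps the rest of the script on its original setting even if the callback throws.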

Since there doesn't seem to be a huge advantage to using PCRE JIT anyhow the risk is low to nonexistent.

In short (if that's still possible), who tracks the cascade of changes once the main release makes changes so that packages themselves don't have to compensate with a new patch dependency?

Each package maintainer would need to handle changes to their own code, should they choose to take any action at all.

Actions #46

Updated by Bill Meeks 4 months ago

Jim Pingle wrote:

Each package maintainer would need to handle changes to their own code, should they choose to take any action at all.

For the Snort and Suricata packages, which I support, I don't intend to incorporate anything into the package PHP code to address this bug, and will instead rely on the fix at the pfSense system level. I feel that is the proper method anyway as PHP itself (a base system package) is the culprit. It's not something the package code is doing "wrong". It is a problem within the PHP engine itself, and the problem seems to be isolated to only 32-bit ARMv7 code.

Several users have reported success with both Snort and Suricata on SG-3100 appliances after applying the PCRE_JIT patch provided by Jim Pingle.

Actions #47

Updated by Darin May 4 months ago

I don't use either Snort or Suricata in operation, but I do use pfBlockerNG-devel, and the patch has solved the stability issues there too, especially blocking restarts and the webConfigurator after an upgrade.

Actions #48

Updated by Clinton Cory 4 months ago

I can confirm that applying the PCRE_JIT patch fixed this problem for me on 21.05.

Actions #49

Updated by Jim Pingle 4 months ago

If anyone is still having issues with PHP crashing on the 3100 after applying the PCRE JIT patch from comment 32 and rebooting, please let us know here or on the forum at https://forum.netgate.com/topic/164725/netgate-3100-php-crashes . There are still a small number of people reporting issues with things other than Snort (which is known) but thus far none of the people reporting problems have followed up with additional information.

We need more details about any ongoing issues, such as log messages or other errors encountered.

Thanks!

Actions #50

Updated by Lucas Lopes Costa 3 months ago

Jim Pingle wrote:

If anyone is still having issues with PHP crashing on the 3100 after applying the PCRE JIT patch from comment 32 and rebooting, please let us know here or on the forum at https://forum.netgate.com/topic/164725/netgate-3100-php-crashes . There are still a small number of people reporting issues with things other than Snort (which is known) but thus far none of the people reporting problems have followed up with additional information.

We need more details about any ongoing issues, such as log messages or other errors encountered.

Thanks!

I have a problem with Snort: the service does not start.
I applied the two patches mentioned, and support staff even accessed my appliance, but Snort is still stopped.
pfSense v. 21.05 SG-3100

Actions #51

Updated by Jim Pingle 3 months ago

This particular issue was narrowed to focus only on the PHP interpreter problem on the SG-3100. Snort itself crashing, as you mentioned in the forum thread to which you replied, is not relevant to the PHP crash.

Actions #52

Updated by Steve Wheeler 3 months ago

Testing against the current 21.09 snapshot, the disable-pcrejit patch is no longer required.

21.09-DEVELOPMENT (arm)
built on Thu Jul 22 01:10:26 EDT 2021
FreeBSD 12.2-STABLE

That patch is in that snapshot, but after reverting it, the following test code returns successfully.

$rule1 = 'alert ( msg:"DECODE_NOT_IPV4_DGRAM"; sid:1; gid:116; rev:1; metadata:rule-type decode; classtype:protocol-command-decode;)';

$matches = array();

if (preg_match('/\bmsg\s*:\s*"(.+?)"\s*;/i', $rule1, $matches))
      $msg = trim($matches[1]);
if (preg_match('/\bsid\s*:\s*(\d+)\s*;/i', $rule1, $matches))
      $sid = trim($matches[1]);
if (preg_match('/\bgid\s*:\s*(\d+)\s*;/i', $rule1, $matches))
      $gid = trim($matches[1]);
if (preg_match('/\brev\s*:\s*([^\;]+)/i', $rule1, $matches))
      $rev = trim($matches[1]);
if (preg_match('/\bclasstype\s*:\s*([^\;]+)/i', $rule1, $matches))
      $classtype = trim($matches[1]);

print $msg . "\n";
print $sid . "\n";
print $gid . "\n";
print $rev . "\n";
print $classtype . "\n";

That code coredumps PHP in 21.05 without the patch.
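When comparing results of the test code across builds, it may help to record whether PCRE JIT is actually switched on in the PHP process being tested. The following is a small sketch for that purpose; the function name `pcre_jit_report` is hypothetical, and it reports only the `pcre.jit` setting, not whether a given pattern was JIT-compiled.

```php
<?php
// Sketch only: pcre_jit_report() is a hypothetical helper name.

/**
 * Collect details that are useful when reporting a PCRE-related crash.
 */
function pcre_jit_report(): array
{
    // ini_get() returns false when the setting does not exist in this build.
    $jit = ini_get('pcre.jit');

    if ($jit === false) {
        $state = 'unavailable';
    } else {
        $state = filter_var($jit, FILTER_VALIDATE_BOOLEAN) ? 'enabled' : 'disabled';
    }

    return [
        'php_version'  => PHP_VERSION,
        'pcre_version' => PCRE_VERSION,
        'machine'      => php_uname('m'),
        'int_size'     => PHP_INT_SIZE, // 4 on 32-bit builds
        'pcre_jit'     => $state,
    ];
}

foreach (pcre_jit_report() as $key => $value) {
    printf("%-13s %s\n", $key, $value);
}
```

Running this immediately before the test code, in the same script, shows which JIT setting a given crash or success was observed under.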

Actions #53

Updated by Jim Pingle 3 months ago

  • Status changed from New to Feedback

Setting to Feedback for now. We can mark it resolved once we have new snapshots without the patches that disabled PCRE JIT and can do additional testing.

I reverted the relevant commits which disabled PCRE JIT, so tomorrow's build should be a good testing target.

Actions #54

Updated by Kris Phillips 3 months ago

Tested on an SG-3100 running pfSense Plus 21.05.1 built on July 30th. With blocking mode enabled and Snort running, I'm unable to crash PHP with signal 11 now.

Actions #55

Updated by Kris Phillips 3 months ago

In reboot testing with 21.05.1, I'm able to consistently get Snort to crash after a reboot. The service starts normally when started manually post-reboot and runs fine afterward, but it consistently will not start on its own after a reboot and crashes with signal 10. I rebooted my SG-3100 test firewall 3 times to verify it was consistent.

Actions #56

Updated by Steve Yates 3 months ago

consistently will not start on its own after reboot and crashes with a sig 10

Signal 10 with Snort is a different issue: https://redmine.pfsense.org/issues/12157. Try the suricata4 package instead.

Actions #57

Updated by Kris Phillips 2 months ago

Did we end up with PCRE JIT still disabled in 21.05.1, or was JIT re-enabled with the new build environment? If it was the latter, this can likely be closed out, as we no longer see this issue other than the signal 10 issues with Snort specifically.

Actions #58

Updated by Marcos Mendoza 2 months ago

Kris Phillips wrote in #note-57:

Did we end up with PCRE JIT disabled still in 21.05.1 or was the disabled JIT component re-enabled with the new build environment? If it was the latter this likely can be closed out as we no longer see this issue other than the signal 10 issues with snort specifically.

The patch is applied on 21.05.1. It is not applied on 21.09 as mentioned here (though it's unclear what fixed it): https://redmine.pfsense.org/issues/12004

Actions #59

Updated by Jim Pingle 2 months ago

  • Status changed from Feedback to Confirmed

The overall problem is still not solved. 21.05.1 shipped with JIT disabled, but JIT is enabled on 21.09 for testing.

Actually, even on a current 21.09 snapshot on a 3100 (21.09.a.20210809.0100), PHP still dumps core with the test code while JIT is enabled. The last time we tested it, a couple of weeks back, it wasn't crashing with JIT enabled on 21.09, so it's still unclear whether some other factor is at play here. Needs more investigation yet.

Actions #60

Updated by Jim Pingle about 1 month ago

  • Status changed from Confirmed to Feedback

Per Mateusz, PCRE JIT will need to be disabled on the 3100. There is currently no other way around the crash on multi-CPU 32-bit ARM systems.

Patches to disable JIT on 3100 have been restored in the tree. Should be in snapshots tomorrow.

https://gitlab.netgate.com/pfSense/factory/-/compare/5dc076f9ddcef2107447e6596136aa9c2e7539f2...8684140cd0eaf7c0c7c47fb4905a90c788835111

Actions #61

Updated by Jim Pingle about 1 month ago

  • Assignee set to Mateusz Guzik
Actions #62

Updated by Jim Pingle about 1 month ago

  • Assignee changed from Mateusz Guzik to Jim Pingle
Actions #63

Updated by Jim Pingle about 1 month ago

  • Status changed from Feedback to Closed

Cannot crash PHP with the test code on a current 21.09 snapshot.

Since disabling JIT is the best solution in this case, this issue can now be closed.

Actions #64

Updated by Jim Pingle about 1 month ago

  • Subject changed from PHP exit with sig 11 on SG-3100 to PHP exits with signal 11 on SG-3100 when calling PCRE functions

Updating subject for release notes.
