Bug #15612: Captive Portal with big number of passthrough MAC addresses is causing webgui gateway timeouts, Error 50x, and HA-sync XMLRPC Error - pfSense - pfSense bugtracker

Bug #15612

open

Captive Portal with big number of passthrough MAC addresses is causing webgui gateway timeouts, Error 50x, and HA-sync XMLRPC Error

Added by Thomas Hohm 11 months ago. Updated 6 months ago.

Status: New
Priority: High
Assignee:
Category: Captive Portal
Target version:
Start date:
Due date:
% Done:
Estimated time:
Plus Target Version:
Release Notes: Default
Affected Version:
Affected Architecture:

Description

Forum discussion:
https://forum.netgate.com/topic/188936/captive-portal-with-big-number-of-passththrough-mac-addresses-is-causing-webgui-gateway-timeouts-error-50x-and-ha-sync-xmlrpc-error-broken-or-quantity-limitations/8

Updated by Thomas Hohm 11 months ago

Sorry, I submitted this by accident without details. Here they are:

The problematic behaviours:

1. Editing firewall rules: when I try to edit or save firewall rules, it takes a long time to complete, and we often get an nginx gateway timeout while saving.
2. Editing a captive portal zone: when we edit the zone with the high number of passthrough MAC addresses, saving takes a very long time and causes a 50x error. The crash reporter does not show any error (see output below); the syslog shows an "upstream timed out" message (see below).
3. HA sync fails with the XMLRPC default socket timeout (see below).

In some cases the web UI becomes accessible again after a few minutes; in other cases I have to use the SSH CLI menu to restart php-fpm to make the web UI accessible again.

Crash Reporter:

Crash report begins.  Anonymous machine information:

amd64
15.0-CURRENT
FreeBSD 15.0-CURRENT #0 plus-RELENG_24_03-n256311-e71f834dd81: Fri Apr 19 00:28:14 UTC 2024     root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/obj/amd64/Y4MAEJ2R/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/sources/FreeBS

Crash report details:

No PHP errors found.

No FreeBSD crash data found.

XMLRPC alert:

A communications error occurred while attempting to call XMLRPC method restore_config_section: Request timed out due to default_socket_timeout php.ini setting @ 2024-06-26 11:47:28

Syslog entry:

2024/06/28 08:07:14 [error] 18824#101717: *3816 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.10.100.11, server: , request: "POST /services_captiveportal.php?zone=mconweb_premium HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.10.100.64:8080", referrer: "https://10.10.100.64:8080/services_captiveportal.php?zone=mconweb_premium"
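
When many of these lines pile up, it can help to pull out which page and zone each timeout belongs to. The following is a minimal sketch, not part of pfSense: the function name `parse_upstream_timeout` and the regular expression are my own assumptions, written only against the format of the syslog line quoted above.

```python
import re

# Hypothetical pattern, derived only from the single nginx error line quoted
# in this report; other nginx error formats are not covered.
UPSTREAM_TIMEOUT = re.compile(
    r'^(?P<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[error\] .*?'
    r'upstream timed out .*?'
    r'client: (?P<client>[^,]+), .*?'
    r'request: "(?P<method>\S+) (?P<path>\S+) [^"]*"'
)


def parse_upstream_timeout(line):
    """Return timestamp, client, method and path for a timeout line, else None."""
    match = UPSTREAM_TIMEOUT.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    import sys

    # Read log lines from stdin and print only the upstream timeouts.
    for raw in sys.stdin:
        fields = parse_upstream_timeout(raw.rstrip("\n"))
        if fields:
            print(fields["timestamp"], fields["method"], fields["path"])
```

Piping the nginx error log through this script would list the timestamp and request of every upstream timeout, which makes it easier to see whether they all come from the captive portal zone page.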

- the behaviour is the same in version 23.05 (tested) and in at least one version prior to it (as far as I recall)
- we are using an HA cluster of 2x Netgate 1537 with 32 GB RAM & 500 GB SSD each
- we have 600+ MAC addresses in the captive portal zone for automatic passthrough. The problems do not occur below 100 addresses.
- we have 2 captive portal zones in total, one with 600+ MAC addresses, the other with 0 MAC addresses
- we are not using captive portal vouchers (we are using RADIUS authentication with a RADIUS server on a separate non-pfSense system)
- captive portal zones are included in the HA XMLRPC sync settings
- usually we have 1000+ users logged in to the captive portal
- as soon as we delete the captive portal zone, all problems are gone
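
To compare setups, the number of passthrough MAC entries per zone can be counted from an exported configuration. This is a minimal sketch under stated assumptions: I am assuming each zone is a child element of `<captiveportal>` named after the zone, and that each passthrough MAC is a repeated `<passthrumac>` child of the zone. The function name `count_passthrough_macs` is hypothetical, and the layout should be checked against your own export before relying on the numbers.

```python
import xml.etree.ElementTree as ET


def count_passthrough_macs(config_xml):
    """Map each captive portal zone name to its number of <passthrumac> entries.

    Assumed layout (not confirmed in this report):
    <pfsense><captiveportal><ZONE><passthrumac>...</passthrumac>...</ZONE>
    """
    root = ET.fromstring(config_xml)
    portal = root.find("captiveportal")
    if portal is None:
        return {}
    return {zone.tag: len(zone.findall("passthrumac")) for zone in portal}


if __name__ == "__main__":
    import sys

    # Usage: python count_macs.py exported-config.xml
    with open(sys.argv[1], encoding="utf-8") as handle:
        for zone, total in count_passthrough_macs(handle.read()).items():
            print(f"{zone}: {total} passthrough MAC entries")
```

Run against an exported configuration file, this prints one line per zone, which makes it easy to see which side of the roughly 100-entry threshold described above a given system is on.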

Updated by Thomas Hohm 11 months ago

Addition:
- even excluding captive portal from the XMLRPC HA sync does not fix the problem.
- I can also export the captive portal settings to XML and import them into a freshly installed system. Even during the import, the web UI responds with error 50x or an nginx gateway timeout (I've seen both; possibly the behaviour differs between 24.03 and 23.05)

Updated by Karl Ruskowski 9 months ago

We've been having a similar problem.

Main XMLRPC Error:

A communications error occurred while attempting to call XMLRPC method captive_portal_sync: Unable to connect to tls://172.16.1.252:4444. Error: Operation timed out @ 2024-08-22 07:54:18

Syslog:

Aug  2 11:33:44 pfSense01 php-fpm[45974]: /rc.carpmaster: A communications error occurred while attempting to call XMLRPC method captive_portal_sync: Unable to connect to tls://172.16.1.252:4444. Error: Operation timed out

2x Netgate hardware, version 23.09.1-RELEASE on both.
Any change to the configuration results in many of these error messages.

Updated by Karl Ruskowski 9 months ago

I was able to solve our problem. On closer inspection, our firewalls weren't syncing at all. I set the same options under System -> advanced settings -> Webconfigurator on both firewalls and the sync began working again.

Updated by Danilo Zrenjanin 6 months ago

Priority changed from Normal to High

I successfully replicated the observed behavior. Both High Availability (HA) nodes were operating on the 24.03 release. Initially, there were two zones with a total of 345 MAC address pass-through entries. The XML-RPC was failing, as indicated by the following logs:

Nov 21 15:12:35 php-fpm 4777 /rc.filter_synchronize: Retrying XMLRPC Request due to error: A communications error occurred while attempting to call XMLRPC method host_firmware_version: Request timed out due to default_socket_timeout php.ini setting.

Upon removing the second zone, which contained 88 entries, the XML-RPC functioned without issues. It is noteworthy that the firewall had no additional packages installed and was configured with only two interfaces during the testing phase.
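
For anyone who wants to repeat this test with a different number of entries, synthetic passthrough entries can be generated rather than typed in. This is only a sketch: the element names (`passthrumac`, `action`, `mac`, `descr`) and the `pass` action value are my assumptions about the configuration layout, and the function name `make_passthrough_entries` and the zone name `testzone` are hypothetical. The MAC addresses use the locally administered `02:00:00` prefix so they should not collide with real hardware.

```python
import xml.etree.ElementTree as ET


def make_passthrough_entries(count, zone="testzone"):
    """Build a zone element holding `count` synthetic passthrough MAC entries.

    Element names are assumed, not taken from this report; verify them
    against an exported configuration before importing anything.
    """
    zone_el = ET.Element(zone)
    for i in range(count):
        entry = ET.SubElement(zone_el, "passthrumac")
        ET.SubElement(entry, "action").text = "pass"
        # Spread the counter over the last three octets: 02:00:00:xx:xx:xx
        mac = "02:00:00:{:02x}:{:02x}:{:02x}".format(
            (i >> 16) & 0xFF, (i >> 8) & 0xFF, i & 0xFF
        )
        ET.SubElement(entry, "mac").text = mac
        ET.SubElement(entry, "descr").text = f"synthetic entry {i}"
    return zone_el


if __name__ == "__main__":
    # Print an XML fragment with 345 entries, the total used in the test above.
    print(ET.tostring(make_passthrough_entries(345), encoding="unicode"))
```

Varying `count` on a test system could help narrow down the entry count at which the XML-RPC timeouts begin.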

Updated by Marcos M 6 months ago

Project changed from pfSense Plus to pfSense
Category changed from Captive Portal to Captive Portal
Affected Plus Version deleted (24.03)

Updated by Timo C 6 months ago

Subject: Ongoing Issues with pfSense+ Following Update
Hello,
We are still encountering the same issues exclusively with pfSense+. Has there been any progress or changes on this matter? The project was migrated from pfSense Plus to pfSense.
Recently, we updated from 24.03 to 24.11-RELEASE (amd64), built on Fri Nov 22, 05:34:00 CET 2024. However, the update continues to cause significant disruptions to the GUI, with erratic behavior persisting.
Additionally, we've observed that one Phase 2 IKEv2 tunnel is no longer syncing properly via HA, which is particularly concerning.
Could you let us know if a fix is in the works or if there's a timeline for a resolution?
Looking forward to your response.
Kind regards,
Timo
