Bug #11733

closed

Web interface hangs when gateway link becomes intermittent

Added by Richard Yao over 3 years ago. Updated over 3 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: Web Interface
Target version: -
Start date: 03/26/2021
Due date:
% Done: 0%
Estimated time:
Plus Target Version:
Release Notes: Default
Affected Version: 2.5.0
Affected Architecture: All

Description

I have a failing Verizon ONT. The web interface hung when the ONT first started to fail. Logging in to pfSense over SSH works fine, and running `sudo /etc/rc.php-fpm_restart` brings the web interface back. When the ONT first started to fail, I would see plenty of messages like this in dmesg:

```
igb0: link state changed to DOWN
config_aqm Unable to configure flowset, flowset busy!
config_aqm Unable to configure flowset, flowset busy!
igb0: link state changed to UP
```

I removed AQM from a couple of ports to see if it helped, but now I just get a long chain of:

```
igb0: link state changed to DOWN
igb0: link state changed to UP
```
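To put a number on the flapping, the link messages can be counted per interface. The following is only a minimal sketch: it assumes the dmesg output has been saved as text and that link messages have exactly the `igb0: link state changed to DOWN` shape shown above. The function name `count_link_changes` is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel messages such as "igb0: link state changed to DOWN".
LINK_RE = re.compile(r"^(\w+): link state changed to (UP|DOWN)$")


def count_link_changes(dmesg_text):
    """Return a Counter keyed by (interface, new_state)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = LINK_RE.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


# Sample input built from the log lines quoted in this report.
sample = """\
igb0: link state changed to DOWN
config_aqm Unable to configure flowset, flowset busy!
config_aqm Unable to configure flowset, flowset busy!
igb0: link state changed to UP
igb0: link state changed to DOWN
igb0: link state changed to UP
"""

counts = count_link_changes(sample)
print(counts[("igb0", "DOWN")], counts[("igb0", "UP")])
```

A high count for a single interface over a short period would point at the link rather than at the router itself.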

This issue first started with 2.4.4, which prompted me to upgrade to 2.5.0. My system is set up with dynamic DNS and monitors the gateway. It also has a few dozen VLANs, but none of them are on the gateway interface. When the web interface is working, I often see "no carrier" listed while the ONT is having an issue. I am confident the port on the ONT is failing, as there are no link layer issues when I plug the ONT side of the patch cable into my switch to test it. Verizon is sending a technician to replace the ONT on Monday.

As for why this is a pfSense bug: the web interface should not hang when there is an issue with the WAN. The hang made me initially suspect the router, which delayed my call to Verizon to report the failing ONT. It did not help that the SSH interface's option to restart PHP didn't work on 2.4.x, which made me think something was wrong with the router until I upgraded to 2.5.0 as part of troubleshooting. Then I realized that on 2.4.x it was executing `/etc/rc.php-fpm_restart` with user permissions, which caused that issue.

Anyway, I wish I could provide more relevant information, but the only remotely relevant thing I can say is that the system has an X11SSH-F motherboard. I am really not sure how you would reproduce a failing ONT in order to identify the cause of the hang. I am hoping that you guys can eyeball the cause.

Actions #1

Updated by Jim Pingle over 3 years ago

  • Status changed from New to Rejected

Most likely the rapid cycling of link on the port was causing interface event processing to get backed up in a queue, and restarting PHP-FPM cleared that backlog.

Not much we can do in that kind of case with hardware failure since if we start tossing out interface events then that opens up a different set of problems.

Maybe a case could be made for some kind of detection that an interface is flapping, which leads to it being ignored for a while, but that's a much different feature than what is being described here.
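For illustration only, flap detection of that kind could look roughly like the sketch below. It is not pfSense code; the class name `FlapDetector`, its thresholds, and the hold-down approach are all assumptions. It counts link changes per interface in a sliding time window and ignores the interface for a fixed period once the count exceeds a limit.

```python
from collections import deque


class FlapDetector:
    """Suppress link events for an interface that changes state too often."""

    def __init__(self, max_changes=5, window=10.0, hold_down=60.0):
        self.max_changes = max_changes  # changes tolerated within the window
        self.window = window            # sliding window length, in seconds
        self.hold_down = hold_down      # how long to ignore a flapping port
        self.history = {}               # interface -> deque of timestamps
        self.ignored_until = {}         # interface -> time suppression ends

    def should_process(self, iface, now):
        """Record a link change at time `now`; return False if suppressed."""
        if now < self.ignored_until.get(iface, 0.0):
            return False
        times = self.history.setdefault(iface, deque())
        times.append(now)
        # Drop timestamps that have aged out of the sliding window.
        while times and now - times[0] > self.window:
            times.popleft()
        if len(times) > self.max_changes:
            # Too many changes: start the hold-down and reset the history.
            self.ignored_until[iface] = now + self.hold_down
            times.clear()
            return False
        return True
```

Dropping events this way is exactly the kind of thing that can open up other problems: once the hold-down expires, the real link state would have to be re-read, because the last UP or DOWN event may have been among those ignored.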

Actions #2

Updated by Richard Yao over 3 years ago

The ONT was just replaced. Immediately after, I tried to connect to the web interface, but I received a 502 error, as if PHP were restarting. I SSHed into the system and restarted PHP, which restored the web interface. I could ping 1.1.1.1 from pfSense's SSH shell, but not from my workstation. It seemed that NAT was not working despite connectivity being restored. I ended up rebooting the pfSense box to restore connectivity.

Jim Pingle wrote:

> Most likely the rapid cycling of link on the port was causing interface event processing to get backed up in a queue, and restarting PHP-FPM cleared that backlog.
>
> Not much we can do in that kind of case with hardware failure since if we start tossing out interface events then that opens up a different set of problems.

Could you make the event processing asynchronous so that the web interface does not block on it? If the event processing is asynchronous, you could have the web interface report the state of the queue. Having the web interface report that there were events piling up would have made identification of what was failing much easier than the current behavior does. Under the current behavior, it looked as if the router hardware was failing. That delayed resolution of the problem until I figured out what was actually happening.
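As a rough sketch of the idea, and not a description of how pfSense handles interface events, the pattern is a queue drained by a background worker, with the queue depth exposed so a status page can report it. The names `EventProcessor`, `submit`, and `backlog` are hypothetical.

```python
import queue
import threading


class EventProcessor:
    """Process events on a background thread and expose the backlog size."""

    def __init__(self, handler):
        self.handler = handler        # callable invoked once per event
        self.events = queue.Queue()   # unbounded FIFO of pending events
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, event):
        """Queue an event; returns immediately without waiting for handling."""
        self.events.put(event)

    def backlog(self):
        """Approximate number of events still waiting to be handled."""
        return self.events.qsize()

    def _run(self):
        # Worker loop: handle events one at a time, in arrival order.
        while True:
            event = self.events.get()
            try:
                self.handler(event)
            finally:
                self.events.task_done()
```

With this shape, a request handler would call `backlog()` and render the number instead of waiting behind the events themselves, so a growing backlog during link flapping would be visible rather than appearing as a hung interface.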
