Bug #8973

Traffic not going to Limiter queues

Added by Victor Preatoni 8 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Limiters
Target version:
Start date: 09/28/2018
Due date:
% Done: 100%
Estimated time:
Affected Version: 2.4.4
Affected Architecture: All

Description

This bug may be related to #8956
But it's a different situation...

To get around bug #8956, I manually deleted all limiter information in the XML config file and started from scratch.
After that, queues can be created properly under each limiter.
See https://forum.netgate.com/assets/uploads/files/1537917051995-1.png

But after assigning traffic to the In/Out pipes, the dynamic queues are empty. I tried resetting states and rebooting the firewall, but no traffic goes into the limiter queues:

Limiters:
00001:   9.500 Mbit/s    0 ms burst 0 
q131073  50 sl. 0 flows (1 buckets) sched 65537 weight 0 lmax 0 pri 0 droptail
 sched 65537 type FIFO flags 0x0 0 buckets 0 active
00002: 950.000 Kbit/s    0 ms burst 0 
q131074  50 sl. 0 flows (1 buckets) sched 65538 weight 0 lmax 0 pri 0 droptail
 sched 65538 type FIFO flags 0x0 0 buckets 0 active

Queues:
q00001  50 sl. 0 flows (256 buckets) sched 1 weight 20 lmax 0 pri 0 droptail
    mask:  0x00 0x00000000/0x0000 -> 0xffffffff/0x0000
q00002  50 sl. 0 flows (256 buckets) sched 1 weight 1 lmax 0 pri 0 droptail
    mask:  0x00 0x00000000/0x0000 -> 0xffffffff/0x0000
q00003  50 sl. 0 flows (256 buckets) sched 2 weight 20 lmax 0 pri 0 droptail
    mask:  0x00 0xffffffff/0x0000 -> 0x00000000/0x0000
q00004  50 sl. 0 flows (256 buckets) sched 2 weight 1 lmax 0 pri 0 droptail
    mask:  0x00 0xffffffff/0x0000 -> 0x00000000/0x0000
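
For anyone who wants to check output like the above without reading it by eye, here is a minimal Python sketch that extracts the per-queue flow counts. It assumes the text has exactly the layout pasted in this report; the names `parse_queues` and `idle_queues` are hypothetical, and the regular expression is fitted only to this sample. A zero flow count may also just mean the scheduler in use does not create dynamic queues, so this checks the display, not whether shaping works.

```python
import re

# Trimmed copy of the limiter info text from this report.
SAMPLE = """\
Limiters:
00001:   9.500 Mbit/s    0 ms burst 0
q131073  50 sl. 0 flows (1 buckets) sched 65537 weight 0 lmax 0 pri 0 droptail
 sched 65537 type FIFO flags 0x0 0 buckets 0 active

Queues:
q00001  50 sl. 0 flows (256 buckets) sched 1 weight 20 lmax 0 pri 0 droptail
    mask:  0x00 0x00000000/0x0000 -> 0xffffffff/0x0000
q00002  50 sl. 0 flows (256 buckets) sched 1 weight 1 lmax 0 pri 0 droptail
    mask:  0x00 0x00000000/0x0000 -> 0xffffffff/0x0000
"""

# Matches lines such as "q00001  50 sl. 0 flows (256 buckets) sched 1 weight 20 ..."
QUEUE_RE = re.compile(
    r"^q(?P<id>\d+)\s+\d+ sl\.\s+(?P<flows>\d+) flows \((?P<buckets>\d+) buckets\)"
    r" sched (?P<sched>\d+) weight (?P<weight>\d+)"
)


def parse_queues(text):
    """Return one dict per queue line, with all captured fields as integers."""
    queues = []
    for line in text.splitlines():
        match = QUEUE_RE.match(line)
        if match:
            queues.append({key: int(value) for key, value in match.groupdict().items()})
    return queues


def idle_queues(text):
    """Return the ids of queues that report zero flows."""
    return [queue["id"] for queue in parse_queues(text) if queue["flows"] == 0]


if __name__ == "__main__":
    print(idle_queues(SAMPLE))
```

In the sample, every queue line reports 0 flows, which is the symptom described here.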

Problem started after upgrading from 2.4.3 to 2.4.4

limiters+rule.xml (3.24 KB) Victor Preatoni, 09/28/2018 10:15 AM

Associated revisions

Revision 25d029d1 (diff)
Added by Luiz Souza 6 months ago

Make the WF2Q+ the default scheduler for the dummynet limiters.

WF2Q+ was the default scheduler in previous versions; it is well tested and supports dynamic queues.

Add a note for the FIFO scheduler to make clear that it does not support dynamic queues (by design) and as such, it is working as intended.

Add the scheduler information to Diagnostics -> Limiter Info.

Ticket #8973

Revision fb1d9dca (diff)
Added by Luiz Souza 6 months ago

Make the WF2Q+ the default scheduler for the dummynet limiters.

WF2Q+ was the default scheduler in previous versions; it is well tested and supports dynamic queues.

Add a note for the FIFO scheduler to make clear that it does not support dynamic queues (by design) and as such, it is working as intended.

Add the scheduler information to Diagnostics -> Limiter Info.

Ticket #8973

(cherry picked from commit 25d029d1e31cc3874db82db352cd560a401558df)

History

#1 Updated by Victor Preatoni 8 months ago

This is weird, but when configuring Limiters with the CoDel AQM and the QFQ scheduler, it works. The problem exists with the default AQM (Taildrop) and the default scheduler (FIFO).

#2 Updated by Jim Pingle 8 months ago

Far more likely is that it is working properly but just not showing the traffic in the queues in the diagnostic output with certain schedulers.

Has anyone run tests to see if things are being shaped properly in these cases, without looking at Diag > Limiter Info?

#3 Updated by Victor Preatoni 8 months ago

I set a very strict limit on my DownloadLimiter and it seems to be shaping properly. Tested with testmy.net.

#4 Updated by Bipin Chandra 8 months ago

Using limiters with queues works fine with Codel and fq_codel; it's just that we are not able to see it in Limiter Info, and the log is constantly spammed with: config_aqm Unable to configure flowset, flowset busy!

Using limiters without child queues also works, and the traffic is visible in Limiter Info, but again with the constant spam of the above message.

#5 Updated by Samir Patel 7 months ago

Seeing the same as all the comments above. Taildrop and FIFO do work, but the traffic doesn't show under Diag > Limiter Info. After switching to Codel and QFQ, these also work, and now also show under Diag > Limiter Info.

#6 Updated by Samir Patel 7 months ago

Had to switch back to Taildrop/FIFO, though it is no longer possible to monitor the limiters.

With QFQ, I get a sudden flood of these, and then the whole system crashes:

Oct 9 14:31:03 kernel qfq_dequeue BUG/* non-workconserving leaf */

#7 Updated by Samir Patel 7 months ago

Can see that our traffic shaper is nonfunctional now as of 2.4.4 in terms of per-host dynamic bandwidth shaping.

The workaround for the missing queues/inability to create queues in 2.4.4 was to delete all limiters and queues, recreate them, and then reassign the queues to firewall rules. I then found out that the limiter diagnostic info was not functional with Taildrop/FIFO. Codel/QFQ eventually caused the system to crash.

The whole point of the setup was to do the amazing per-host dynamic bandwidth dividing that pfSense was so good at. I can now confirm that although the limiters/queues are recreated and working to limit the maximum aggregate bandwidth, the mask by destination address (Down_LAN) or source address (Up_LAN) does not seem to work. These queues are under the DownLimiter and UpLimiter limiters. Up_LAN is assigned to the In pipe on a LAN interface firewall rule and Down_LAN is assigned to the Out pipe. All of this was working before 2.4.4! The hosts always showed identical traffic during peak usage, dividing the total bandwidth evenly. This is nonfunctional now.
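
The per-host behaviour described here can be pictured with a small model: a queue mask selects which address bits form the flow key, and each distinct key gets its own dynamic queue. The sketch below is a conceptual illustration in Python, not dummynet's implementation; `flow_key` and `bucket_packets` are hypothetical names, and the addresses are made-up examples.

```python
import ipaddress
from collections import defaultdict


def flow_key(src, dst, src_mask, dst_mask):
    """Keep only the address bits selected by the masks (illustrative model)."""
    masked_src = int(ipaddress.IPv4Address(src)) & src_mask
    masked_dst = int(ipaddress.IPv4Address(dst)) & dst_mask
    return (masked_src, masked_dst)


def bucket_packets(packets, src_mask, dst_mask):
    """Group (src, dst, size) packets into one bucket per distinct flow key."""
    buckets = defaultdict(list)
    for src, dst, size in packets:
        buckets[flow_key(src, dst, src_mask, dst_mask)].append(size)
    return dict(buckets)


# Download traffic toward two LAN hosts (example addresses only).
packets = [
    ("203.0.113.5", "192.168.1.10", 1500),
    ("203.0.113.9", "192.168.1.10", 1500),
    ("203.0.113.5", "192.168.1.20", 1500),
]

# Full destination mask: one bucket per LAN host.
per_host = bucket_packets(packets, 0x00000000, 0xFFFFFFFF)
# No mask at all: every packet shares a single bucket.
unmasked = bucket_packets(packets, 0x00000000, 0x00000000)

if __name__ == "__main__":
    print(len(per_host), len(unmasked))
```

With a full destination mask the three packets fall into two per-host buckets; with no mask they all share one, which is the aggregate-only behaviour reported above.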

Any way I can directly verify that traffic is actually going through per-host queues?

#8 Updated by Victor Preatoni 7 months ago

Samir Patel wrote:

Had to switch back to Taildrop/FIFO, though it is no longer possible to monitor the limiters.

With QFQ, I get a sudden flood of these, and then the whole system crashes:

Oct 9 14:31:03 kernel qfq_dequeue BUG/* non-workconserving leaf */

I got that issue a few times too, syslog flooded, and then I had to manually reboot as pfSense crashed completely.

#9 Updated by Samir Patel 7 months ago

Victor Preatoni wrote:

I got that issue a few times too, syslog flooded, and then I had to manually reboot as pfSense crashed completely.

With further testing, I don't see any evidence that Taildrop/FIFO actually sends traffic through the child queues. Codel/QFQ does, but that eventually crashes the system. Finally, Codel/Round-robin seems to actually work (as it used to with Taildrop/FIFO). Traffic goes to the child queues, and the mask by source/destination address functions as expected to dynamically shape bandwidth per host. It would be nice if Taildrop/FIFO worked so that we could compare it with the performance of Codel/Round-robin.

#10 Updated by Terence Kent 7 months ago

A quick data point to confirm what Victor and Samir observed:

  • I run two pfsense boxes at different locations. The key feature we rely on is using limiters to shape traffic to handle small-bandwidth internet connections and lots of users.
  • In both deployments, due to #8956, I had to re-create our limiter/queues after upgrading to 2.4.4.
  • Due to this bug, I had to set the configuration of the limiters to use Codel/QFQ and the queues to Codel for shaping to work. I verified with a basic test network that queues are not working with TailDrop/FIFO.
  • Every 10-24 hours, both boxes would crash after producing a bunch of errors along the lines of...
       kernel qfq_dequeue BUG/* non-workconserving leaf */
       
  • After trying to use Codel / PRIO on the limiters to avoid the crashes, I ended up in a crash loop and had to boot to a recovery disk and manually update the `conf.xml` to stop using PRIO.

At this point, I've just disabled the limiters / queues. It's better for people to deal with the problems caused by no traffic shaping than have no network access :-).

FWIW, I really think this should be addressed in the same release as #8956. Unless you can both define queues and use them, the feature is still effectively broken.

#11 Updated by Samir Patel 7 months ago

Terence Kent wrote:

At this point, I've just disabled the limiters / queues. It's better for people to deal with the problems caused by no traffic shaping than have no network access :-).

Try Codel/Round-Robin. This seems to work and has been stable a couple of days now.

#12 Updated by Victor Preatoni 7 months ago

Samir Patel wrote:

Terence Kent wrote:

At this point, I've just disabled the limiters / queues. It's better for people to deal with the problems caused by no traffic shaping than have no network access :-).

Try Codel/Round-Robin. This seems to work and has been stable a couple of days now.

I'm trying it now. Let's see how it goes.

It's a big shame this bug report has not been escalated to Critical priority. Servers having kernel panics, hanging, or rebooting themselves is a SERIOUS ISSUE.

#13 Updated by Terence Kent 7 months ago

Samir Patel wrote
...Try Codel/Round-Robin. This seems to work and has been stable a couple of days now.

Thanks! I'm trying that combination at the less-critical location now, I'll update here if I see crashes again.

Victor Preatoni wrote
...It's a big shame this bug report has not been escalated to Critical priority. Servers having kernel panics, hanging, or rebooting themselves is a SERIOUS ISSUE.

I agree. This is one of those bugs that is worse than it seems at first blush. Since the kernel panic loop I encountered was caused by the pfSense configuration I selected in the UI, HA setups that replicate configuration can be taken down as well. When HA setups go down, there are lots of uncomfortable meetings afterwards.

#14 Updated by Steve Beaver 7 months ago

  • Target version set to 2.4.4-p1

#15 Updated by Steve Beaver 7 months ago

  • Assignee set to Luiz Souza

#16 Updated by Luiz Souza 6 months ago

  • Status changed from New to In Progress

#17 Updated by Luiz Souza 6 months ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 100

Sorry everyone, there is some confusion around this bug.

The FIFO scheduler was never the default scheduler, and the documentation clearly states that it stores all packets in a single queue and thus does not support dynamic queues.

The default scheduler was WF2Q+, which works fine with dynamic queues.

So I changed the default scheduler in the GUI and added a couple of notes to try to avoid future misunderstandings.

I have also added the scheduler debug data to Diagnostics -> Limiter Info.

As for the broken schedulers (QFQ, PRIO), let's open new tickets to better track these issues.
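
As a rough illustration of what a weighted fair scheduler such as WF2Q+ aims for, the sketch below computes idealized long-run shares when every queue is continuously backlogged. It uses the 9.5 Mbit/s limiter rate and the queue weights 20 and 1 from the output in the description, assuming both queues feed limiter 00001. The function `weighted_shares` is hypothetical, and real schedulers only approximate these numbers.

```python
def weighted_shares(rate_kbit, weights):
    """Split a link rate among backlogged queues in proportion to their weights.

    Idealized model only: assumes every queue always has packets waiting.
    """
    total = sum(weights.values())
    return {name: rate_kbit * weight / total for name, weight in weights.items()}


# 9.500 Mbit/s limiter with the two child queue weights from the report.
shares = weighted_shares(9500, {"q00001": 20, "q00002": 1})

if __name__ == "__main__":
    for name, kbit in shares.items():
        print(f"{name}: {kbit:.1f} Kbit/s")
```

A FIFO scheduler has no such per-queue split, since all packets wait in a single queue.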

#18 Updated by Jim Pingle 6 months ago

  • Status changed from Feedback to Resolved

Looks good here. New limiters have WF2Q+ as default. When editing a saved limiter with that scheduler, the new description shows. Limiter info screen now shows scheduler info.

#19 Updated by Victor Preatoni 6 months ago

Thanks Luiz and Jim!

While on 2.4.4, I manually switched to Worst-case Weighted Fair Queueing (WF2Q+) and it seems to be working fine.

#20 Updated by Terence Kent 5 months ago

I just noticed the updates - thanks for the fix and explanation Luiz!
