Bug #15156
Fragmented packets delayed by limiters are lost
Status: Closed
Description
The client is having issues with outgoing SIP calls specifically on an XG-1537 running 23.09.1. A KVM host with the same configuration works correctly.
Possible fragmentation handling issue.
Pre-NAT there is a 1514-byte packet followed by an 87-byte packet of the SIP INVITE, marked as "fragmented / reassembled in x+1". Post-NAT I only see 1514-byte packets. Captures are attached.
Enabling or disabling scrub has no effect on the issue.
Updated by Georgiy Tyutyunnik 11 months ago
- File ipbx-limiter-failed.pcap added
- File sfr-limiter-failed.pcap added
- File ipbx-nolimiter-failed.pcap added
- File ipbx-nolimiter-success.pcap added
- File sfr-nolimiter-failed.pcap added
- File sfr-nolimiter-success.pcap added
- File rules (nolimiter-nosuccess).debug added
- File rules (wlimiters-nosuccess).debug added
- File rules (nolimiters-success).debug added
- File netstat-afterreboot.txt added
Testing session with the client on 17.01:
Calls had a high chance of success without limiters enabled for the rule on ix0.1251, but we were not able to reproduce this reliably. During the tests with TSO disabled on their WAN, we had to reboot after re-enabling it because ix1 became unable to process traffic to the gateway.
rules.debug, netstat output, and pcaps are attached.
Updated by Kristof Provost 11 months ago
I'm still rather unclear on why this happens, and why I cannot reproduce fragmentation issues with dummynet pipes locally.
One thing does look a little suspect: both netstat outputs show dropped packets: "2428 output packets dropped due to no bufs, etc."
One of the places that counter gets incremented is in the fragmentation code. We ought to be able to get an idea of whether that's happening with a dtrace one-liner (the predicate only fires when ip_fragment() returns a non-zero error):
dtrace -n 'fbt::ip_fragment:return / arg1 != 0 / { printf("%d", arg1); stack(); }'
(End with Ctrl+C, and be sure to leave it running for at least a few seconds after a problematic fragment has been observed.)
The most likely reason for such a failure would be a memory allocation failure, so we should grab the uma stats as well:
vmstat -z
Updated by Kristof Provost 11 months ago
While we're gathering things let's also dump the dummynet pipe information:
dnctl pipe list
dnctl pipe show
Right now I'm more inclined to believe we're low on memory than anything else, but we may as well gather as much information as we can.
On that note: let's grab a process list (ps auxf -d) and sysctl -a output too.
Updated by Georgiy Tyutyunnik 11 months ago
- File pipe.txt added
- File ps.txt added
- File putty.log added
- File vmstat.txt added
- File dtraces.txt added
Attached is the requested info from today's diagnostic session.
The dtraces were taken specifically during the calls.
The pipe info is irrelevant - we tested without a limiter on the SIP rule.
Updated by Kristof Provost 11 months ago
There's a clue in the last dtrace that's backed up by the pf debug output in dmesg:
[23.09.1-RELEASE][pfsense@fr-val-net-rt1.quantificare.net]/home/pfsense: sudo dtrace -n 'fbt::pf_reassemble:entry { printf("%x -> %x , off %x", ((struct ip *)arg1)->ip_src.s_addr, ((struct ip *)arg1)->ip_dst.s_addr, ntohs(((struct ip *)arg1)->ip_off)); }'
dtrace: description 'fbt::pf_reassemble:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 14  28927            pf_reassemble:entry f6fb10ac -> b5dc39b9 , off 2000
 14  28927            pf_reassemble:entry f6fb10ac -> b5dc39b9 , off b9
 10  28927            pf_reassemble:entry 9dbe874d -> b5dc39b9 , off b9
and
pf_normalize_ip: reass frag 51646 @ 0-1480
pf_fillup_fragment: reass frag 51646 @ 0-1480
pf_reassemble: frag 51646, holes 1
pf_normalize_ip: reass frag 51646 @ 1480-1518
pf_fillup_fragment: reass frag 51646 @ 1480-1518
pf_reassemble: complete: 0xfffff8014e211000(1538)
pf_normalize_ip: reass frag 51646 @ 1480-1518
pf_fillup_fragment: reass frag 51646 @ 1480-1518
pf_reassemble: frag 51646, holes 1
Both demonstrate that we reassemble a packet from two fragments, and do so twice. The first time we get both fragments, the second time we only see the last fragment. The Wireshark capture shows that the first one was sent out, somehow bypassing the outbound reassembly. That causes the second fragment to be held while pf fruitlessly waits for the missing fragment to turn up.
The dtrace output also shows that the successful reassembly is pre-NAT, while the failed one is post-NAT, which corresponds to reassembly on the inbound packet and a second attempt on the outbound packet.
The source and destination addresses match what we see in the LAN/WAN Wireshark captures.
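For reference, the ip_off values in that trace decode to exactly the two fragments in the pf debug output: 0x2000 is the first fragment (offset 0, more-fragments flag set) and 0xb9 is the trailing fragment at byte offset 1480. A minimal standalone C snippet showing the arithmetic (IP_MF and IP_OFFMASK are the standard IPv4 header constants):

#include <stdio.h>
#include <stdint.h>

#define IP_MF      0x2000   /* more-fragments flag */
#define IP_OFFMASK 0x1fff   /* fragment offset, in 8-byte units */

static void decode(uint16_t ip_off)
{
    printf("ip_off 0x%04x: byte offset %u, MF=%d\n",
        ip_off, (ip_off & IP_OFFMASK) * 8, (ip_off & IP_MF) != 0);
}

int main(void)
{
    decode(0x2000); /* first fragment: offset 0, more fragments set */
    decode(0x00b9); /* 0xb9 * 8 = 1480: the trailing fragment       */
    return 0;
}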
It's still not clear to me how this happens, but it's a useful clue. I'm continuing to dig into the code.
Updated by Kristof Provost 11 months ago
The expected flow here is that the fragmented packets arrive, get passed through the inbound (pf_test(PF_IN)) test, reassembled (pf_reassemble()) and then processed as a single (> 1500 bytes) packet. In this configuration it'll be route-to-'d, which means pf_test(PF_IN) will call pf_route(), which tests if it'll be allowed out on the route-to interface (pf_test(PF_OUT)), and then pf_route transmits it directly, without returning to the IP stack for routing.
pf_route() may end up fragmenting it (and would, in this case), but that's after all checks are done, so the packet shouldn't end up back in pf_reassemble().
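To make that flow concrete, here is a minimal userspace model of the expected path - an illustrative sketch only, not the actual pf source; the function names simply mirror the ones discussed above:

#include <stdio.h>
#include <stdbool.h>

/* Sketch of the expected route-to path. The real code operates on
 * kernel mbufs; here each step just prints what it would do. */

static bool pf_test_out(const char *pkt)
{
    printf("pf_test(PF_OUT): checking %s\n", pkt);
    return true;    /* allowed out on the route-to interface */
}

static void if_output(const char *frag)
{
    printf("ifp->if_output: transmitting %s\n", frag);
}

static void pf_route(const char *pkt)
{
    /* pf_route() runs the outbound check, re-fragments the oversized
     * packet, and hands the pieces straight to the driver without
     * returning to the IP stack - so they should never re-enter
     * pf_reassemble(). */
    if (pf_test_out(pkt)) {
        if_output("fragment 1");
        if_output("fragment 2");
    }
}

static void pf_test_in(const char *f1, const char *f2)
{
    printf("pf_test(PF_IN): %s + %s -> pf_reassemble()\n", f1, f2);
    pf_route("reassembled >1500-byte packet");  /* route-to action */
}

int main(void)
{
    pf_test_in("fragment 1", "fragment 2");
    return 0;
}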
Given the dtrace observations it must be getting fragmented again somewhere, but I can't find a plausible path for that to happen.
It might help to know where the second pass through pf_reassemble() comes from, so the following dtrace snippet would be helpful:
sudo dtrace -n 'fbt::pf_reassemble:entry { printf("%x -> %x , off %x", ((struct ip *)arg1)->ip_src.s_addr, ((struct ip *)arg1)->ip_dst.s_addr, ntohs(((struct ip *)arg1)->ip_off)); stack(); }'
Updated by Georgiy Tyutyunnik 11 months ago
- File nolimiter_success.txt added
- File limiter_fail.txt added
- File removedlimiter_success.txt added
Adding the dtrace results from today's testing.
This time it fails with the limiter and works without it, and we are able to reproduce it.
Updated by Kristof Provost 11 months ago
Okay, that's useful.
The main point we get from this is that the second time we see the trailing fragment (and don't see the first fragment) we're being called from dummynet (via dummynet_task() and dummynet_send()).
That's consistent with the observation that it works without dummynet.
I have a theory of what could be happening here. It's rather convoluted, because the packet takes a very circuitous route through the stack.
Basically: the fragments first arrive in pf_test(PF_IN) calls, and we re-assemble it. That's all correct and expected (and supported by the stack traces dtrace reported). We then apply the actions according to our ruleset on the full packet. That's route-to, so we end up in pf_route(), which both re-fragments the packet into two and passes it to dummynet via pf_dummynet() / pf_dummynet_route().
What happens next is speculation, but it matches the observations. The first fragment is allowed through immediately; that is, dummynet doesn't take ownership of the mbuf, so we just continue processing, directly calling ifp->if_output(). That means the fragment is sent out (which we see in the Wireshark captures).
The second fragment is less lucky, and gets delayed. This means pf_pdesc_to_dnflow() translates the pf information to dummynet flow information, including the direction information. That's where things go wrong, because we're doing input processing (i.e. pf_test(PF_IN)).
So we tell dummynet that it's IPFW_ARGS_IN, and dummynet later releases the packet again, into netisr_dispatch(), which means we pf_test(PF_IN) again. That passes, because the mbuf is tagged with pf_mtag->flags PF_MTAG_FLAG_DUMMYNET. pf_test(PF_IN) removes that tag and immediately allows the packet through. The network stack routes the packet, and we end up in pf_test(PF_OUT), where we no longer have a PF_MTAG_FLAG_DUMMYNET flag set, so we perform full processing on the packet. That is, we attempt to reassemble it (as we see in the dtrace stack traces). That'll fail, because the first fragment has already been transmitted, and will never hit pf again.
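To illustrate the theorized failure sequence end to end, here is a small compilable toy model - again a sketch under the assumptions above, not the real kernel code:

#include <stdio.h>
#include <stdbool.h>

struct frag {
    const char *name;
    bool dn_tag;    /* models pf_mtag flag PF_MTAG_FLAG_DUMMYNET */
};

/* Outbound pf pass: without the dummynet tag it does full processing,
 * including reassembly, which can never complete here. */
static void pf_test_out(struct frag *f)
{
    if (f->dn_tag) {
        printf("pf_test(PF_OUT): %s tagged DUMMYNET, pass\n", f->name);
        return;
    }
    printf("pf_test(PF_OUT): %s untagged, full processing:\n", f->name);
    printf("  pf_reassemble: first fragment already transmitted, "
           "hole never fills -> %s is held and lost\n", f->name);
}

/* Inbound pf pass on re-injection: sees the tag, strips it, passes. */
static void pf_test_in_reinject(struct frag *f)
{
    printf("pf_test(PF_IN): %s re-injected by dummynet, "
           "strip tag, pass\n", f->name);
    f->dn_tag = false;  /* tag consumed on the inbound pass */
    pf_test_out(f);     /* the stack then routes it outbound */
}

/* dummynet re-injects the delayed fragment with the direction it was
 * told at enqueue time - IPFW_ARGS_IN, because pf was doing input
 * processing when pf_route() queued it. */
static void dummynet_send(struct frag *f)
{
    pf_test_in_reinject(f);     /* wrong direction: input, not output */
}

int main(void)
{
    struct frag f1 = { "fragment 1", true };
    struct frag f2 = { "fragment 2", true };

    /* pf_route() re-fragments the reassembled packet and hands both
     * pieces to dummynet via pf_dummynet_route(). */
    printf("if_output: %s sent immediately (pipe not backlogged)\n",
           f1.name);
    printf("dummynet: %s delayed by the limiter\n", f2.name);
    dummynet_send(&f2);
    return 0;
}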
I'll need a day or two to try to reproduce this in a test case. I believe the fix will be straightforward, so if the customer can run test builds we might be able to confirm it faster than that.
Odds are good that this will be fixed in 24.03.
Updated by Kristof Provost 11 months ago
I've pushed the fix upstream and cherry-picked it to our branches. The next snapshot build will have it.
Updated by Marcos M 11 months ago
- Subject changed from UDP fragmentation issue on 1537 to Fragmented packets delayed by dummynet are lost
- Category changed from Unknown to Traffic Shaper (Limiters)
- Assignee set to Kristof Provost
- Target version set to 2.8.0
- Private changed from Yes to No
- Plus Target Version set to 24.03
- Affected Architecture deleted (amd64)
Updated by Jim Pingle 11 months ago
- Subject changed from Fragmented packets delayed by dummynet are lost to Fragmented packets delayed by limiters are lost
Updating subject for release notes.