Bug #14396
Closed
Reassembled packets received on a VTI are not forwarded
100% done
Description
Packets larger than the MTU, which require fragmentation, cannot be routed over an IPsec VTI interface. Here is an example trace:
Incoming over VTI interface:
15:34:41.576981 AF IPv4 (2), length 1400: (tos 0x0, ttl 127, id 62903, offset 0, flags [+], proto ICMP (1), length 1396)
172.20.130.53 > 172.20.140.100: ICMP echo request, id 1, seq 464, length 1376
15:34:41.577000 AF IPv4 (2), length 656: (tos 0x0, ttl 127, id 62903, offset 1376, flags [none], proto ICMP (1), length 652)
172.20.130.53 > 172.20.140.100: ip-proto-1
Outgoing on LAN interface:
15:35:52.961652 1a:cb:63:20:dd:3f > 00:50:56:9a:dd:a2, ethertype IPv4 (0x0800), length 2042: (tos 0x0, ttl 126, id 62905, offset 0, flags [none], proto ICMP (1), length 2028)
172.20.130.53 > 172.20.140.100: ICMP echo request, id 1, seq 466, length 2008
It looks like fragmentation is applied correctly over the tunnel, but the packet is forwarded to the client on the LAN interface without being fragmented.
I found this bug from 2017, which seems to be related: https://redmine.pfsense.org/issues/7801. Unfortunately, the pull request links no longer work, so I cannot find the exact changes.
I have tried all combinations of System / Advanced / Firewall & NAT / VPN Packet Processing / "Reassemble IP Fragments until they form a complete packet", but it has no effect on the issue. It seems like something is wrong specifically when using a VTI interface.
I think it is related to the default scrub rule with fragment reassemble as indicated here https://forum.netgate.com/topic/26822/allow-fragments-in-rules.
So, I have now tried, in a lab, disabling Firewall Scrub under System / Advanced / Firewall & NAT. With this, packets which require fragmentation now work correctly over the VTI link.
However, I do not really want to disable pf scrub entirely, so I do not consider this a workaround. I am also a bit unsure whether it would break other parts of the network.
As this breaks all UDP traffic which requires fragmentation, the impact of this bug is high. In a common scenario, RADIUS for 802.1X over a VTI link breaks completely.
Updated by Jim Pingle over 1 year ago
- Status changed from New to Feedback
Can you reproduce this on a 23.05 RC snapshot?
Have you applied all of the available recommended System Patches?
This issue may also be relevant: https://redmine.pfsense.org/issues/14098
Updated by Christopher de Haas over 1 year ago
Thanks for replying. I have just updated a Netgate 4100 lab unit to 23.05-RC (23.05.r.20230519.0600). Unfortunately, the behavior is exactly the same. No system patches are available for 23.05 RC :)
Updated by Jim Pingle over 1 year ago
- Subject changed from Fragmentation broken on IPsec VTI tunnel to Fragmentation broken on IPsec VTI tunnel with scrub enabled
- Status changed from Feedback to New
- Priority changed from High to Normal
OK, thanks for checking. There wouldn't be any patches yet for 23.05, just for 23.01. If it still happens on 23.05, then the other issue I mentioned isn't related, since it would already be using the correct syntax.
And by trying that, you likely already had the same net effect as toggling the "IP Fragment Reassemble" option under System > Advanced, but you might double-check that on 23.05 just in case it didn't.
Updated by Christopher de Haas over 1 year ago
Just checked the "IP Fragment Reassemble" toggle, and it has no effect on this issue on 23.05 either.
Updated by Christopher de Haas over 1 year ago
We are scrambling a bit to at least find a workaround here. Unfortunately, disabling PF Scrub is not a viable workaround, as it breaks a lot of firewall rule processing. Specifically, fragmented UDP traffic can no longer be matched.
Updated by Marcos M over 1 year ago
- Subject changed from Fragmentation broken on IPsec VTI tunnel with scrub enabled to Fragmented packets received on a VTI are not forwarded
- Status changed from New to Confirmed
I was able to reproduce this on 23.01.
All VTI interfaces have an MTU of 1446; the rest have an MTU of 1500. Topology:
            ┌──────────────────┐
            │ external gateway │
            └─┬──────────────┬─┘
           .1 │              │
              │ 10.0.5.0/24  │
          .50 │              │ .75
┌─────────────┴─┐          ┌─┴────────────┐
│ external host │          │    router    │
└───────────────┘          └──┬────────┬──┘
                           .1 │        │ .1
               192.0.2.0/28   │        │ 198.51.100.0/28
                           .2 │        │ .2
                     ┌────────┴┐      ┌┴────────┐
                     │ site-a  │      │ site-b  │
                     │ gateway ├──────┤ gateway │
                     └─────┬───┘ vti  └────┬────┘
                        .1 │               │ .1
             172.19.1.0/24 │               │ 192.168.1.0/24
                        .4 │               │ .200
                      ┌────┴───┐      ┌────┴───┐
                      │ site-a │      │ site-b │
                      │  host  │      │  host  │
                      └────────┘      └────────┘
site-b host sends a large packet over the VTI:
# ping -s 2000 -S 192.168.1.200 172.19.1.4
PING 172.19.1.4 (172.19.1.4) 2000(2028) bytes of data.
^C
--- 172.19.1.4 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3079ms
site-b gateway receives the fragmented packets on its LAN:
[23.01-RELEASE][root@siteb-fw2.lab.arpa]/root: tcpdump -eni vmx1 'host 192.168.1.200 and host 172.19.1.4'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmx1, link-type EN10MB (Ethernet), capture size 262144 bytes
01:22:10.783783 00:50:56:b2:30:b2 > 00:50:56:b2:eb:49, ethertype IPv4 (0x0800), length 1514: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 3, seq 1, length 1480
01:22:10.783801 00:50:56:b2:30:b2 > 00:50:56:b2:eb:49, ethertype IPv4 (0x0800), length 562: 192.168.1.200 > 172.19.1.4: ip-proto-1
01:22:11.814463 00:50:56:b2:30:b2 > 00:50:56:b2:eb:49, ethertype IPv4 (0x0800), length 1514: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 3, seq 2, length 1480
01:22:11.814476 00:50:56:b2:30:b2 > 00:50:56:b2:eb:49, ethertype IPv4 (0x0800), length 562: 192.168.1.200 > 172.19.1.4: ip-proto-1
01:22:12.838502 00:50:56:b2:30:b2 > 00:50:56:b2:eb:49, ethertype IPv4 (0x0800), length 1514: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 3, seq 3, length 1480
01:22:12.838518 00:50:56:b2:30:b2 > 00:50:56:b2:eb:49, ethertype IPv4 (0x0800), length 562: 192.168.1.200 > 172.19.1.4: ip-proto-1
01:22:13.862510 00:50:56:b2:30:b2 > 00:50:56:b2:eb:49, ethertype IPv4 (0x0800), length 1514: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 3, seq 4, length 1480
01:22:13.862525 00:50:56:b2:30:b2 > 00:50:56:b2:eb:49, ethertype IPv4 (0x0800), length 562: 192.168.1.200 > 172.19.1.4: ip-proto-1
^C
8 packets captured
16947 packets received by filter
0 packets dropped by kernel
site-a gateway receives the fragmented packets on the VTI:
[23.01-RELEASE][root@sitea-fw1.lab.arpa]/root: tcpdump -eni ipsec2 'host 192.168.1.200 and host 172.19.1.4'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ipsec2, link-type NULL (BSD loopback), capture size 262144 bytes
20:26:06.349464 AF IPv4 (2), length 1448: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 4, seq 1, length 1424
20:26:06.349488 AF IPv4 (2), length 608: 192.168.1.200 > 172.19.1.4: ip-proto-1
20:26:07.367320 AF IPv4 (2), length 1448: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 4, seq 2, length 1424
20:26:07.367341 AF IPv4 (2), length 608: 192.168.1.200 > 172.19.1.4: ip-proto-1
20:26:08.391286 AF IPv4 (2), length 1448: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 4, seq 3, length 1424
20:26:08.391307 AF IPv4 (2), length 608: 192.168.1.200 > 172.19.1.4: ip-proto-1
20:26:09.415303 AF IPv4 (2), length 1448: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 4, seq 4, length 1424
20:26:09.415329 AF IPv4 (2), length 608: 192.168.1.200 > 172.19.1.4: ip-proto-1
^C
8 packets captured
278 packets received by filter
0 packets dropped by kernel
site-a gateway tries to send the reassembled packet out of its LAN interface:
[23.01-RELEASE][root@sitea-fw1.lab.arpa]/root: tcpdump -eni vmx4 'host 192.168.1.200 and host 172.19.1.4'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmx4, link-type EN10MB (Ethernet), capture size 262144 bytes
20:28:01.378099 00:50:56:b2:a5:f1 > 00:50:56:b2:c9:65, ethertype IPv4 (0x0800), length 2042: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 5, seq 1, length 2008
20:28:02.407292 00:50:56:b2:a5:f1 > 00:50:56:b2:c9:65, ethertype IPv4 (0x0800), length 2042: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 5, seq 2, length 2008
20:28:03.431311 00:50:56:b2:a5:f1 > 00:50:56:b2:c9:65, ethertype IPv4 (0x0800), length 2042: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 5, seq 3, length 2008
20:28:04.455596 00:50:56:b2:a5:f1 > 00:50:56:b2:c9:65, ethertype IPv4 (0x0800), length 2042: 192.168.1.200 > 172.19.1.4: ICMP echo request, id 5, seq 4, length 2008
^C
4 packets captured
28 packets received by filter
0 packets dropped by kernel
site-a host does not receive the packets:
[22.01-DEVELOPMENT][root@sitea-lanhost.lab.arpa]/root: tcpdump -eni vmx0 'host 192.168.1.200 and host 172.19.1.4'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmx0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
31 packets received by filter
0 packets dropped by kernel
Related Tests
I ran two similar additional tests without encapsulated traffic:
- From LAN host to WAN gateway: OK
- From WAN gateway to LAN host: OK
Another test from the external host relies on ICMP redirects to find the destination host, e.g.:
10.0.5.50 -> 172.19.1.4 ICMP request (via 10.0.5.1)
10.0.5.1  -> 10.0.5.50  ICMP redirect
10.0.5.50 -> 172.19.1.4 ICMP request (via 10.0.5.75)
In this case, the ICMP request arrives at router's WAN but is seemingly dropped (it does not appear in a pcap on its LAN). After adding a route on the external host, it works. It's not clear to me why router would behave differently when external host needs an ICMP redirect.
Updated by Kristof Provost over 1 year ago
I believe I understand what's going on here, but Marcos will test my theories on his setup soon.
Basically, there's a bug in ip_output(): we read the length of the IP packet before we hand it over to the firewall (i.e. pf) and don't re-read it afterwards. pf can change the packet size (when it's reassembling packets), which can cause us to compare the wrong size against the MTU, so we don't fragment when we should.
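To make the failure mode concrete, here is a minimal userspace model of that logic. This is an illustrative sketch only, not the actual FreeBSD ip_output() code; the 608-byte fragment and 2028-byte datagram sizes are taken from the captures above.

#include <stdio.h>

struct packet {
	unsigned len;	/* total IP datagram length in bytes */
};

/* Stands in for pf reassembling the final fragment into the full datagram. */
static void
pfil_hook(struct packet *p)
{
	if (p->len == 608)	/* the short last fragment from the captures */
		p->len = 2028;	/* reassembled datagram, larger than the MTU */
}

static void
ip_output_model(struct packet *p, unsigned mtu, int reread)
{
	unsigned ip_len = p->len;	/* length read BEFORE the filter runs */

	pfil_hook(p);			/* pf may change the packet size here */

	if (reread)
		ip_len = p->len;	/* the fix: re-read after the hook */

	if (ip_len <= mtu)
		printf("len %u <= mtu %u: sent whole (real size %u, dropped downstream)\n",
		    ip_len, mtu, p->len);
	else
		printf("len %u > mtu %u: fragmented correctly\n", ip_len, mtu);
}

int
main(void)
{
	struct packet a = { .len = 608 }, b = { .len = 608 };

	ip_output_model(&a, 1500, 0);	/* buggy path: stale length, no fragmentation */
	ip_output_model(&b, 1500, 1);	/* fixed path: re-read length triggers fragmentation */
	return (0);
}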
That's a kernel bug, but Marcos will also test an idea I have for a workaround (basically, we'll generate scrub rules that only run on incoming packets, not outgoing ones). That ought to mean this won't manifest either.
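In pf terms, the workaround is to give the generated scrub rules an explicit "in" direction so reassembly only happens on the inbound pass, e.g. (these match the patched rules.debug excerpt quoted later in this thread; the max-mss value comes from that configuration):
scrub in from any to <vpn_networks> max-mss 1300 fragment reassemble
scrub in from <vpn_networks> to any max-mss 1300 fragment reassemble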
Updated by Marcos M over 1 year ago
- Status changed from Confirmed to In Progress
- Assignee set to Kristof Provost
- Target version set to 23.09
Updated by Marcos M over 1 year ago
Christopher de Haas please test the following patch (apply then reboot) to work around the issue on 23.01/23.05:
Updated by Christopher de Haas over 1 year ago
Hi Marcos,
Thank you very much! I have tested in a small lab, and the patch seems to work as intended. I will test in a bigger setup as soon as possible.
Christopher
Updated by Kristof Provost over 1 year ago
I've merged the network stack fix into the devel-main branch. It'll be present in tomorrow's 2.7 snapshots and get merged to plus-devel-main in due course.
commit 650dcbc3051c01d3a40831b6d5c91873a328f259 (HEAD -> devel-main, origin/devel-main)
Author: Kristof Provost <kp@FreeBSD.org>
Date:   Fri Jun 2 16:38:30 2023 +0200

    netinet: re-read IP length after PFIL hook

    The pfil hook may modify the packet, so before we check its length (to
    decide if it needs to be fragmented or not) we should re-read that
    length. This is most likely to happen when pf is reassembling packets.

    In that scenario we'd receive the last fragment, which is likely to be
    a short packet, pf would reassemble it (likely exceeding the interface
    MTU) and then we'd transmit it without fragmenting, because we're
    comparing the MTU to the length of the last fragment, not the fully
    reassembled packet.

    See also: https://redmine.pfsense.org/issues/14396

    Reviewed by:    cy
    MFC after:      3 weeks
    Sponsored by:   Rubicon Communications, LLC ("Netgate")
    Differential Revision:  https://reviews.freebsd.org/D40395

    (cherry picked from commit 185c1cddd7ef34db82bc3a25b3c92556416a4e55)
Updated by Christopher de Haas over 1 year ago
Hello again,
I am working on more extensive testing in a full setup. With the patch, I still see messages like this in a packet capture on the VTI interface. Both routers are running 23.05 with your patch. Is that expected?
11:18:28.010586 IP 172.20.190.40.51752 > 172.20.130.2.1812: UDP, bad length 1811 > 1368
Updated by Christopher de Haas over 1 year ago
It seems that if I enable "Reassemble IP Fragments until they form a complete packet" in combination with your fix, everything works as expected. I will continue testing, but so far this looks good. I am not sure why the VPN fragment reassemble option should have any effect here.
Updated by Jim Pingle over 1 year ago
- Project changed from pfSense Plus to pfSense
- Category changed from IPsec to IPsec
- Status changed from In Progress to Feedback
- Target version changed from 23.09 to 2.7.0
- % Done changed from 0 to 100
- Affected Plus Version deleted (23.01)
- Plus Target Version set to 23.05.1
Updated by Christopher de Haas over 1 year ago
Thank you all very much for taking this issue seriously.
Something is still not quite right here. I am testing with two pfSense 23.05 routers. The patch is applied on both, and the "Reassemble IP Fragments until they form a complete packet" option has no effect on this problem.
Firewall rules for IPsec traffic are on the IPsec tab. Fragmented traffic cannot be matched, as it appears as fragments on the IPsec tab. The firewall logs look like this; I would not expect traffic to show as fragmented here.
I have looked at the patch and, as far as I can understand it, it makes sense. However, looking at /tmp/rules.debug, the <vpn_networks> table does not include all of the correct networks. What is this table built from? Specifically, for a VTI tunnel, how can the remote networks be known? I can see the missing network in question in the routing table as:
172.20.190.0/24 172.20.188.2 UG1 92 1400 ipsec46
Which is correct, but presumably this is not the source of the <vpn_networks> table?
Forcing a filter reload from Status / Filter Reload does not fix it; the network is still not in the vpn_networks table.
Updated by Christopher de Haas over 1 year ago
I found the filter_get_vpns_list() function, and as far as I can tell it will never include networks routed over a VTI link. Here is why I believe it appeared to work:
We are migrating from tunnel-mode IPsec, and the old tunnel's phase 1 is disabled. Looking at filter_get_vpns_list(), there is another apparent bug here: it only excludes phase 2 entries which are themselves disabled, without considering that the entire phase 1 may be disabled. That is why some networks still appear in the vpn_networks table. I have validated this by adding a bogus disabled phase 1 with phase 2 entries for the routed networks; the vpn_networks table now includes those networks, and the original reassembly patch works. However, this is obviously not a viable workaround.
Would it make sense for the filter_get_vpns_list() function to look at the routing table instead, and take the networks which are routed over an IPsec (or other tunnel) interface?
Or maybe the original patch can look at interfaces rather than L3 networks?
(The routes get into the routing table via FRR/OSPF.)
Updated by Marcos M over 1 year ago
- Status changed from Feedback to Resolved
I can confirm that the patch works correctly with both reassembly and filtering (FWIW, the actual fix cannot be applied via a patch). From your description, it sounds like there may be a configuration issue, which would need to be discussed separately (e.g. on the forums).
For reference, VTI phase 2 configurations use dedicated interfaces with their own scrub rules, hence the vpn_networks table is not a factor. That said, I wouldn't expect disabled phase 2 entries to be included in the table; that would be a separate issue report.
Updated by Marcos M over 1 year ago
- Subject changed from Fragmented packets received on a VTI are not forwarded to Reassembled packets received on a VTI are not forwarded
Updated by Christopher de Haas over 1 year ago
I would very much like to understand what I am missing here. The patch changes
- $scrubrules .= "scrub from any to <vpn_networks> {$maxmss} {$scrubnodf} {$fragreassemble}\n";
- $scrubrules .= "scrub from <vpn_networks> to any {$maxmss} {$scrubnodf} {$fragreassemble}\n";
to
+ $scrubrules .= "scrub in from any to <vpn_networks> {$maxmss} {$scrubnodf} {$fragreassemble}\n";
+ $scrubrules .= "scrub in from <vpn_networks> to any {$maxmss} {$scrubnodf} {$fragreassemble}\n";
In my rules.debug (23.05 with the patch applied) I have:
scrub in from any to <vpn_networks> max-mss 1300 fragment reassemble
scrub in from <vpn_networks> to any max-mss 1300 fragment reassemble
Thus I can presumably conclude that the patch is applied. The changed scrub rules very clearly depend on <vpn_networks>, and what I am trying to say is that the vpn_networks table is not correct: it does not include networks which are routed to via a VTI link, and it could not, because those networks are not necessarily statically known. As far as I understand the code, that is also evident from the filter_get_vpns_list() function, and my rules.debug confirms that those networks are not in the table.
I get that including phase 2 networks from disabled phase 1 entries in the vpn_networks table is a separate bug. For this issue, it is only interesting because that bug enables a workaround: create a disabled phase 1 with phase 2 entries for the remote site networks, so that they become part of the vpn_networks table and the patch behaves correctly. But again, this is obviously not a viable solution.