Bug #13423
closed
IPv6 neighbor discovery protocol (NDP) fails in some cases
Added by Chris Linstruth about 2 years ago. Updated 10 months ago.
100%
Description
It is proving fairly difficult to pin down a set of "steps to duplicate." In some cases an IPv6 interface seems to ignore received Neighbor Solicitation packets for an address on that interface and does not respond.
If traffic is generated from this non-responding host, causing NDP to be performed in the other direction, communication is possible until the NDP entry expires. Connections are then impossible if originated from the other host.
This looks to be something in FreeBSD upstream.
Files
packetcapture-NDP-13423.cap (3.91 KB) | Chris Linstruth, 12/07/2022 08:48 AM
Related issues
Updated by Jim Pingle about 2 years ago
A few other details:
This seems to affect only GUA (and possibly ULA) addresses; Link Local addresses always respond to NDP. I first noticed this as some of my lab VMs failing to ping their gateway when the gateway was configured as a static GUA address. Change it to use the LL address of the same host and it responds.
Packet capture shows the ND packet arrive, no response is generated.
Firewall rules are passing the ND packets, nothing is dropped by pf.
We haven't yet managed to figure out the exact circumstances around when/why it starts and have not yet been able to reproduce it on demand.
Updated by Pim Pish about 2 years ago
Here's a similar case.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263288
Updated by Jim Pingle about 2 years ago
Pim Pish wrote in #note-3:
Here's a similar case.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263288
We saw that and considered it but in this case the "bad neighbor solicitation messages" counter is still 0 and there are no logged errors as there are on that FreeBSD issue. The NS messages aren't spoofed and are from the prefix configured on both neighbors.
Updated by David Durrleman almost 2 years ago
I confirm that I experience and can reproduce this bug reliably on my local setup (pfsense 22.05-RELEASE on Netgate 1100).
When trying to access ipv6 from my TrueNAS autoconfigured address on the network advertised by pfsense, no route is found. If I ping6 the TrueNAS ip from pfsense, I get a response, and this does indeed enable ipv6 communication in the other direction for a while. It was working on 22.01, and broken after upgrading to 22.05.
I had asked for help on this issue in a forum post (https://forum.netgate.com/topic/173508/lost-ipv6-connectivity-on-truenas-core-after-upgrade-from-22-01-to-22-05) which gathers feedback from other users similarly affected. I also believe this was previously described in #12663.
Let me know if I can help troubleshoot further.
Updated by Flole Systems almost 2 years ago
Might be this issue: https://www.mail-archive.com/freebsd-net@freebsd.org/msg63838.html
There's also some info on how to attempt to reproduce it in the thread.
Updated by Chris Linstruth almost 2 years ago
This is from a system that is currently refusing to offer NDP to a host:
icmp6:
    486626 calls to icmp6_error
    0 errors not generated in response to an icmp6 message
    0 errors not generated because of rate limitation
    Output histogram:
        unreach: 11439
        packet too big: 474526
        time exceed: 661
        echo: 34403395
        echo reply: 51545205
        router solicitation: 6
        router advertisement: 160334
        neighbor solicitation: 3246252
        neighbor advertisement: 2870720
        MLDv2 listener report: 1194
    1 message with bad code fields
    0 messages < minimum length
    0 bad checksums
    0 messages with bad length
    Input histogram:
        unreach: 9319
        time exceed: 91
        echo: 51545205
        echo reply: 34347542
        router solicitation: 8552
        router advertisement: 3309133
        neighbor solicitation: 2870720
        neighbor advertisement: 3207350
        redirect: 3
    Histogram of error messages to be generated:
        0 no route
        0 administratively prohibited
        0 beyond scope
        11366 address unreachable
        73 port unreachable
        474526 packet too big
        661 time exceed transit
        0 time exceed reassembly
        0 erroneous header field
        0 unrecognized next header
        0 unrecognized option
        12791 redirect
        0 unknown
    51545205 message responses generated
    0 messages with too many ND options
    0 messages with bad ND options
    0 bad neighbor solicitation messages
    0 bad neighbor advertisement messages
    0 bad router solicitation messages
    0 bad router advertisement messages
    0 bad redirect messages
    0 path MTU changes
Updated by Chris Linstruth almost 2 years ago
Here is a packet capture filtered on the MAC address that is not receiving NDP responses. (Taken on the node that is not responding)
Updated by Jim Pingle almost 2 years ago
As with cjl, a packet capture on an affected target shows the NS arrive, but there is no NA response. Other hosts in the same segment appear to be sending and receiving NS/NA messages as expected.
The dtrace snippet from https://www.mail-archive.com/freebsd-net@freebsd.org/msg63871.html does not print anything at all when one of these sources sends an NS, which implies the packet is being dropped before it reaches that point. It does produce output for some other hosts in the same segment which are working.
Comparing the output of the packet capture, ifconfig, and ifmcstat for the interfaces on the source and target, it appears the source is sending the NS to a multicast address the target isn't joined to for some reason, so that's a lead at least. Still investigating.
Updated by Kristof Provost almost 2 years ago
Jim and I have done a bit more digging on his setup, and we believe the issue is that the interface is not joined on the Solicited-node multicast address for the relevant address.
Specifically, the interface has (trimmed for relevance):
: ifconfig igc0.40
igc0.40: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    inet6 2001:db8::1 prefixlen 64
but we see multicast groups:
: ifmcstat -i igc0.40
igc0.40:
    inet6 fe80::208:a2ff:fe12:169e%igc0.40 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::2%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:02
        group ff05::1:3 mode exclude
            mcast-macaddr 33:33:00:01:00:03
        group ff02::1:2%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:01:00:02
    inet 198.51.100.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.1 mode exclude
            mcast-macaddr 01:00:5e:00:00:01
    inet6 fe80::208:a2ff:fe12:169e%igc0.40 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff01::1%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::2:2c49:1bf4%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:2c:49:1b:f4
        group ff02::2:ff2c:491b%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:2c:49:1b
        group ff02::1%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::1:ff12:169e%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:12:16:9e
We'd expect to see group ff02::1:ff00:1 there. Its absence causes the icmp6_input() function to discard the packet (because it's for a multicast group we're not a member of). That increments the ip6s_notmember counter ("multicast packets which we don't join" in `netstat -s -p ip6`).
It's still not clear why we wouldn't have joined the correct multicast group. There's a suspicion this may be related to having multiple addresses on the interface (as is more common on HA setups), but there's no hard evidence for that just yet.
It'd be good if other affected users could repeat the ifmcstat check, just so we can confirm we're all looking at the same thing.
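For anyone repeating the check, the expected group can be worked out from the address itself. The following is a minimal sketch, assuming the standard solicited-node mapping (ff02::1:ff00:0/104 plus the low 24 bits of the unicast address, RFC 4291) and the usual IPv6 multicast-to-Ethernet mapping (33:33 plus the low 32 bits of the group, RFC 2464). The helper names `solicited_node_group` and `multicast_mac` are made up for illustration and are not part of any tool mentioned in this issue.

```python
import ipaddress


def solicited_node_group(addr):
    """Solicited-node multicast group for a unicast IPv6 address (RFC 4291)."""
    # Drop any zone suffix such as "%igc0.40" before parsing.
    unicast = ipaddress.IPv6Address(str(addr).split("%")[0])
    low24 = int(unicast) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group):
    """Ethernet MAC for an IPv6 multicast group: 33:33 + low 32 bits (RFC 2464)."""
    low32 = int(ipaddress.IPv6Address(str(group).split("%")[0])) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{o:02x}" for o in octets)


if __name__ == "__main__":
    # Addresses taken from the ifconfig/ifmcstat output in this comment.
    for addr in ("2001:db8::1", "fe80::208:a2ff:fe12:169e%igc0.40"):
        group = solicited_node_group(addr)
        print(addr, "->", group, multicast_mac(group))
```

For 2001:db8::1 this gives ff02::1:ff00:1 with MAC 33:33:ff:00:00:01, the group that is missing from the ifmcstat output. For the link-local address it gives ff02::1:ff12:169e, which is present.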
Updated by Jim Pingle almost 2 years ago
In my case I had an extra IP alias VIP on that interface for fe80:: and removing that VIP and saving/applying the interface got it back into the correct multicast groups and it started working again. Though I was then able to add the VIP back and it still was in the correct group, so it's unclear if that alone is related.
Things to check:
- Packet capture the NS and check the destination address (e.g. `ff02::1<blah>`) and destination MAC (e.g. `33:33:ff:00:00:01`)
- `ifconfig <name>` on both and note the inet6 addresses on all interfaces involved.
- `ifmcstat -i <name>` on both and the multicast groups/addresses to which they are joined.
Save/apply on the affected interface on the target, re-test, and check the output on all of those again.
If feasible, remove any IPv6 VIPs on the interface, then save/apply on the interface, re-test, and check the output on all of those again.
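The ifconfig/ifmcstat comparison above can be sketched as a small script. This is only an illustration under stated assumptions: it takes the two command outputs as plain text, looks for `inet6 <addr>` and `group <addr>` tokens (the layout seen in the outputs in this issue), and reports any address whose solicited-node group is not joined. The function names and the trimmed sample text are hypothetical.

```python
import ipaddress
import re


def solicited_node_group(addr):
    """Solicited-node multicast group for a unicast IPv6 address (RFC 4291)."""
    low24 = int(ipaddress.IPv6Address(str(addr).split("%")[0])) & 0xFFFFFF
    return ipaddress.IPv6Address(int(ipaddress.IPv6Address("ff02::1:ff00:0")) | low24)


def inet6_addresses(ifconfig_text):
    """IPv6 addresses found after 'inet6' in ifconfig-style output."""
    return [ipaddress.IPv6Address(a.split("%")[0])
            for a in re.findall(r"\binet6\s+(\S+)", ifconfig_text)]


def joined_groups(ifmcstat_text):
    """IPv6 multicast groups found after 'group' in ifmcstat-style output."""
    groups = set()
    for g in re.findall(r"\bgroup\s+(\S+)", ifmcstat_text):
        g = g.split("%")[0]
        if ":" in g:  # skip IPv4 groups such as 224.0.0.1
            groups.add(ipaddress.IPv6Address(g))
    return groups


def missing_groups(ifconfig_text, ifmcstat_text):
    """Map each address to its solicited-node group when that group is not joined."""
    joined = joined_groups(ifmcstat_text)
    return {str(a): str(solicited_node_group(a))
            for a in inet6_addresses(ifconfig_text)
            if solicited_node_group(a) not in joined}


# Trimmed sample text modelled on the igc0.40 output earlier in this issue.
IFCONFIG = "igc0.40: flags=8843<UP> metric 0 mtu 1500\n    inet6 2001:db8::1 prefixlen 64\n"
IFMCSTAT = (
    "igc0.40:\n"
    "    group ff02::2%igc0.40 scopeid 0x10 mode exclude\n"
    "    group 224.0.0.1 mode exclude\n"
    "    group ff02::1%igc0.40 scopeid 0x10 mode exclude\n"
    "    group ff02::1:ff12:169e%igc0.40 scopeid 0x10 mode exclude\n"
)

if __name__ == "__main__":
    print(missing_groups(IFCONFIG, IFMCSTAT))  # {'2001:db8::1': 'ff02::1:ff00:1'}
```

An empty result would mean every configured address has its solicited-node group joined. Running the same check before and after the save/apply step shows whether the membership changed.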
Updated by Chris Linstruth almost 2 years ago
lagg0.1301: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: XENWAN
    options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
    ether 90:ec:77:17:b3:f4
    inet6 fe80::92ec:77ff:fe17:b3f4%lagg0.1301 prefixlen 64 scopeid 0x12
    inet6 2001:470:e01a:7fff::1 prefixlen 64
    inet 172.25.228.1 netmask 0xffffff00 broadcast 172.25.228.255
    groups: vlan
    vlan: 1301 vlanpcp: 0 parent interface: lagg0
    media: Ethernet autoselect
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0.1301:
    inet6 fe80::92ec:77ff:fe17:b3f4%lagg0.1301 scopeid 0x12
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff05::1:3 mode exclude
            mcast-macaddr 33:33:00:01:00:03
        group ff02::1:2%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:00:01:00:02
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.6 mode exclude
            mcast-macaddr 01:00:5e:00:00:06
        group 224.0.0.5 mode exclude
            mcast-macaddr 01:00:5e:00:00:05
    inet6 fe80::92ec:77ff:fe17:b3f4%lagg0.1301 scopeid 0x12
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::2%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:00:00:00:02
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.1 mode exclude
            mcast-macaddr 01:00:5e:00:00:01
    inet6 fe80::92ec:77ff:fe17:b3f4%lagg0.1301 scopeid 0x12
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff01::1%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::2:d324:29d7%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:d3:24:29:d7
        group ff02::2:ffd3:2429%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:ff:d3:24:29
        group ff02::1%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::1:ff17:b3f4%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:ff:17:b3:f4
14:46:36.857676 IP6 2001:470:e01a:7fff::12 > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32
14:46:37.948117 IP6 2001:470:e01a:7fff::12 > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32
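As a rough cross-check of the capture against the group list, the sketch below parses one tcpdump-style NS line, confirms the destination is the solicited-node group for the "who has" target, and then tests whether the target has joined that group. It assumes the text layout shown in the capture lines here; `check_ns` and the regular expression are illustrative only, and the `JOINED` set is copied by hand from the lagg0.1301 ifmcstat output in this comment.

```python
import ipaddress
import re

# Matches the tcpdump text form seen in this issue (an assumption about layout).
NS_LINE = re.compile(
    r"IP6 (?P<src>\S+) > (?P<dst>[0-9a-f:]+): ICMP6, neighbor solicitation, "
    r"who has (?P<target>[0-9a-f:]+),"
)


def solicited_node_group(addr):
    """Solicited-node multicast group for a unicast IPv6 address (RFC 4291)."""
    low24 = int(addr) & 0xFFFFFF
    return ipaddress.IPv6Address(int(ipaddress.IPv6Address("ff02::1:ff00:0")) | low24)


def check_ns(line, joined):
    """Compare one NS line against the target's joined multicast groups."""
    m = NS_LINE.search(line)
    if m is None:
        return None
    dst = ipaddress.IPv6Address(m["dst"])
    target = ipaddress.IPv6Address(m["target"])
    return {
        "target": str(target),
        "dst": str(dst),
        # True when the sender used the correct solicited-node group.
        "dst_matches_target": dst == solicited_node_group(target),
        # False means the target would not accept this multicast packet.
        "target_has_joined_dst": dst in joined,
    }


# IPv6 groups from the lagg0.1301 ifmcstat output above, zone suffixes removed.
JOINED = {ipaddress.IPv6Address(g) for g in (
    "ff05::1:3", "ff02::1:2", "ff02::2", "ff01::1", "ff02::2:d324:29d7",
    "ff02::2:ffd3:2429", "ff02::1", "ff02::1:ff17:b3f4",
)}

LINE = ("14:46:36.857676 IP6 2001:470:e01a:7fff::12 > ff02::1:ff00:1: ICMP6, "
        "neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32")

if __name__ == "__main__":
    print(check_ns(LINE, JOINED))
```

With the data above, the sender's destination matches the target address, but the group is absent from the target's joined set, which is consistent with the NS being discarded.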
Updated by Kristof Provost almost 2 years ago
It may also be useful to set `net.inet6.icmp6.nd6_debug` to 1 in the system tunables and then restart the machine. When the issue recurs, the dmesg output may contain a hint about why we failed to join the group.
Updated by Tito Sacchi over 1 year ago
Opening the interface configuration page and clicking 'Save' and then 'Apply' without changing anything solves the problem for me. Maybe it reconfigures the address and joins the right multicast group.
Do you know if there is a way to automate this with a shell script? I can send POST requests to the pages with cURL but it is dumb AF.
Updated by Chris Linstruth over 1 year ago
OK -
Tested saving the interface and it did add multicast group:
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::1:ff00:1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:00:00:01
I set net.inet6.icmp6.nd6_debug=1 and rebooted. I had these multicast groups:
lagg0.1301:
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.6 mode exclude
            mcast-macaddr 01:00:5e:00:00:06
        group 224.0.0.5 mode exclude
            mcast-macaddr 01:00:5e:00:00:05
    inet6 fe80::92ec:77ff:fe46:e997%lagg0.1301 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::2%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:02
        group ff05::1:3 mode exclude
            mcast-macaddr 33:33:00:01:00:03
        group ff02::1:2%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:01:00:02
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.1 mode exclude
            mcast-macaddr 01:00:5e:00:00:01
    inet6 fe80::92ec:77ff:fe46:e997%lagg0.1301 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff01::1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::2:d324:29d7%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:d3:24:29:d7
        group ff02::2:ffd3:2429%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:d3:24:29
        group ff02::1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::1:ff46:e997%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:46:e9:97
Edited/saved the interface again and now have these:
lagg0.1301:
    inet6 fe80::92ec:77ff:fe46:e997%lagg0.1301 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::1:ff00:1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:00:00:01
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.1 mode exclude
            mcast-macaddr 01:00:5e:00:00:01
        group 224.0.0.6 mode exclude
            mcast-macaddr 01:00:5e:00:00:06
        group 224.0.0.5 mode exclude
            mcast-macaddr 01:00:5e:00:00:05
    inet6 fe80::92ec:77ff:fe46:e997%lagg0.1301 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::2%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:02
        group ff05::1:3 mode exclude
            mcast-macaddr 33:33:00:01:00:03
        group ff02::1:2%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:01:00:02
        group ff01::1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::2:d324:29d7%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:d3:24:29:d7
        group ff02::2:ffd3:2429%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:d3:24:29
        group ff02::1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::1:ff46:e997%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:46:e9:97
Reminder this is the NDP being sent by the downstream pfSense router:
15:23:05.776663 IP6 2001:470:e01a:7fff::13 > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32
15:23:05.812173 IP6 2001:470:e01a:7fff::12 > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32
Updated by Tito Sacchi over 1 year ago
I found a way to automate this process with pfSsh.php:
foreach ($config['interfaces'] as $if => $ifcfg) {
    if (isset($ifcfg['enable'])) {
        printf('Reconfiguring %1$s...%2$s', $if, PHP_EOL);
        interface_reconfigure($if);
        sleep(4);
    }
}
exec
exit
You can run this with `pfSsh.php < file_containing_above_code.php`.
It doesn't always work, though. It could be related to the following issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233535 .
I just upgraded to pfSense Plus 23.01 DEVEL on my Netgate 7100-1U; I'll see if this happens on 23.01 (FreeBSD 14) too.
Updated by Tito Sacchi over 1 year ago
It occurs on 23.01 DEVEL too. I kindly ask Netgate to take a look at this issue because it breaks IPv6 almost completely.
My GRE tunnels (over IPsec) join the right multicast groups. The affected interfaces are all 802.1Q VLANs; I'll check whether this happens on raw Ethernet interfaces too.
Updated by Matt Gaynor over 1 year ago
Also facing this issue, with the same lack of NDP response from pfSense. IPv6 is unusable when using a non link-local address directly on an ethernet interface; this is a major bug.
Updated by Jim Pingle over 1 year ago
Matt Gaynor wrote in #note-18:
Also facing this issue, with the same lack of NDP response from pfSense. IPv6 is unusable when using a non link-local address directly on an ethernet interface; this is a major bug.
If you can easily reproduce it, look at the suggestion in #note-13 and see if it produces any helpful output in the system log / kernel message buffer.
Updated by Marcos M over 1 year ago
I am running into this issue on 23.05-BETA using vmx. It seems to be similar to this issue upstream given that the comments there suggest multiple drivers are affected: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253469
Running with `net.inet6.icmp6.nd6_debug=1` and rebooting shows (though this seems to be unrelated):
nd6_options: unsupported option 24 - option ignored
Updated by Josh Balcom about 1 year ago
I am also experiencing this same issue and I can reliably reproduce it. However, I am not getting any output in dmesg with the icmpv6 debugging enabled that would suggest why it is not automatically joining the ff02::1:ff00:1 multicast group.
Updated by Marcos M about 1 year ago
- Status changed from New to Waiting on Merge
- Assignee changed from Jim Pingle to Kristof Provost
- Target version set to 2.8.0
- Plus Target Version set to 23.09
Preliminary fix upstream: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233683
Updated by Kristof Provost about 1 year ago
I've pushed the fix upstream in https://cgit.freebsd.org/src/commit/?id=9c9a76dc6873427b14f6c84397dd60ea8e529d8d and a basic test case in https://cgit.freebsd.org/src/commit/?id=b03012d0b600793d7501b4cc56757ec6150ec87f
I intend to cherry-pick that just as soon as I finish the src/ports merge I started on before those fixes went in.
Updated by Kristof Provost about 1 year ago
- Status changed from Waiting on Merge to Feedback
And that's been cherry-picked to our branches as well. Future snapshot builds will have the fix.
Updated by Jim Pingle about 1 year ago
I upgraded my edge to a dev snap with the fix and so far, so good. Everything across the board is green in my lab for monitoring and in the past at least some of them would fail. It will take some time to know for certain, however, since it seemed to be more problematic after some time up and running.
Updated by Marcos M about 1 year ago
- Status changed from Feedback to Resolved
I was able to reliably reproduce this before, and can no longer reproduce it with the fix.
Updated by Jim Pingle about 1 year ago
- Status changed from Resolved to Feedback
Let's wait until we get more real-world testing to call it completely resolved.
Updated by Jim Pingle about 1 year ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100
Seems to be solid here after several days in a row and several interface events. Gateways are still showing green throughout the lab where they would have started failing by now in the past.
Updated by Marcos M 10 months ago
- Related to Bug #13555: When WAN is lost, ipv6 interface will not renew upon WAN availability added