Bug #13423

closed

IPv6 neighbor discovery protocol (NDP) fails in some cases

Added by Chris Linstruth about 2 years ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Category: IPv6 Router Advertisements (radvd/rtsold)
Target version:
Start date:
Due date:
% Done: 100%
Estimated time:
Plus Target Version: 23.09
Release Notes: Default
Affected Version:
Affected Architecture:

Description

It is proving fairly difficult to pin down a set of "steps to duplicate." In some cases an IPv6 interface seems to ignore received Neighbor Solicitation packets for an address on that interface and does not respond.

If traffic is generated from this non-responding host, causing NDP to be performed in the other direction, communication is possible until the NDP entry expires. Connections are then impossible if originated from the other host.

This looks to be something in FreeBSD upstream.


Files

packetcapture-NDP-13423.cap (3.91 KB) Chris Linstruth, 12/07/2022 08:48 AM

Related issues

Related to Bug #13555: When WAN is lost, ipv6 interface will not renew upon WAN availability (Duplicate)

Actions #1

Updated by Jim Pingle about 2 years ago

A few other details:

This seems to affect only GUA (and possibly ULA) addresses; link-local addresses always respond to NDP. I first noticed this when some of my lab VMs failed to ping their gateway when the gateway was configured as a static GUA address. Changing it to use the LL address of the same host makes it respond.

Packet capture shows the ND packet arrive, no response is generated.

Firewall rules are passing the ND packets, nothing is dropped by pf.

We haven't yet managed to figure out the exact circumstances around when/why it starts and have not yet been able to reproduce it on demand.

Actions #4

Updated by Jim Pingle about 2 years ago

Pim Pish wrote in #note-3:

Here's a similar case.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263288

We saw that and considered it but in this case the "bad neighbor solicitation messages" counter is still 0 and there are no logged errors as there are on that FreeBSD issue. The NS messages aren't spoofed and are from the prefix configured on both neighbors.
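
The counter check described above can be scripted. The following is a minimal sketch, assuming the text was produced by something like `netstat -s -p icmp6` and that counter lines have the form `<count> <description>`. The function name `parse_counters` and the sample excerpt are illustrative only.

```python
import re

# Matches counter lines such as "    0 bad neighbor solicitation messages".
# Histogram lines ("unreach: 11439") start with a word, so they are skipped.
COUNTER = re.compile(r"^[ \t]*(\d+)[ \t]+(.+?)[ \t]*$")

def parse_counters(text):
    """Map each counter description to its integer value."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER.match(line)
        if m:
            counters[m.group(2)] = int(m.group(1))
    return counters

# Hypothetical excerpt in the format assumed above.
SAMPLE = """icmp6:
    Input histogram:
        neighbor solicitation: 2870720
    0 messages with too many ND options
    0 messages with bad ND options
    0 bad neighbor solicitation messages
    0 bad neighbor advertisement messages
"""

counters = parse_counters(SAMPLE)
print(counters["bad neighbor solicitation messages"])
```

A non-zero value for that counter would point toward the upstream FreeBSD issue linked above rather than this one.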

Actions #5

Updated by David Durrleman about 2 years ago

I confirm that I experience and can reliably reproduce this bug on my local setup (pfSense 22.05-RELEASE on Netgate 1100).
When trying to access IPv6 from my TrueNAS autoconfigured address on the network advertised by pfSense, no route is found. If I ping6 the TrueNAS IP from pfSense, I get a response, and this does indeed enable IPv6 communication in the other direction for a while. It was working on 22.01 and broke after upgrading to 22.05.

I had asked for help on this issue in a forum post (https://forum.netgate.com/topic/173508/lost-ipv6-connectivity-on-truenas-core-after-upgrade-from-22-01-to-22-05) which gathers feedback from other users similarly affected. I also believe this was previously described in #12663.
Let me know if I can help troubleshoot further.

Actions #6

Updated by Flole Systems about 2 years ago

Might be this issue: https://www.mail-archive.com/freebsd-net@freebsd.org/msg63838.html

There's also some info on how to attempt to reproduce it in the thread.

Actions #7

Updated by Chris Linstruth almost 2 years ago

This is from a system that is currently refusing to offer NDP to a host:

icmp6:
    486626 calls to icmp6_error
    0 errors not generated in response to an icmp6 message
    0 errors not generated because of rate limitation
    Output histogram:
        unreach: 11439
        packet too big: 474526
        time exceed: 661
        echo: 34403395
        echo reply: 51545205
        router solicitation: 6
        router advertisement: 160334
        neighbor solicitation: 3246252
        neighbor advertisement: 2870720
        MLDv2 listener report: 1194
    1 message with bad code fields
    0 messages < minimum length
    0 bad checksums
    0 messages with bad length
    Input histogram:
        unreach: 9319
        time exceed: 91
        echo: 51545205
        echo reply: 34347542
        router solicitation: 8552
        router advertisement: 3309133
        neighbor solicitation: 2870720
        neighbor advertisement: 3207350
        redirect: 3
    Histogram of error messages to be generated:
        0 no route
        0 administratively prohibited
        0 beyond scope
        11366 address unreachable
        73 port unreachable
        474526 packet too big
        661 time exceed transit
        0 time exceed reassembly
        0 erroneous header field
        0 unrecognized next header
        0 unrecognized option
        12791 redirect
        0 unknown
    51545205 message responses generated
    0 messages with too many ND options
    0 messages with bad ND options
    0 bad neighbor solicitation messages
    0 bad neighbor advertisement messages
    0 bad router solicitation messages
    0 bad router advertisement messages
    0 bad redirect messages
    0 path MTU changes
Actions #8

Updated by Chris Linstruth almost 2 years ago

Here is a packet capture filtered on the MAC address that is not receiving NDP responses. (Taken on the node that is not responding)

Actions #9

Updated by Jim Pingle almost 2 years ago

As with cjl, a packet capture on an affected target shows the NS arrive, but there is no NA response. Other hosts in the same segment appear to be sending and receiving NS/NA messages as expected.

Using the dtrace snippet from https://www.mail-archive.com/freebsd-net@freebsd.org/msg63871.html it does not print anything at all when one of these sources sends an NS, which implies it's being dropped before it reaches that point. It does produce output for some other hosts in the same segment which are working.

Comparing the output of the packet capture, ifconfig, and ifmcstat for the interfaces on the source and target, it appears the source is sending the NS to a multicast address the target isn't joined to for some reason, so that's a lead at least. Still investigating.
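
To confirm that the sender is addressing the NS correctly, the destination can be compared against the solicited-node group derived from the target. This is a rough sketch, assuming tcpdump's one-line text format as it appears in the captures posted in this issue; `check_ns` and `solicited_node_group` are hypothetical helper names.

```python
import ipaddress
import re

# tcpdump text line: "... IP6 <src> > <dst>: ICMP6, neighbor solicitation, who has <target>, length N"
NS_LINE = re.compile(
    r"IP6 (\S+) > (\S+): ICMP6, neighbor solicitation, who has (\S+), length"
)

def solicited_node_group(addr):
    """ff02::1:ff00:0/104 plus the low 24 bits of the unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)

def check_ns(line):
    """Return (destination, expected group, matches) for an NS line, else None."""
    m = NS_LINE.search(line)
    if m is None:
        return None
    _src, dst, target = m.groups()
    expected = solicited_node_group(target)
    return dst, str(expected), ipaddress.IPv6Address(dst) == expected

line = ("14:46:36.857676 IP6 2001:470:e01a:7fff::12 > ff02::1:ff00:1: "
        "ICMP6, neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32")
print(check_ns(line))
```

If the destination matches the expected group, the sender is behaving correctly and the remaining question is whether the target has joined that group. An NS sent to a unicast destination (as used for reachability probes) would report a mismatch here, so the check only applies to multicast solicitations.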

Actions #10

Updated by Kristof Provost almost 2 years ago

Jim and I have done a bit more digging on his setup, and we believe the issue is that the interface is not joined on the Solicited-node multicast address for the relevant address.

Specifically, the interface has (trimmed for relevance):

: ifconfig igc0.40
igc0.40: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    inet6 2001:db8::1 prefixlen 64

but we see multicast groups:

: ifmcstat -i igc0.40
igc0.40:
    inet6 fe80::208:a2ff:fe12:169e%igc0.40 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::2%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:02
        group ff05::1:3 mode exclude
            mcast-macaddr 33:33:00:01:00:03
        group ff02::1:2%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:01:00:02
    inet 198.51.100.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.1 mode exclude
            mcast-macaddr 01:00:5e:00:00:01
    inet6 fe80::208:a2ff:fe12:169e%igc0.40 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff01::1%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::2:2c49:1bf4%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:2c:49:1b:f4
        group ff02::2:ff2c:491b%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:2c:49:1b
        group ff02::1%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::1:ff12:169e%igc0.40 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:12:16:9e

We'd expect to see group ff02::1:ff00:1 there. Its absence causes the icmp6_input() function to discard the packet (because it's for a multicast group we're not a member of). That increments the ip6s_notmember counter ("multicast packets which we don't join" in `netstat -s -p ip6`).
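
For reference, the expected group follows from the address itself: the solicited-node group is ff02::1:ff00:0/104 combined with the low 24 bits of the unicast address, and the Ethernet destination is 33:33 followed by the low 32 bits of the group. A small sketch of that derivation, with hypothetical function names:

```python
import ipaddress

def solicited_node_group(addr):
    """Return the solicited-node multicast group for a unicast IPv6 address."""
    # Drop any %scope suffix before parsing.
    low24 = int(ipaddress.IPv6Address(addr.split("%")[0])) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)

def multicast_mac(group):
    """Map an IPv6 multicast group to its Ethernet address (33:33 + low 32 bits)."""
    low32 = int(group) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{o:02x}" for o in octets)

for addr in ("2001:db8::1", "fe80::208:a2ff:fe12:169e%igc0.40"):
    group = solicited_node_group(addr)
    print(addr, group, multicast_mac(group))
```

For 2001:db8::1 this gives ff02::1:ff00:1, the group missing from the ifmcstat output above. For the link-local address it gives ff02::1:ff12:169e, which is present.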

It's still not clear why we wouldn't have joined the correct multicast group. There's a suspicion this may be related to having multiple addresses on the interface (as is more common on HA setups), but there's no hard evidence for that just yet.

It'd be good if other affected users could repeat the ifmcstat check, just so we can confirm we're all looking at the same thing.

Actions #11

Updated by Jim Pingle almost 2 years ago

In my case I had an extra IP alias VIP on that interface for fe80:: and removing that VIP and saving/applying the interface got it back into the correct multicast groups and it started working again. Though I was then able to add the VIP back and it still was in the correct group, so it's unclear if that alone is related.

Things to check:

  • Packet capture the NS and check the destination address (e.g. ff02::1<blah>) and destination MAC (e.g. 33:33:ff:00:00:01).
  • Run ifconfig <name> on both and note the inet6 addresses on all interfaces involved.
  • Run ifmcstat -i <name> on both and note the multicast groups/addresses to which they are joined.

Save/apply on the affected interface on the target, re-test, and check the output on all of those again.

If feasible, remove any IPv6 VIPs on the interface, then save/apply on the interface, re-test, and check the output on all of those again.
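
The comparison between the ifconfig and ifmcstat output can also be done mechanically. Below is a minimal sketch that works on saved command output as text, assuming the line layouts shown in this issue (`inet6 <addr> ...` and `group <addr> ...`). The helper names and the trimmed sample data are illustrative, not part of pfSense.

```python
import ipaddress
import re

def solicited_node_group(addr):
    """Solicited-node group (ff02::1:ffXX:XXXX) for a unicast IPv6 address."""
    low24 = int(ipaddress.IPv6Address(addr.split("%")[0])) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(base | low24))

def inet6_addresses(ifconfig_text):
    """Collect inet6 addresses from `ifconfig <name>` output."""
    return re.findall(r"^[ \t]*inet6[ \t]+(\S+)", ifconfig_text, flags=re.MULTILINE)

def joined_groups(ifmcstat_text):
    """Collect IPv6 groups from `ifmcstat -i <name>` output, scope stripped."""
    found = re.findall(r"^[ \t]*group[ \t]+(\S+)", ifmcstat_text, flags=re.MULTILINE)
    return {str(ipaddress.IPv6Address(g.split("%")[0])) for g in found if ":" in g}

def missing_groups(ifconfig_text, ifmcstat_text):
    """Map each address to its solicited-node group when that group is not joined."""
    joined = joined_groups(ifmcstat_text)
    missing = {}
    for addr in inet6_addresses(ifconfig_text):
        group = solicited_node_group(addr)
        if group not in joined:
            missing[addr] = group
    return missing

# Trimmed sample data in the format shown earlier in this issue.
IFCONFIG = """igc0.40: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    inet6 2001:db8::1 prefixlen 64
"""
IFMCSTAT = """igc0.40:
    inet6 fe80::208:a2ff:fe12:169e%igc0.40 scopeid 0x10
        group ff02::2%igc0.40 scopeid 0x10 mode exclude
        group ff02::1%igc0.40 scopeid 0x10 mode exclude
        group ff02::1:ff12:169e%igc0.40 scopeid 0x10 mode exclude
    inet 198.51.100.1
        group 224.0.0.1 mode exclude
"""

print(missing_groups(IFCONFIG, IFMCSTAT))
```

An empty result after a save/apply would indicate that the interface has rejoined the groups it needs.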

Actions #12

Updated by Chris Linstruth almost 2 years ago

lagg0.1301: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: XENWAN
    options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
    ether 90:ec:77:17:b3:f4
    inet6 fe80::92ec:77ff:fe17:b3f4%lagg0.1301 prefixlen 64 scopeid 0x12
    inet6 2001:470:e01a:7fff::1 prefixlen 64
    inet 172.25.228.1 netmask 0xffffff00 broadcast 172.25.228.255
    groups: vlan
    vlan: 1301 vlanpcp: 0 parent interface: lagg0
    media: Ethernet autoselect
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0.1301:
    inet6 fe80::92ec:77ff:fe17:b3f4%lagg0.1301 scopeid 0x12
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff05::1:3 mode exclude
            mcast-macaddr 33:33:00:01:00:03
        group ff02::1:2%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:00:01:00:02
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.6 mode exclude
            mcast-macaddr 01:00:5e:00:00:06
        group 224.0.0.5 mode exclude
            mcast-macaddr 01:00:5e:00:00:05
    inet6 fe80::92ec:77ff:fe17:b3f4%lagg0.1301 scopeid 0x12
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::2%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:00:00:00:02
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.1 mode exclude
            mcast-macaddr 01:00:5e:00:00:01
    inet6 fe80::92ec:77ff:fe17:b3f4%lagg0.1301 scopeid 0x12
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff01::1%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::2:d324:29d7%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:d3:24:29:d7
        group ff02::2:ffd3:2429%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:ff:d3:24:29
        group ff02::1%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::1:ff17:b3f4%lagg0.1301 scopeid 0x12 mode exclude
            mcast-macaddr 33:33:ff:17:b3:f4
14:46:36.857676 IP6 2001:470:e01a:7fff::12 > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32
14:46:37.948117 IP6 2001:470:e01a:7fff::12 > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32
Actions #13

Updated by Kristof Provost almost 2 years ago

It may also be useful to set `net.inet6.icmp6.nd6_debug` to 1 in the system tunables, and then restart the machine. When the issue recurs, the dmesg output may contain a hint about why we failed to join the group.

Actions #14

Updated by Tito Sacchi almost 2 years ago

Opening the interface configuration page and clicking 'Save' and then 'Apply' without changing anything solves the problem for me. Maybe it reconfigures the address and joins the right multicast group.
Do you know if there is a way to automate this with a shell script? I can send POST requests to the pages with cURL but it is dumb AF.

Actions #15

Updated by Chris Linstruth almost 2 years ago

OK -

Tested saving the interface and it did add multicast group:

    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::1:ff00:1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:00:00:01

I set net.inet6.icmp6.nd6_debug=1 and rebooted. I had these multicast groups:

lagg0.1301:
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.6 mode exclude
            mcast-macaddr 01:00:5e:00:00:06
        group 224.0.0.5 mode exclude
            mcast-macaddr 01:00:5e:00:00:05
    inet6 fe80::92ec:77ff:fe46:e997%lagg0.1301 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::2%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:02
        group ff05::1:3 mode exclude
            mcast-macaddr 33:33:00:01:00:03
        group ff02::1:2%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:01:00:02
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.1 mode exclude
            mcast-macaddr 01:00:5e:00:00:01
    inet6 fe80::92ec:77ff:fe46:e997%lagg0.1301 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff01::1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::2:d324:29d7%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:d3:24:29:d7
        group ff02::2:ffd3:2429%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:d3:24:29
        group ff02::1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::1:ff46:e997%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:46:e9:97

Edited/saved the interface again and now have these:

lagg0.1301:
    inet6 fe80::92ec:77ff:fe46:e997%lagg0.1301 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::1:ff00:1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:00:00:01
    inet 172.25.228.1
    igmpv3 rv 2 qi 125 qri 10 uri 3
        group 224.0.0.1 mode exclude
            mcast-macaddr 01:00:5e:00:00:01
        group 224.0.0.6 mode exclude
            mcast-macaddr 01:00:5e:00:00:06
        group 224.0.0.5 mode exclude
            mcast-macaddr 01:00:5e:00:00:05
    inet6 fe80::92ec:77ff:fe46:e997%lagg0.1301 scopeid 0x10
    mldv2 flags=2<USEALLOW> rv 2 qi 125 qri 10 uri 3
        group ff02::2%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:02
        group ff05::1:3 mode exclude
            mcast-macaddr 33:33:00:01:00:03
        group ff02::1:2%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:01:00:02
        group ff01::1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::2:d324:29d7%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:d3:24:29:d7
        group ff02::2:ffd3:2429%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:d3:24:29
        group ff02::1%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:00:00:00:01
        group ff02::1:ff46:e997%lagg0.1301 scopeid 0x10 mode exclude
            mcast-macaddr 33:33:ff:46:e9:97

Reminder this is the NDP being sent by the downstream pfSense router:

15:23:05.776663 IP6 2001:470:e01a:7fff::13 > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32
15:23:05.812173 IP6 2001:470:e01a:7fff::12 > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:470:e01a:7fff::1, length 32
Actions #16

Updated by Tito Sacchi almost 2 years ago

I found a way to automate this process with pfSsh.php:

foreach ($config['interfaces'] as $if => $ifcfg) {
    if(isset($ifcfg['enable'])) {
        printf('Reconfiguring %1$s...%2$s', $if, PHP_EOL);
        interface_reconfigure($if);
        sleep(4);
    }
}
exec
exit

You can run this with pfSsh.php < file_containing_above_code.php.
It doesn't always work, though. It could be related to the following issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233535 .
Just upgraded to pfSense Plus 23.01 DEVEL on my Netgate 7100-1U, I'll see if this happens on 23.01 (FreeBSD 14) too.

Actions #17

Updated by Tito Sacchi almost 2 years ago

It occurs on 23.01 DEVEL too. I kindly ask Netgate to take a look at this issue because it breaks IPv6 almost completely.
My GRE tunnels (over IPsec) join the right multicast groups. The bugged interfaces are all 802.1Q VLANs; I'll check whether this happens on raw Ethernet interfaces too.

Actions #18

Updated by Matt Gaynor over 1 year ago

Also facing this issue, with the same lack of NDP response from pfSense. IPv6 is unusable when using a non-link-local address directly on an Ethernet interface; this is a major bug.

Actions #19

Updated by Jim Pingle over 1 year ago

Matt Gaynor wrote in #note-18:

Also facing this issue, with the same lack of NDP response from pfSense. IPv6 is unusable when using a non-link-local address directly on an Ethernet interface; this is a major bug.

If you can easily reproduce it, look at the suggestion in #note-13 and see if it produces any helpful output in the system log / kernel message buffer.

Actions #20

Updated by Marcos M over 1 year ago

I am running into this issue on 23.05-BETA using vmx. It seems to be similar to this issue upstream given that the comments there suggest multiple drivers are affected: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253469

Running with net.inet6.icmp6.nd6_debug=1 and rebooting shows (though this seems to be unrelated):

nd6_options: unsupported option 24 - option ignored

Actions #21

Updated by Josh Balcom over 1 year ago

I am also experiencing this same issue and I can reliably reproduce it. However, I am not getting any output in dmesg with the icmpv6 debugging enabled that would suggest why it is not automatically joining the ff02::1:ff00:1 multicast group.

Actions #22

Updated by Marcos M about 1 year ago

  • Status changed from New to Waiting on Merge
  • Assignee changed from Jim Pingle to Kristof Provost
  • Target version set to 2.8.0
  • Plus Target Version set to 23.09
Actions #23

Updated by Kristof Provost about 1 year ago

I've pushed the fix upstream in https://cgit.freebsd.org/src/commit/?id=9c9a76dc6873427b14f6c84397dd60ea8e529d8d and a basic test case in https://cgit.freebsd.org/src/commit/?id=b03012d0b600793d7501b4cc56757ec6150ec87f

I intend to cherry-pick that just as soon as I finish the src/ports merge I started on before those fixes went in.

Actions #24

Updated by Kristof Provost about 1 year ago

  • Status changed from Waiting on Merge to Feedback

And that's been cherry-picked to our branches as well. Future snapshot builds will have the fix.

Actions #25

Updated by Jim Pingle about 1 year ago

I upgraded my edge to a dev snap with the fix and so far, so good. Everything across the board is green in my lab for monitoring and in the past at least some of them would fail. It will take some time to know for certain, however, since it seemed to be more problematic after some time up and running.

Actions #26

Updated by Marcos M about 1 year ago

  • Status changed from Feedback to Resolved

I was able to reliably reproduce this before, and can no longer reproduce it with the fix.

Actions #27

Updated by Jim Pingle about 1 year ago

  • Status changed from Resolved to Feedback

Let's wait until we get more real-world testing to call it completely resolved.

Actions #28

Updated by Jim Pingle about 1 year ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100

Seems to be solid here after several days in a row and several interface events. Gateways are still showing green throughout the lab where they would have started failing by now in the past.

Actions #29

Updated by Jim Pingle 11 months ago

  • Target version changed from 2.8.0 to 2.7.1
Actions #30

Updated by Marcos M 11 months ago

  • Related to Bug #13555: When WAN is lost, ipv6 interface will not renew upon WAN availability added