Bug #9577

closed

radvd send_ra_forall failed on interface / can't join ipv6-allrouters

Added by Manuel Piovan over 5 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: IPv6 Router Advertisements (radvd/rtsold)
Target version:
Start date: 06/07/2019
Due date:
% Done: 100%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: 2.5.0
Affected Architecture:

Description

https://forum.netgate.com/topic/142363/ipv6-broken-radvd-can-t-join-ipv6-allrouters-on-interface/19

log is full of

Jun 6 18:57:34     radvd     66014     resuming normal operation
Jun 6 18:57:34     radvd     66014     attempting to reread config file
Jun 6 14:01:49     radvd     65728     version 2.17 started 
Jun 6 14:00:39     radvd     67952     can't join ipv6-allrouters on igb2
Jun 6 14:00:39     radvd     67952     can't join ipv6-allrouters on ath0_wlan0
Jun 6 14:00:33     radvd     67952     can't join ipv6-allrouters on igb2
Jun 6 14:00:22     radvd     67952     can't join ipv6-allrouters on igb2
Jun 6 14:00:20     radvd     67952     can't join ipv6-allrouters on ath0_wlan0
Jun 6 14:00:08     radvd     67952     can't join ipv6-allrouters on igb2
Jun 6 14:00:01     radvd     67952     can't join ipv6-allrouters on ath0_wlan0
Jun 6 13:59:53     radvd     67952     can't join ipv6-allrouters on ath0_wlan0
Jun 6 13:59:53     radvd     67952     can't join ipv6-allrouters on igb2
Jun 6 13:59:40     radvd     67952     can't join ipv6-allrouters on ath0_wlan0
Jun 6 13:59:36     radvd     67952     can't join ipv6-allrouters on igb2
Jun 6 13:59:30     radvd     67952     can't join ipv6-allrouters on igb2
Jun 6 13:59:25     radvd     67952     can't join ipv6-allrouters on ath0_wlan0

more debug output

Jun 7 10:00:25     radvd     55719     polling for 16 second(s), next iface is ath0_wlan0
Jun 7 10:00:25     radvd     55719     igb1 next scheduled RA in 16 second(s)
Jun 7 10:00:25     radvd     55719     send_ra_forall failed on interface igb1
Jun 7 10:00:25     radvd     55719     not sending RA for igb1, interface is not ready
Jun 7 10:00:25     radvd     55719     can't join ipv6-allrouters on igb1
Jun 7 10:00:25     radvd     55719     igb1 address: fe80::a236:9fff:fe85:96f1
Jun 7 10:00:25     radvd     55719     igb1 address: xxxx:xxx:xx:xxx::1
Jun 7 10:00:25     radvd     55719     igb1 linklocal address: fe80::a236:9fff:fe85:96f1
Jun 7 10:00:25     radvd     55719     IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jun 7 10:00:25     radvd     55719     checking ipv6 forwarding of interface not supported
Jun 7 10:00:25     radvd     55719     prefix length for igb1 is 64
Jun 7 10:00:25     radvd     55719     link layer token length for igb1 is 48
Jun 7 10:00:25     radvd     55719     mtu for igb1 is 1500
Jun 7 10:00:25     radvd     55719     igb1 supports multicast or is point-to-point
Jun 7 10:00:25     radvd     55719     igb1 is running
Jun 7 10:00:25     radvd     55719     igb1 is up
Jun 7 10:00:25     radvd     55719     ioctl(SIOCGIFFLAGS) succeeded on igb1
Jun 7 10:00:25     radvd     55719     timer_handler called for igb1
Jun 7 10:00:25     radvd     55719     polling for 0 second(s), next iface is igb1
Jun 7 10:00:25     radvd     55719     igb2 next scheduled RA in 16 second(s)
Jun 7 10:00:25     radvd     55719     send_ra_forall failed on interface igb2
Jun 7 10:00:25     radvd     55719     not sending RA for igb2, interface is not ready
Jun 7 10:00:25     radvd     55719     can't join ipv6-allrouters on igb2
Jun 7 10:00:25     radvd     55719     igb2 address: fe80::a236:9fff:fe85:96f2
Jun 7 10:00:25     radvd     55719     igb2 address: xxxx:xxx:xxx:xxxx::1
Jun 7 10:00:25     radvd     55719     igb2 linklocal address: fe80::a236:9fff:fe85:96f2
Jun 7 10:00:25     radvd     55719     IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jun 7 10:00:25     radvd     55719     checking ipv6 forwarding of interface not supported
Jun 7 10:00:25     radvd     55719     prefix length for igb2 is 64
Jun 7 10:00:25     radvd     55719     link layer token length for igb2 is 48

Files

Example sequence.docx (19.5 KB): Log snippet of calls supporting the RADVD process. Ronald Schellberg, 10/10/2019 02:12 PM
radvd-2.18_5-v2.5test.txz (51.1 KB): RADVD port compiled for 2.5. For test purposes only. Ronald Schellberg, 02/12/2020 01:57 PM
radvd-2.18_5.txz (50.3 KB): Patched RADVD port. Ronald Schellberg, 06/19/2020 08:18 AM

Actions #1

Updated by Manuel Piovan over 5 years ago

The IPv6 gateway disappears from connected clients and IPv6 stops working. I need to restart radvd to make it work again for a while.

Actions #2

Updated by Greg M over 5 years ago

Now I have this as well:

Jun 29 07:17:29 radvd 62926 can't join ipv6-allrouters on hn0.10
Jun 29 07:15:22 radvd 62926 can't join ipv6-allrouters on hn0.10
Jun 29 07:15:00 radvd 62926 can't join ipv6-allrouters on hn0.9
Jun 29 07:13:07 radvd 62926 can't join ipv6-allrouters on hn0.7
Jun 29 07:12:47 radvd 62926 can't join ipv6-allrouters on hn0.10
Jun 29 07:11:25 radvd 62926 can't join ipv6-allrouters on hn0.8
Jun 29 07:11:23 radvd 62926 can't join ipv6-allrouters on hn0.9
Jun 29 07:10:22 radvd 62926 can't join ipv6-allrouters on hn0.10
Jun 29 07:08:10 radvd 62926 can't join ipv6-allrouters on hn0.10

Actions #3

Updated by Greg M over 5 years ago

Now I don't have the above any more, but I have this (everything is working just fine):

Jul 22 14:44:54 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:43:25 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:41:56 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:41:20 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:40:03 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:39:37 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:37:53 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:37:32 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:36:42 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:34:32 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:31:44 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:30:26 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:29:31 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:29:21 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway

Actions #4

Updated by Manuel Piovan over 5 years ago

Greg M wrote:

Now I don't have the above any more, but I have this (everything is working just fine):

IPv6 forwarding on interface seems to be disabled, but continuing anyway

confirming this, same here
radvd is now 2.18

Actions #5

Updated by Greg M over 5 years ago

Hi!

Can someone PLEASE take a look at this one.

Thanks!

Actions #6

Updated by Ronald Schellberg about 5 years ago

There are multiple issues, some easily solved. The "disabled" logging message can be deleted, as it is just an indication that for FreeBSD the feature is stubbed out. I can submit a RADVD patch file for interface.c to delete 5 lines.

I have been bashing away at this for several weeks now and need some advice from Netgate on whether to continue with the 2.5 version or to focus more on the changes that have been made to stable/12.

I have tried incorporating some of stable/12, and the issue still exists, but to a lesser extent. Having seen that stable/12 doesn't solve the problem, I have switched back to 2.5.

What I have found is an issue with FreeBSD's in6p_leave_group: on every other call, it finds and removes the desired group. The subsequent call to in6p_join_group reinserts the group, but not correctly: the pointer to ifp should be stored in the last entry of the list but is NULL. On the next leave/join cycle in RADVD, in6p_leave_group fails to find the entry (the entry is NULL at this point) and, having found none, exits. The subsequent call to in6p_join_group also does not find the entry, so the list is extended and the entry is added correctly. This repeats until the "radvd can't join ipv6-allrouters" condition occurs (somewhere between 1000 and 2000 leave/join cycles, or about 24 hours for me). It would be nice if the leave/join implementation in RADVD were not necessary.

I attached an annotated Word document showing 4 RADVD leave/join cycles, with numerous added log messages that detail the above sequence.
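
The slow leak described above can be pictured with a toy model. This is only a sketch of the reported sequence, not the FreeBSD code: the names (membership_model, model_leave, model_join, cycles_until_failure) and the table capacity are hypothetical, and the real kernel limit and data structures differ. It assumes that a leave which finds the group is followed by a join that stores an unusable (stale) entry, and that a leave which finds nothing is followed by a join that appends a good entry.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_SLOTS 1024  /* upper bound for this toy model only */

enum slot_state { SLOT_VALID, SLOT_STALE };

struct membership_model {
    enum slot_state slots[MODEL_MAX_SLOTS];
    size_t used;      /* slots currently occupied */
    size_t capacity;  /* hypothetical table limit, <= MODEL_MAX_SLOTS */
};

/* Leave: remove the first valid entry, if any. Stale entries never match. */
static bool model_leave(struct membership_model *m)
{
    for (size_t i = 0; i < m->used; i++) {
        if (m->slots[i] == SLOT_VALID) {
            for (size_t j = i; j + 1 < m->used; j++)
                m->slots[j] = m->slots[j + 1];
            m->used--;
            return true;
        }
    }
    return false;
}

/*
 * Join: append an entry. Following the report, the entry is stale when the
 * preceding leave removed something, and valid otherwise. Fails when full.
 */
static bool model_join(struct membership_model *m, bool leave_removed_entry)
{
    if (m->used == m->capacity)
        return false;  /* the "can't join" condition in this model */
    m->slots[m->used++] = leave_removed_entry ? SLOT_STALE : SLOT_VALID;
    return true;
}

/* Number of leave/join cycles completed before a join first fails. */
static unsigned long cycles_until_failure(size_t capacity)
{
    static struct membership_model m;
    unsigned long cycles = 0;

    if (capacity == 0 || capacity > MODEL_MAX_SLOTS)
        return 0;
    m.used = 0;
    m.capacity = capacity;
    model_join(&m, false);  /* initial join at startup */

    for (;;) {
        bool removed = model_leave(&m);
        if (!model_join(&m, removed))
            return cycles;
        cycles++;
    }
}
```

In this model the table gains one stale entry every two cycles, so a table of N slots completes 2N - 1 cycles and the join fails during cycle 2N. That matches the shape of the symptom (works for a long time, then fails for good), though the actual cycle count on a real system depends on kernel details the model leaves out.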

I can continue to bash away at this on 2.5 but if the changes in stable/12 are going to get incorporated soon before 2.5 is released, my time may be better spent testing and fixing it.

Actions #7

Updated by Jim Pingle about 5 years ago

2.5 will be moving to a 12.1 or stable/12 base, but that choice has not yet been made. It definitely will not stay on 12.0, though.

Even if 12.1 is selected, if specific changes to stable/12 after 12.1-RELEASE are beneficial, we can pick those back if needed.

Actions #8

Updated by Ronald Schellberg about 5 years ago

After several failed attempts at creating a 12.1 version, the process that worked was to create a new branch from pfSense/releng/12.1 then cherry-picking commits from the 2.5 branch since mid-February. I also applied my 6RD patch to this branch as I need the stf changes to get ipv6 working for me.

That patch caused a kernel panic and a reboot on my bare metal firewall, which was impossible to capture on the VGA console. So I switched tactics and created a Hyper-V VM instance on my build machine, which has two hardware network interfaces, but I needed an ISO with a serial console to capture the console output. That meant multiple rebuilds over the last 20 days. As of last night I finally have a version that successfully installs and boots.

With similar logging added to sys/netinet6/in6_mcast.c, I can confirm that releng/12.1 appears not to have the same issues that 2.5 has, since 12.1 rewrote the internals of in6_mcast.c. RADVD has been running for about 5 hours now and I expect it to continue like the 2.4 branch. I can confirm tomorrow, as it would stop working for me after about 24 hours.

I would like to try removing the IPV6_LEAVE_GROUP call from the bsd44.c patch of RADVD to see if that is still necessary, but want to make sure this version is stable first.

Actions #9

Updated by Ronald Schellberg about 5 years ago

Ronald Schellberg wrote:

I can confirm tomorrow, as it would stop working for me after about 24 hours.

I would like to try removing the IPV6_LEAVE_GROUP call from the bsd44.c patch of RADVD to see if that is still necessary, but want to make sure this version is stable first.

Rebuilt a clean version (without logging and debug) and that has been running on the VM for almost 2 days. Now installed it on my bare metal main router.

On a side note, why has this issue been dropped from the 2.5 issue list?

Actions #10

Updated by Jim Pingle about 5 years ago

  • Target version set to 2.5.0

Ronald Schellberg wrote:

On a side note, why has this issue been dropped from the 2.5 issue list?

It was never assigned a target version, so it was never on that list, so it couldn't be "dropped" from the list.

I've added it now. It definitely needs to be addressed before release, but from the looks of the other info here and in the forum thread, it may solve itself once we move the base to 12.1.

The workaround from the forum thread isn't pretty, but it does work. Add a cron job for:

0    *    *    *    *    root    /usr/bin/killall radvd && /bin/sleep 5 && /usr/local/sbin/radvd -p /var/run/radvd.pid -C /var/etc/radvd.conf -m syslog

I haven't tested it, but this would probably also work:

/usr/local/sbin/pfSsh.php playback svc stop radvd && /bin/sleep 5 && /usr/local/sbin/pfSsh.php playback svc start radvd

Actions #11

Updated by Ronald Schellberg almost 5 years ago

After shifting from RELENG 12.1 to stable/12, I noticed that the commit labeled MFC r355881 on 12/25/19 again triggered the "can't join ipv6-allrouters" problem in RADVD. Reverting the commit resolved the issue again. The problem is that the RADVD implementation in pfSense performs an IPV6_LEAVE_GROUP/IPV6_JOIN_GROUP sequence every 10 to 15 seconds (the timing is randomly selected). These two calls do not work well together, causing the multicast tables to slowly fill up until the "can't join" error occurs, typically 24+ hours later, at which point RADVD begins to fail.

I spent a while trying to chase this down, until a comment by JimP on another issue got me thinking there could well be another solution to the problem. While it doesn't fix the FreeBSD problem described above, it may well be an acceptable solution, and it should work on any version of FreeBSD. I modified the patch-device-bsd44.c file in the RADVD port by inserting a check to see whether the socket has already joined a group; if so, it returns without leaving and rejoining the group. As far as I can tell, RADVD doesn't contemplate ever leaving a group.

See the code patch below:

-int setup_allrouters_membership(int sock, struct Interface *iface) { return 0; }
+#define MAX_IFACE 10
+
+/*
+ * Join the all-routers multicast group (ff02::2) on iface, once per socket.
+ * Sockets that have already joined are remembered, so repeated calls return
+ * early instead of leaving and rejoining the group.
+ */
+int setup_allrouters_membership(int sock, struct Interface *iface)
+{
+    static int socket_count = 0;
+    static int msockets[MAX_IFACE] = {0};
+    int i;
+    struct ipv6_mreq mreq;
+
+    /* Already joined on this socket: nothing to do. */
+    for (i = 0; i < socket_count; i++) {
+        if (msockets[i] == sock) {
+            return 0;
+        }
+    }
+    /* Remember this socket while there is room in the table. */
+    if (socket_count < MAX_IFACE) {
+        msockets[socket_count] = sock;
+        socket_count++;
+    }
+
+    memset(&mreq, 0, sizeof(mreq));
+    mreq.ipv6mr_interface = iface->props.if_index;
+
+    /* all-routers multicast address */
+    if (inet_pton(AF_INET6, "ff02::2",
+            &mreq.ipv6mr_multiaddr.s6_addr) != 1) {
+        flog(LOG_ERR, "inet_pton failed");
+        return (-1);
+    }
+
+    if (setsockopt(sock, IPPROTO_IPV6, IPV6_JOIN_GROUP,
+            &mreq, sizeof(mreq)) < 0) {
+        flog(LOG_ERR, "can't join ipv6-allrouters on %s", iface->props.name);
+        return (-1);
+    }
+
+    return 0;
+}

I just used a simple array to track up to 10 interfaces. If more are needed, it could be changed to a linked list or simply expanded; I was not sure how many sockets might get created. Looking at the responses above, 10 may not be sufficient.
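
If the fixed limit turns out to be too small, one alternative to a linked list is a growable array. The following is a minimal sketch under that assumption; joined_set and its helper functions are hypothetical names for illustration and are not part of RADVD or the submitted patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable record of sockets that have already joined. */
struct joined_set {
    int    *socks;     /* heap array of socket descriptors */
    size_t  count;     /* entries in use */
    size_t  capacity;  /* entries allocated */
};

/* True if sock was recorded earlier. */
static bool joined_set_contains(const struct joined_set *set, int sock)
{
    for (size_t i = 0; i < set->count; i++) {
        if (set->socks[i] == sock)
            return true;
    }
    return false;
}

/*
 * Record sock. Returns 1 if it was newly added (the caller should now
 * issue IPV6_JOIN_GROUP), 0 if it was already present, and -1 if memory
 * could not be allocated.
 */
static int joined_set_add(struct joined_set *set, int sock)
{
    if (joined_set_contains(set, sock))
        return 0;

    if (set->count == set->capacity) {
        size_t new_cap = set->capacity ? set->capacity * 2 : 8;
        int *grown = realloc(set->socks, new_cap * sizeof(*grown));
        if (grown == NULL)
            return -1;
        set->socks = grown;
        set->capacity = new_cap;
    }
    set->socks[set->count++] = sock;
    return 1;
}

/* Release the array and reset the set to empty. */
static void joined_set_free(struct joined_set *set)
{
    free(set->socks);
    set->socks = NULL;
    set->count = 0;
    set->capacity = 0;
}
```

A caller would start from a zeroed struct and only attempt the join when joined_set_add returns 1. A variant worth considering is to record the socket only after setsockopt succeeds, so that a failed join is retried on the next call rather than skipped.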

This appears to work, my test VM has been running for more than 24 hours and my IPv6 is still 10/10 on test-ipv6.com.

Actions #12

Updated by Ronald Schellberg almost 5 years ago

Attached is a compiled RADVD for 2.5 with the above patch (slightly modified) incorporated. Added a logging message when a socket is added to the msocket array and additional information was added to the "can't join" message if the IPV6_JOIN_GROUP call fails. Bumped the dimension of the msocket array to 50 for good measure.

The radvd-2.18_5-v2.5test.txz file is attached.

Actions #13

Updated by Jim Pingle almost 5 years ago

Is there a pull request on Github for this? I don't see one. If there is not, can you submit that source change as a pull request on Github?

https://docs.netgate.com/pfsense/en/latest/development/submitting-a-pull-request-via-github.html

Actions #14

Updated by Ronald Schellberg almost 5 years ago

There is not one yet; I was waiting for some confirmation from others. I'll submit one later tonight.

Actions #15

Updated by Ronald Schellberg almost 5 years ago

Pull Request # 773 submitted

Actions #16

Updated by Ronald Schellberg almost 5 years ago

Ronald Schellberg wrote:

The radvd-2.18_5-v2.5test.txz file is attached.

My bare metal router running my version of 2.5 has been up 14 days now and still 10/10 on test-ipv6.com

My VM version on stable 12 also showing similar results however it tends to be rebuilt more often to incorporate the latest commits.

Actions #17

Updated by Michael Smith almost 5 years ago

Ronald Schellberg wrote:

Pull Request # 773 submitted

Can you add a link to the PR?

Actions #19

Updated by Ronald Schellberg over 4 years ago

Ronald Schellberg wrote:

Attached is a compiled RADVD for 2.5 with the above patch (slightly modified) incorporated. Added a logging message when a socket is added to the msocket array and additional information was added to the "can't join" message if the IPV6_JOIN_GROUP call fails. Bumped the dimension of the msocket array to 50 for good measure.

The radvd-2.18_5-v2.5test.txz file is attached.

Attached is an updated RADVD compiled against the current 2.5 stable-12 branch.

For those experiencing IPv6 failures after 24 or so hours due to RADVD, consider:
  1. uploading this file to the router's /tmp directory
  2. issuing a "pkg install -y /tmp/radvd-2.18_5.txz" command
  3. rebooting

Confirm that messages like the ones below appear in your routing log, to make sure the new version is applied:
Jun 18 21:13:49 radvd 32115 version 2.18 started
Jun 18 21:13:49 radvd 32327 adding ipv6-allrouters on hn1, sock: 4, iface->props.if_index:6

The patch should resolve the issue until PR #773 gets incorporated.

I have had installs run for more than 35 days using this patch, only to be stopped for other 2.5 updates.

Actions #20

Updated by Michael Geiger over 4 years ago

The patch should resolve the issue until PR #773 gets incorporated.

I have had installs run for more than 35 days using this patch, only to be stopped for other 2.5 updates.

Thanks a lot for your contribution. I installed your patched radvd and will test it also.

Do we know what currently prevents your patch from being merged?

Actions #21

Updated by Louis B over 4 years ago

Hi,

I installed the patch and a lot of messages were gone. What was in the log after reboot is:
Jul 6 12:31:08 pfSense radvd23095: adding ipv6-allrouters on lagg0.88, sock: 4, iface->props.if_index:19
Jul 6 12:31:08 pfSense radvd22990: version 2.18 started

I must say, perhaps that is OK, but I do not understand the first line at all. I have many VLANs, so it is strange to me that one of them (vlan88) is mentioned here.

With my config, apart from the messages, I think all IPv6 was and is working correctly.
Note that I did not perform any testing apart from some IPv6 pings.

Louis

Actions #22

Updated by Luiz Souza over 4 years ago

  • Status changed from New to Feedback
  • Assignee set to Luiz Souza
  • % Done changed from 0 to 100

Fixed in FreeBSD, the port workaround is unnecessary now.

Thanks for all the details Ronald.

Actions #23

Updated by Ronald Schellberg over 4 years ago

I don't know whether anyone has noticed, but the build system has stopped posting snaps since 7/9 00:50, which makes it more difficult to provide feedback on this and other recent changes. :-)

I confirmed that the 7/9 00:50 version begins to fail after 28:30 hours, so I reverted and rebased my local build this weekend. I can confirm that both my VM and bare metal installation go beyond that point and are continuing without error. Since I am not monitoring or logging the call sequence, my only concern is whether this is a full fix or just pushes the issue down the road a bit. Time will tell.

One additional change to FreeBSD-src that would make the #2878 Leave_group call unnecessary would be to eliminate the error return on duplicate join_group calls. I am not sure what in the design spec makes rejecting a duplicate necessary. I haven't tested it, but might.

Actions #24

Updated by Ronald Schellberg over 4 years ago

"One additional change FreeBSD-src that would make the #2878 Leave_group call unnecessary would be to eliminate the error return on duplicate join_group calls. Not sure what is in the design spec makes rejecting a duplicate necessary. I haven't tested it, but might."

As a test, I removed the call to Leave_group from RADVD and removed the single line at 2038 from in6_mcast.c in addition to the applied fix to FreeBSD.

error = EINVAL;

IPV6 continued to perform correctly. This might be a more enduring solution.

Actions #25

Updated by Ronald Schellberg over 4 years ago

Luiz Souza wrote:

Fixed in FreeBSD, the port workaround is unnecessary now.

Thanks for all the details Ronald.

The snap built on Tue Jul 14 13:03:46 EDT 2020 has been running for 60+ hours now, so your commit appears to solve the issue.

Actions #26

Updated by Jim Pingle over 4 years ago

  • Status changed from Feedback to Resolved

Actions #27

Updated by Lars Veldcholte over 1 year ago

This problem returned for me after updating to pfSense 2.6.0.

Immediately after starting, radvd begins spamming "can't join ipv6-allrouters on $interface" in the logs, and router advertisements are not working.

Mar 26 12:45:14     radvd     50012     attempting to reread config file
Mar 26 12:45:14     radvd     50012     warning: AdvRDNSSLifetime <= 2*MaxRtrAdvInterval would allow stale DNS servers to be deleted faster
Mar 26 12:45:14     radvd     50012     warning: (/var/etc/radvd.conf:22) AdvRDNSSLifetime <= 2*MaxRtrAdvInterval would allow stale DNS servers to be deleted faster
Mar 26 12:45:14     radvd     50012     warning: AdvDNSSLLifetime <= 2*MaxRtrAdvInterval would allow stale DNS suffixes to be deleted faster
Mar 26 12:45:14     radvd     50012     can't join ipv6-allrouters on vtnet3
Mar 26 12:45:14     radvd     50012     can't join ipv6-allrouters on vtnet4
Mar 26 12:45:14     radvd     50012     can't join ipv6-allrouters on vtnet2
Mar 26 12:45:14     radvd     50012     can't join ipv6-allrouters on vtnet0
Mar 26 12:45:14     radvd     50012     resuming normal operation
Mar 26 12:45:14     radvd     50012     can't join ipv6-allrouters on vtnet3
Mar 26 12:45:14     radvd     50012     can't join ipv6-allrouters on vtnet4
Mar 26 12:45:14     radvd     50012     can't join ipv6-allrouters on vtnet2
Mar 26 12:45:14     radvd     50012     can't join ipv6-allrouters on vtnet0

Actions #28

Updated by Lars Veldcholte over 1 year ago

Can this issue be reopened since it has reappeared in 2.6.0?

FWIW, I saw the same issue appear in OPNsense, where it has since been fixed: https://forum.opnsense.org/index.php?topic=33148.msg160337
