Bug #9577
Closed
radvd send_ra_forall failed on interface / can't join ipv6-allrouters
100%
Description
https://forum.netgate.com/topic/142363/ipv6-broken-radvd-can-t-join-ipv6-allrouters-on-interface/19
log is full of
Jun 6 18:57:34 radvd 66014 resuming normal operation
Jun 6 18:57:34 radvd 66014 attempting to reread config file
Jun 6 14:01:49 radvd 65728 version 2.17 started
Jun 6 14:00:39 radvd 67952 can't join ipv6-allrouters on igb2
Jun 6 14:00:39 radvd 67952 can't join ipv6-allrouters on ath0_wlan0
Jun 6 14:00:33 radvd 67952 can't join ipv6-allrouters on igb2
Jun 6 14:00:22 radvd 67952 can't join ipv6-allrouters on igb2
Jun 6 14:00:20 radvd 67952 can't join ipv6-allrouters on ath0_wlan0
Jun 6 14:00:08 radvd 67952 can't join ipv6-allrouters on igb2
Jun 6 14:00:01 radvd 67952 can't join ipv6-allrouters on ath0_wlan0
Jun 6 13:59:53 radvd 67952 can't join ipv6-allrouters on ath0_wlan0
Jun 6 13:59:53 radvd 67952 can't join ipv6-allrouters on igb2
Jun 6 13:59:40 radvd 67952 can't join ipv6-allrouters on ath0_wlan0
Jun 6 13:59:36 radvd 67952 can't join ipv6-allrouters on igb2
Jun 6 13:59:30 radvd 67952 can't join ipv6-allrouters on igb2
Jun 6 13:59:25 radvd 67952 can't join ipv6-allrouters on ath0_wlan0
more debug output
Jun 7 10:00:25 radvd 55719 polling for 16 second(s), next iface is ath0_wlan0
Jun 7 10:00:25 radvd 55719 igb1 next scheduled RA in 16 second(s)
Jun 7 10:00:25 radvd 55719 send_ra_forall failed on interface igb1
Jun 7 10:00:25 radvd 55719 not sending RA for igb1, interface is not ready
Jun 7 10:00:25 radvd 55719 can't join ipv6-allrouters on igb1
Jun 7 10:00:25 radvd 55719 igb1 address: fe80::a236:9fff:fe85:96f1
Jun 7 10:00:25 radvd 55719 igb1 address: xxxx:xxx:xx:xxx::1
Jun 7 10:00:25 radvd 55719 igb1 linklocal address: fe80::a236:9fff:fe85:96f1
Jun 7 10:00:25 radvd 55719 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jun 7 10:00:25 radvd 55719 checking ipv6 forwarding of interface not supported
Jun 7 10:00:25 radvd 55719 prefix length for igb1 is 64
Jun 7 10:00:25 radvd 55719 link layer token length for igb1 is 48
Jun 7 10:00:25 radvd 55719 mtu for igb1 is 1500
Jun 7 10:00:25 radvd 55719 igb1 supports multicast or is point-to-point
Jun 7 10:00:25 radvd 55719 igb1 is running
Jun 7 10:00:25 radvd 55719 igb1 is up
Jun 7 10:00:25 radvd 55719 ioctl(SIOCGIFFLAGS) succeeded on igb1
Jun 7 10:00:25 radvd 55719 timer_handler called for igb1
Jun 7 10:00:25 radvd 55719 polling for 0 second(s), next iface is igb1
Jun 7 10:00:25 radvd 55719 igb2 next scheduled RA in 16 second(s)
Jun 7 10:00:25 radvd 55719 send_ra_forall failed on interface igb2
Jun 7 10:00:25 radvd 55719 not sending RA for igb2, interface is not ready
Jun 7 10:00:25 radvd 55719 can't join ipv6-allrouters on igb2
Jun 7 10:00:25 radvd 55719 igb2 address: fe80::a236:9fff:fe85:96f2
Jun 7 10:00:25 radvd 55719 igb2 address: xxxx:xxx:xxx:xxxx::1
Jun 7 10:00:25 radvd 55719 igb2 linklocal address: fe80::a236:9fff:fe85:96f2
Jun 7 10:00:25 radvd 55719 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jun 7 10:00:25 radvd 55719 checking ipv6 forwarding of interface not supported
Jun 7 10:00:25 radvd 55719 prefix length for igb2 is 64
Jun 7 10:00:25 radvd 55719 link layer token length for igb2 is 48
Updated by Manuel Piovan over 5 years ago
The IPv6 gateway disappears from connected clients and IPv6 stops working. I need to restart radvd to make it work again for a while.
Updated by Greg M over 5 years ago
Now I have this as well:
Jun 29 07:17:29 radvd 62926 can't join ipv6-allrouters on hn0.10
Jun 29 07:15:22 radvd 62926 can't join ipv6-allrouters on hn0.10
Jun 29 07:15:00 radvd 62926 can't join ipv6-allrouters on hn0.9
Jun 29 07:13:07 radvd 62926 can't join ipv6-allrouters on hn0.7
Jun 29 07:12:47 radvd 62926 can't join ipv6-allrouters on hn0.10
Jun 29 07:11:25 radvd 62926 can't join ipv6-allrouters on hn0.8
Jun 29 07:11:23 radvd 62926 can't join ipv6-allrouters on hn0.9
Jun 29 07:10:22 radvd 62926 can't join ipv6-allrouters on hn0.10
Jun 29 07:08:10 radvd 62926 can't join ipv6-allrouters on hn0.10
Updated by Greg M over 5 years ago
Now I don't have the above any more, but I have this (everything is working just fine, though):
Jul 22 14:44:54 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:43:25 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:41:56 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:41:20 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:40:03 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:39:37 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:37:53 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:37:32 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:36:42 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:34:32 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:31:44 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:30:26 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:29:31 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Jul 22 14:29:21 radvd 40666 IPv6 forwarding on interface seems to be disabled, but continuing anyway
Updated by Manuel Piovan over 5 years ago
Greg M wrote:
Now I don't have the above any more, but I have this (everything is working just fine, though):
IPv6 forwarding on interface seems to be disabled, but continuing anyway
confirming this, same here
radvd is now 2.18
Updated by Greg M about 5 years ago
Hi!
Can someone PLEASE take a look at this one.
Thanks!
Updated by Ronald Schellberg about 5 years ago
- File Example sequence.docx Example sequence.docx added
There are multiple issues, some easily solved. The "disabled" logging message can be deleted, as it is just an indication that for FreeBSD the feature is stubbed out. I can submit a RADVD patch file for interface.c to delete 5 lines.
I have been bashing away at this for several weeks now and need some advice from Netgate on whether to continue with the 2.5 version or to focus more on changes that have been made to stable/12.
I have tried incorporating some of stable/12 and the issue still exists, but to a lesser extent. Having seen that stable/12 doesn't solve the problem, I have switched back to 2.5.
What I have found is an issue with FreeBSD's in6p_leave_group: on every other call, it finds and removes the desired group. The subsequent call to in6p_join_group reinserts the group, but not correctly: the pointer to ifp should be stored in the last entry of the list but is NULL. On the next leave/join cycle in RADVD, in6p_leave_group fails to find the entry (because the entry is NULL at this point) and exits without removing anything. The subsequent call to in6p_join_group also does not find the entry, so the list is extended and the entry is added correctly. This repeats until the "radvd can't join ipv6-allrouters" condition occurs (somewhere between 1000 and 2000 leave/join cycles, or about 24 hours for me). It would be nice if the leave/join implementation in RADVD were not necessary.
I attached an annotated Word document showing 4 RADVD leave/join cycles, with numerous added log messages that detail the above sequence.
I can continue to bash away at this on 2.5, but if the changes in stable/12 are going to be incorporated before 2.5 is released, my time may be better spent testing and fixing that.
Updated by Jim Pingle about 5 years ago
2.5 will be moving to a 12.1 or stable/12 base, but that choice has not yet been made. It definitely will not stay on 12.0, though.
Even if 12.1 is selected, if specific changes to stable/12 after 12.1-RELEASE are beneficial, we can pick those back if needed.
Updated by Ronald Schellberg about 5 years ago
After several failed attempts at creating a 12.1 version, the process that worked was to create a new branch from pfSense/releng/12.1 then cherry-picking commits from the 2.5 branch since mid-February. I also applied my 6RD patch to this branch as I need the stf changes to get ipv6 working for me.
That patch caused a kernel panic and a reboot on my bare metal firewall, which was impossible to capture on the VGA console. So I switched tactics and created a Hyper-V VM instance on my build machine, which has two hardware network interfaces, but I needed an ISO with a serial console to capture the console spew. That meant multiple rebuilds over the last 20 days. As of last night I finally have a version that successfully installs and boots.
With similar logging added to sys/netinet6/in6_mcast.c, I can confirm that releng/12.1 appears not to have the same issues that 2.5 has, since 12.1 rewrote the internals of in6_mcast.c. RADVD has been running for about 5 hours now and I expect it to continue like the 2.4 branch. I can confirm tomorrow, as it would stop working for me after about 24 hours.
I would like to try removing the IPV6_LEAVE_GROUP call from the bsd44.c patch of RADVD to see if that is still necessary, but want to make sure this version is stable first.
Updated by Ronald Schellberg about 5 years ago
Ronald Schellberg wrote:
I can confirm tomorrow, as it would stop working for me after about 24 hours.
I would like to try removing the IPV6_LEAVE_GROUP call from the bsd44.c patch of RADVD to see if that is still necessary, but want to make sure this version is stable first.
Rebuilt a clean version (without logging and debug) and that has been running on the VM for almost 2 days. Now installed it on my bare metal main router.
On a side note, why has this issue been dropped from the 2.5 issue list????
Updated by Jim Pingle about 5 years ago
- Target version set to 2.5.0
Ronald Schellberg wrote:
On a side note, why has issue dropped from the 2.5 issue list????
It was never assigned a target version, so it was never on that list, so it couldn't be "dropped" from the list.
I've added it now. It definitely needs to be addressed before release, but from the looks of the other info here and in the forum thread, it may solve itself once we move the base to 12.1.
The workaround from the forum thread isn't pretty, but it does work. Add a cron job for:
0 * * * * root /usr/bin/killall radvd && /bin/sleep 5 && /usr/local/sbin/radvd -p /var/run/radvd.pid -C /var/etc/radvd.conf -m syslog
I haven't tested it, but this would probably also work:
/usr/local/sbin/pfSsh.php playback svc stop radvd && /bin/sleep 5 && /usr/local/sbin/pfSsh.php playback svc start radvd
Updated by Ronald Schellberg almost 5 years ago
After shifting from RELENG 12.1 to Stable/12, I noticed that the commit labeled MFC r355881 on 12/25/19 again triggered the "can't join ipv6-allrouters" problem in RADVD. Reverting the commit resolved the issue again. The problem is that the RADVD implementation in pfSense performs an IPV6_LEAVE_GROUP/IPV6_JOIN_GROUP sequence every 10 to 15 seconds (the timing is randomly selected). These two calls do not work well together, causing the multicast tables to slowly fill up until the "can't join" error occurs, typically 24+ hours later, at which point RADVD begins to fail.
I spent a while trying to chase this down before a comment by JimP on another issue got me thinking there could be another solution to the problem. While it doesn't fix the FreeBSD problem described above, it may well be an acceptable solution, and it should work on any version of FreeBSD. I modified the patch-device-bsd44.c file in the RADVD port by inserting a check to see whether the socket has already joined a group; if so, the function returns without leaving and rejoining the group. As far as I can tell, RADVD doesn't contemplate ever leaving a group.
See the code patch below:
@-int setup_allrouters_membership(int sock, struct Interface *iface) { return 0; }
+#define MAX_IFACE 10
+int setup_allrouters_membership(int sock, struct Interface *iface)
+{
+    static int socket_count = 0;
+    static int msockets[MAX_IFACE] = {};
+    int i;
+    struct ipv6_mreq mreq;
+
+    for (i=0;i<socket_count;i++) {
+        if (msockets[i] == sock) {
+            return 0;
+        }
+    }
+    if (socket_count < MAX_IFACE-1) {
+        msockets[socket_count] = sock;
+        socket_count++;
+    }
+
+    memset(&mreq, 0, sizeof(mreq));
+    mreq.ipv6mr_interface = iface->props.if_index;
+
+    /* all-routers multicast address */
+    if (inet_pton(AF_INET6, "ff02::2",
+            &mreq.ipv6mr_multiaddr.s6_addr) != 1) {
+        flog(LOG_ERR, "inet_pton failed");
+        return (-1);
+    }
+
+    if (setsockopt(sock, IPPROTO_IPV6, IPV6_JOIN_GROUP,
+            &mreq, sizeof(mreq)) < 0) {
+        flog(LOG_ERR, "can't join ipv6-allrouters on %s", iface->props.name);
+        return (-1);
+    }
+
+    return 0;
+}
@
I just used a simple array to track up to 10 interfaces; if more are needed, it could be changed to a linked list or simply expanded. I was not sure how many might get created. Looking at the responses above, 10 may not be sufficient.
This appears to work, my test VM has been running for more than 24 hours and my IPv6 is still 10/10 on test-ipv6.com.
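The "already joined" check in the patch above can be sketched on its own, outside of RADVD, as a small fixed-size set. This is only an illustrative sketch, not the code from the patch or the later pull request: the names joined_set, joined_set_contains, joined_set_add and JOINED_MAX are hypothetical, and, unlike the patch (which compares only the socket descriptor), the sketch keys on the socket and the interface index together, which is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity; the patch above used 10, a later build used 50. */
#define JOINED_MAX 50

/* One remembered join: which socket joined on which interface. */
struct joined_entry {
    int          sock;
    unsigned int if_index;
};

/* Fixed-size record of joins that have already been issued. */
struct joined_set {
    struct joined_entry entries[JOINED_MAX];
    size_t              count;
};

/* Returns true if this (sock, if_index) pair was recorded earlier. */
static bool joined_set_contains(const struct joined_set *set,
                                int sock, unsigned int if_index)
{
    assert(set != NULL);
    for (size_t i = 0; i < set->count; i++) {
        if (set->entries[i].sock == sock &&
            set->entries[i].if_index == if_index)
            return true;
    }
    return false;
}

/*
 * Records the pair. Returns true if it is now (or was already) recorded,
 * and false when the set is full, so the caller can log that case
 * instead of silently re-joining on every cycle.
 */
static bool joined_set_add(struct joined_set *set,
                           int sock, unsigned int if_index)
{
    assert(set != NULL);
    if (joined_set_contains(set, sock, if_index))
        return true;
    if (set->count >= JOINED_MAX)
        return false;
    set->entries[set->count].sock = sock;
    set->entries[set->count].if_index = if_index;
    set->count++;
    return true;
}
```

A caller would check joined_set_contains() first, issue the IPV6_JOIN_GROUP setsockopt() only when it returns false, and call joined_set_add() once the join has succeeded, so that a failed join is retried on the next cycle.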
Updated by Ronald Schellberg almost 5 years ago
Attached is a compiled RADVD for 2.5 with the above patch (slightly modified) incorporated. Added a logging message when a socket is added to the msocket array and additional information was added to the "can't join" message if the IPV6_JOIN_GROUP call fails. Bumped the dimension of the msocket array to 50 for good measure.
The ravdv-2.18_5-v2.5test.txz file is attached.
Updated by Jim Pingle almost 5 years ago
Is there a pull request on Github for this? I don't see one. If there is not, can you submit that source change as a pull request on Github?
https://docs.netgate.com/pfsense/en/latest/development/submitting-a-pull-request-via-github.html
Updated by Ronald Schellberg almost 5 years ago
There is not one yet; I was waiting for some confirmation from others. I'll submit one later tonight.
Updated by Ronald Schellberg over 4 years ago
Ronald Schellberg wrote:
The ravdv-2.18_5-v2.5test.txz file is attached.
My bare metal router running my version of 2.5 has been up 14 days now and still 10/10 on test-ipv6.com
My VM version on stable 12 is also showing similar results; however, it tends to be rebuilt more often to incorporate the latest commits.
Updated by Michael Smith over 4 years ago
Ronald Schellberg wrote:
Pull Request # 773 submitted
Can you add a link to the PR?
Updated by Ronald Schellberg over 4 years ago
Can you add a link to the PR?
https://github.com/pfsense/FreeBSD-ports/pull/773
Updated by Ronald Schellberg over 4 years ago
- File radvd-2.18_5.txz radvd-2.18_5.txz added
Ronald Schellberg wrote:
Attached is a compiled RADVD for 2.5 with the above patch (slightly modified) incorporated. Added a logging message when a socket is added to the msocket array and additional information was added to the "can't join" message if the IPV6_JOIN_GROUP call fails. Bumped the dimension of the msocket array to 50 for good measure.
The ravdv-2.18_5-v2.5test.txz file is attached.
Attached is an updated RADVD compiled with the current 2.5 stable-12 branch.
For those experiencing IPv6 failures after 24 or so hours due to RADVD, consider:
- uploading this file to the router's /tmp directory
- issuing a "pkg install -y /tmp/radvd-2.18_5.txz" command
- rebooting
Confirm messages like below in your routing log to make sure the new version is applied:
Jun 18 21:13:49 radvd 32115 version 2.18 started
Jun 18 21:13:49 radvd 32327 adding ipv6-allrouters on hn1, sock: 4, iface->props.if_index:6
The patch should resolve the issue until PR #773 gets incorporated.
I have had installs run for more than 35 days using this patch, only to be stopped for other 2.5 updates.
Updated by Michael Geiger over 4 years ago
The patch should resolve the issue until PR #773 gets incorporated.
I have had installs run for more than 35 days using this patch, only to be stopped for other 2.5 updates.
Thanks a lot for your contribution. I installed your patched radvd and will test it also.
Do we know what is currently preventing your patch from being merged?
Updated by Louis B over 4 years ago
Hi,
I installed the patch and a lot of messages were gone. What was in the log after reboot is
Jul 6 12:31:08 pfSense radvd23095: adding ipv6-allrouters on lagg0.88, sock: 4, iface->props.if_index:19
Jul 6 12:31:08 pfSense radvd22990: version 2.18 started
I must say, perhaps that is OK, but I do not understand the first line at all. I have many VLANs, so it is strange to me that only one of them (vlan88) is mentioned here.
With my config, apart from the messages, I think all IPv6 was and is working correctly.
Note that I did not perform any testing apart from some IPv6 pings.
Louis
Updated by Luiz Souza over 4 years ago
- Status changed from New to Feedback
- Assignee set to Luiz Souza
- % Done changed from 0 to 100
Fixed in FreeBSD, the port workaround is unnecessary now.
Thanks for all the details Ronald.
Updated by Ronald Schellberg over 4 years ago
I don't know whether anyone has noticed, but the build system has stopped posting snaps since 7/9 00:50, which makes it more difficult to provide feedback on this and other recent changes. :-)
I confirmed that the 7/9 00:50 version begins to fail after 28:30 hours, so I reverted and rebased my local build this weekend. I can confirm that both my VM and bare metal installation go beyond that point and are continuing without error. Since I am not monitoring/logging the call sequence, my only concern is whether this is a full fix or whether it just pushed the issue down the road a bit. Time will tell.
One additional change to FreeBSD-src that would make the #2878 Leave_group call unnecessary would be to eliminate the error return on duplicate join_group calls. I am not sure what in the design spec makes rejecting a duplicate necessary. I haven't tested it, but might.
Updated by Ronald Schellberg over 4 years ago
"One additional change FreeBSD-src that would make the #2878 Leave_group call unnecessary would be to eliminate the error return on duplicate join_group calls. Not sure what is in the design spec makes rejecting a duplicate necessary. I haven't tested it, but might."
As a test, I removed the call to Leave_group from RADVD and removed the single line at 2038 from in6_mcast.c in addition to the applied fix to FreeBSD.
error = EINVAL;
IPV6 continued to perform correctly. This might be a more enduring solution.
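The test above changed the kernel so that a duplicate join no longer returns an error. A userland-only variation on the same idea, which nobody in this thread has tested, would be to let the caller treat the duplicate-join error as "already a member". The sketch below only classifies the result of the setsockopt() call; classify_join and the join_outcome values are hypothetical names, and which errno a duplicate join produces is an assumption that has to be supplied by the caller (the line removed above sets EINVAL; other platforms may report a different value such as EADDRINUSE).

```c
#include <assert.h>
#include <errno.h>

/* Possible interpretations of one IPV6_JOIN_GROUP attempt. */
enum join_outcome {
    JOIN_OK,             /* setsockopt() succeeded */
    JOIN_ALREADY_MEMBER, /* failed with the assumed duplicate-join errno */
    JOIN_FAILED          /* failed for any other reason */
};

/*
 * rc        - return value of setsockopt(..., IPV6_JOIN_GROUP, ...)
 * err       - errno captured immediately after that call
 * dup_errno - errno the platform is ASSUMED to use for a duplicate join
 */
static enum join_outcome classify_join(int rc, int err, int dup_errno)
{
    assert(dup_errno != 0);
    if (rc == 0)
        return JOIN_OK;
    if (err == dup_errno)
        return JOIN_ALREADY_MEMBER;
    return JOIN_FAILED;
}
```

One caveat with this approach: EINVAL is a generic error, so treating it as "already a member" could hide an unrelated invalid-argument failure. That is a reason to prefer either the kernel-side change or a tracking check in RADVD.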
Updated by Ronald Schellberg over 4 years ago
Luiz Souza wrote:
Fixed in FreeBSD, the port workaround is unnecessary now.
Thanks for all the details Ronald.
The snap built on Tue Jul 14 13:03:46 EDT 2020 has been running for 60+ hours now, so your commit appears to solve the issue.
Updated by Jim Pingle over 4 years ago
- Status changed from Feedback to Resolved
Updated by Lars Veldcholte over 1 year ago
This problem returned for me after updating to pfSense 2.6.0.
Immediately after starting, radvd begins spamming "can't join ipv6-allrouters on $interface" in the logs, and router advertisements do not work.
Mar 26 12:45:14 radvd 50012 attempting to reread config file
Mar 26 12:45:14 radvd 50012 warning: AdvRDNSSLifetime <= 2*MaxRtrAdvInterval would allow stale DNS servers to be deleted faster
Mar 26 12:45:14 radvd 50012 warning: (/var/etc/radvd.conf:22) AdvRDNSSLifetime <= 2*MaxRtrAdvInterval would allow stale DNS servers to be deleted faster
Mar 26 12:45:14 radvd 50012 warning: AdvDNSSLLifetime <= 2*MaxRtrAdvInterval would allow stale DNS suffixes to be deleted faster
Mar 26 12:45:14 radvd 50012 can't join ipv6-allrouters on vtnet3
Mar 26 12:45:14 radvd 50012 can't join ipv6-allrouters on vtnet4
Mar 26 12:45:14 radvd 50012 can't join ipv6-allrouters on vtnet2
Mar 26 12:45:14 radvd 50012 can't join ipv6-allrouters on vtnet0
Mar 26 12:45:14 radvd 50012 resuming normal operation
Mar 26 12:45:14 radvd 50012 can't join ipv6-allrouters on vtnet3
Mar 26 12:45:14 radvd 50012 can't join ipv6-allrouters on vtnet4
Mar 26 12:45:14 radvd 50012 can't join ipv6-allrouters on vtnet2
Mar 26 12:45:14 radvd 50012 can't join ipv6-allrouters on vtnet0
Updated by Lars Veldcholte over 1 year ago
Can this issue be reopened since it has reappeared in 2.6.0?
FWIW, I saw the same issue appeared in OPNsense, where they have since fixed it: https://forum.opnsense.org/index.php?topic=33148.msg160337