Project

General

Profile

Actions

Bug #8506

closed

Constant link cycling on some DHCP interfaces causes connectivity problems and other issues

Added by Jim Pingle over 6 years ago. Updated over 6 years ago.

Status:
Duplicate
Priority:
Very High
Assignee:
-
Category:
Interfaces
Target version:
Start date:
05/10/2018
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Release Notes:
Affected Version:
2.4.4
Affected Architecture:
amd64

Description

Since the switch to 11-stable on 2.4.4 snapshots, it appears that in some cases a DHCP WAN interface will constantly cycle link up/down for no apparent reason. The link cycling leads to other services restarting so often as to render the system mostly unusable.

I have only seen it happen on physical igb interfaces so far. I can't replicate it in a VM or on ARM (SG-1000 and SG-3100 are OK)

It happens with and without a spoofed MAC address on the interface.

        <opt1>
            <descr><![CDATA[Cable]]></descr>
            <if>igb2</if>
            <enable></enable>
            <alias-address></alias-address>
            <alias-subnet>32</alias-subnet>
            <spoofmac>00:xx:xx:xx:xx:xx</spoofmac>
            <monitorip>x.x.x.1</monitorip>
            <ipaddr>dhcp</ipaddr>
            <dhcphostname></dhcphostname>
            <dhcprejectfrom>192.168.100.1</dhcprejectfrom>
            <adv_dhcp_pt_timeout></adv_dhcp_pt_timeout>
            <adv_dhcp_pt_retry></adv_dhcp_pt_retry>
            <adv_dhcp_pt_select_timeout></adv_dhcp_pt_select_timeout>
            <adv_dhcp_pt_reboot></adv_dhcp_pt_reboot>
            <adv_dhcp_pt_backoff_cutoff></adv_dhcp_pt_backoff_cutoff>
            <adv_dhcp_pt_initial_interval></adv_dhcp_pt_initial_interval>
            <adv_dhcp_pt_values>SavedCfg</adv_dhcp_pt_values>
            <adv_dhcp_send_options></adv_dhcp_send_options>
            <adv_dhcp_request_options></adv_dhcp_request_options>
            <adv_dhcp_required_options></adv_dhcp_required_options>
            <adv_dhcp_option_modifiers></adv_dhcp_option_modifiers>
            <adv_dhcp_config_advanced></adv_dhcp_config_advanced>
            <adv_dhcp_config_file_override></adv_dhcp_config_file_override>
            <adv_dhcp_config_file_override_path></adv_dhcp_config_file_override_path>
        </opt1>
May 10 20:00:02 blooper kernel: igb2: link state changed to DOWN
May 10 20:00:03 blooper check_reload_status: rc.newwanip starting igb2
May 10 20:00:04 blooper php-fpm[36293]: /rc.linkup: dpinger: timeout while retrieving status for gateway HeNet_GW
May 10 20:00:04 blooper php-fpm[36293]: /rc.linkup: Gateways status could not be determined, considering all as up/active. (Group: PreferCableV6)
May 10 20:00:04 blooper php-fpm[36293]: /rc.linkup: Gateways status could not be determined, considering all as up/active. (Group: PreferDSLV6)
May 10 20:00:04 blooper php-fpm[36293]: /rc.linkup: Gateways status could not be determined, considering all as up/active. (Group: LoadBalanceV6)
May 10 20:00:04 blooper kernel: igb2: link state changed to UP
May 10 20:00:04 blooper check_reload_status: Linkup starting igb2
May 10 20:00:04 blooper php-fpm[41445]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_DSLGW
May 10 20:00:05 blooper php-fpm[45686]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_DSLGW
May 10 20:00:07 blooper php-fpm[36293]: /rc.linkup: dpinger: timeout while retrieving status for gateway HeNet_DSLGW
May 10 20:00:07 blooper php-fpm[41445]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_GW
May 10 20:00:07 blooper php-fpm[41445]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: PreferCableV6)
May 10 20:00:07 blooper php-fpm[41445]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: PreferDSLV6)
May 10 20:00:07 blooper php-fpm[41445]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: LoadBalanceV6)
May 10 20:00:07 blooper php-fpm[41445]: /rc.newwanip: phpDynDNS (vpnpeer.pingle.org): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
May 10 20:00:08 blooper php-fpm[45686]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_GW
May 10 20:00:08 blooper php-fpm[45686]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: PreferCableV6)
May 10 20:00:08 blooper php-fpm[45686]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: PreferDSLV6)
May 10 20:00:08 blooper php-fpm[45686]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: LoadBalanceV6)
May 10 20:00:10 blooper check_reload_status: Linkup starting igb2
May 10 20:00:10 blooper kernel: igb2: link state changed to DOWN
May 10 20:00:10 blooper check_reload_status: rc.newwanip starting igb2
May 10 20:00:10 blooper php-fpm[36293]: /rc.linkup: dpinger: timeout while retrieving status for gateway HeNet_GW
May 10 20:00:10 blooper php-fpm[45686]: /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1525996810] unbound[15130:0] error: bind: address already in use [1525996810] unbound[15130:0] fatal error: could not open ports'
May 10 20:00:10 blooper dhcpleases: kqueue error: unkown
May 10 20:00:10 blooper php-fpm[36293]: /rc.linkup: Gateways status could not be determined, considering all as up/active. (Group: PreferCableV6)
May 10 20:00:10 blooper php-fpm[36293]: /rc.linkup: Gateways status could not be determined, considering all as up/active. (Group: PreferDSLV6)
May 10 20:00:10 blooper php-fpm[36293]: /rc.linkup: Gateways status could not be determined, considering all as up/active. (Group: LoadBalanceV6)
May 10 20:00:10 blooper check_reload_status: updating dyndns opt1
May 10 20:00:10 blooper check_reload_status: Reloading filter
May 10 20:00:10 blooper php-fpm[85682]: /rc.linkup: DEVD Ethernet detached event for opt1
May 10 20:00:11 blooper php-fpm[326]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_DSLGW
May 10 20:00:11 blooper check_reload_status: Reloading filter
May 10 20:00:11 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper php-fpm[30692]: /rc.linkup: DEVD Ethernet attached event for opt1
May 10 20:00:12 blooper php-fpm[30692]: /rc.linkup: HOTPLUG: Configuring interface opt1
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper php-fpm[41445]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_DSLGW
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:12 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:13 blooper check_reload_status: Linkup starting igb2
May 10 20:00:13 blooper kernel: igb2: link state changed to UP
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper php-fpm[326]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_GW
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper php-fpm[326]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: PreferCableV6)
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper kernel: arpresolve: can't allocate llinfo for X.X.X.1 on igb2
May 10 20:00:14 blooper php-fpm[326]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: PreferDSLV6)
May 10 20:00:14 blooper php-fpm[326]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: LoadBalanceV6)
May 10 20:00:14 blooper check_reload_status: Linkup starting igb2
May 10 20:00:14 blooper kernel: igb2: link state changed to DOWN
May 10 20:00:14 blooper kernel: nd6_setmtu0: new link MTU on igb2 (576) is too small for IPv6
May 10 20:00:14 blooper check_reload_status: rc.newwanip starting igb2
May 10 20:00:14 blooper snmpd[15821]: SIOCGIFDESCR (gif0): Device not configured
May 10 20:00:14 blooper kernel: gif0: link state changed to DOWN
May 10 20:00:14 blooper snmpd[15821]: SIOCGIFDESCR (gif0): Device not configured
May 10 20:00:14 blooper kernel: gif0: link state changed to UP
May 10 20:00:15 blooper php-fpm[41445]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_GW
May 10 20:00:16 blooper php-fpm[41445]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: PreferCableV6)
May 10 20:00:16 blooper php-fpm[41445]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: PreferDSLV6)
May 10 20:00:16 blooper php-fpm[41445]: /rc.newwanip: Gateways status could not be determined, considering all as up/active. (Group: LoadBalanceV6)
May 10 20:00:16 blooper kernel: igb2: link state changed to UP
May 10 20:00:16 blooper check_reload_status: Linkup starting igb2
May 10 20:00:16 blooper php-fpm[45686]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_DSLGW
May 10 20:00:16 blooper check_reload_status: Linkup starting igb2
May 10 20:00:16 blooper kernel: igb2: link state changed to DOWN
May 10 20:00:17 blooper check_reload_status: rc.newwanip starting igb2
May 10 20:00:18 blooper php-fpm[326]: /rc.newwanip: dpinger: timeout while retrieving status for gateway HeNet_DSLGW
May 10 20:00:18 blooper php-fpm[30692]: /rc.linkup: dpinger: timeout while retrieving status for gateway HeNet_DSLGW
May 10 20:00:18 blooper kernel: igb2: link state changed to UP
May 10 20:00:18 blooper check_reload_status: Linkup starting igb2
Actions #1

Updated by Jim Pingle over 6 years ago

So far I haven't found any relevant common elements between the two systems that can replicate the problem.

That NIC shows a large number of link interrupts as well:

interrupt                          total       rate
irq262: igb2:que 0                 41266         35
irq263: igb2:que 1                  9468          8
irq264: igb2:link                    449          0

That matches up with the number of link transitions logged:

$ grep -c 'igb2: link state changed' OS-Message\ Buffer.txt 
447

Other NICs on that box had <=3 events

There are other systems with igb NICs running DHCP WANs that do not appear to hit this issue so it's still unclear what is triggering the problem.

Actions #2

Updated by Jim Pingle over 6 years ago

May or may not be related but even on the non-igb hardware I can set this into a link cycle/wan reconfigure loop by doing an edit/save/apply on the WAN, and that didn't happen on previous releases.

Actions #3

Updated by Jim Pingle over 6 years ago

This may end up being the same root cause as #8507, dhclient in FreeBSD gained support for MTU, but setting MTU on e1000 causes the link to cycle. dhclient was altered to not exit on link drop, but that doesn't help our other link processing scripts, which fall into a never-ending loop. See #8507#note-2

Actions #4

Updated by Jim Pingle over 6 years ago

  • Status changed from Confirmed to Feedback

Working under the assumption this is related to #8507, a patch was added to help address the issue. If it's still broken when #8507 is OK, then this can be treated as a separate issue.

Actions #5

Updated by Jim Pingle over 6 years ago

  • Status changed from Feedback to Duplicate

Closing this as a duplicate of #8506 -- they had the same root cause, and the information on #8507 is closer to the root cause (MTU from DHCP caused a link bounce loop in this case)

Actions

Also available in: Atom PDF