Bug #7119
Changing LAGG attributes results in a panic/crash
Status: Closed, % Done: 100%
Description
On 2.4, when changing attributes of an assigned LAGG such as the mode or membership, the firewall panics and reboots.
Tested on an 8860 and a 4860 (both with igb NICs), so it may be specific to igb. In this case, the lagg instance contained igb4 and igb5 in LACP mode, and I attempted to change the mode to Failover. bjaffe encountered the same crash when changing member interfaces.
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80e190c0
stack pointer           = 0x28:0xfffffe022c32fa30
frame pointer           = 0x28:0xfffffe022c32fa50
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi6: task queue)
db:0:kdb.enter.default> show pcpu
cpuid        = 2
dynamic pcpu = 0xfffffe02a9c86f00
curthread    = 0xfffff80006250500: pid 12 "swi6: task queue"
curpcb       = 0xfffffe022c32fcc0
fpcurthread  = none
idlethread   = 0xfffff80006233500: tid 100005 "idle: cpu2"
curpmap      = 0xffffffff829e5600
tssp         = 0xffffffff82a1dee0
commontssp   = 0xffffffff82a1dee0
rsp0         = 0xfffffe022c32fcc0
gs32p        = 0xffffffff82a24738
ldt          = 0xffffffff82a24778
tss          = 0xffffffff82a24768
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100023 td 0xfffff80006250500
arp_iflladdr() at arp_iflladdr+0x10/frame 0xfffffe022c32fa50
lagg_port_setlladdr() at lagg_port_setlladdr+0x14e/frame 0xfffffe022c32faa0
taskqueue_run_locked() at taskqueue_run_locked+0x14a/frame 0xfffffe022c32fb00
taskqueue_run() at taskqueue_run+0xbf/frame 0xfffffe022c32fb20
intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe022c32fb60
ithread_loop() at ithread_loop+0xc6/frame 0xfffffe022c32fbb0
fork_exit() at fork_exit+0x85/frame 0xfffffe022c32fbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe022c32fbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
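For anyone trying to reproduce this outside the GUI, the steps described above can be approximated from a shell. The script below is a hypothetical sketch: it assumes FreeBSD ifconfig(8)/lagg(4) syntax and that igb4 and igb5 are spare ports, and it is not necessarily the sequence the pfSense GUI runs internally. The final destroy step is included because the crash was later traced to `ifconfig laggX destroy`. By default it only prints the commands, since executing them on an affected 2.4 snapshot may panic the host.

```python
import subprocess

# Hypothetical reproduction sketch. Assumptions: FreeBSD ifconfig(8)/lagg(4)
# syntax, and igb4/igb5 are unused ports. Not the pfSense GUI's own code path.
LAGG = "lagg0"
PORTS = ("igb4", "igb5")


def repro_commands(lagg=LAGG, ports=PORTS):
    """Build ifconfig invocations: create an LACP lagg, switch it to
    failover, then destroy it."""
    setup = ["ifconfig", lagg, "laggproto", "lacp"]
    for port in ports:
        setup += ["laggport", port]
    setup.append("up")
    return [
        ["ifconfig", lagg, "create"],
        setup,
        # The mode change attempted in this report.
        ["ifconfig", lagg, "laggproto", "failover"],
        # The operation the crash was eventually traced to.
        ["ifconfig", lagg, "destroy"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False.
    On an affected build, executing these may panic the machine."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(repro_commands())  # dry run: prints the commands only
```

Call `run(repro_commands(), dry_run=False)` only on a disposable test box.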
Updated by Rolf Sommerhalder almost 8 years ago
With a 2.4 amd64 snapshot on Supermicro SuperServer 5018D-FN8T systems with X10SDV-TP8F motherboards, changing, for example, the IP address of a VLAN on an LACP LAGG of igb1, igb2, and igb3 also panics, and the kernel then hangs.
Recovery requires a manual reset or power cycle, done remotely through the BMC/IPMI. Fortunately the box then restarts, and the changes take effect.
For such situations it would help to get the watchdog working, which is available in the BIOS.
Updated by Renato Botelho almost 8 years ago
- Status changed from New to Feedback
- Assignee set to Renato Botelho
- % Done changed from 0 to 100
I've cherry-picked FreeBSD-src patches that should fix it:
https://svnweb.freebsd.org/base?view=revision&revision=310180
https://svnweb.freebsd.org/base?view=revision&revision=310327
Updated by Jim Pingle almost 8 years ago
- Status changed from Feedback to Confirmed
Still crashes on the latest factory snapshot: Wed Jan 18 19:49:46 CST 2017
Updated by Renato Botelho almost 8 years ago
I couldn't reproduce it on a VM using the em driver; it's probably something specific to igb, as mentioned.
Updated by Rolf Sommerhalder almost 8 years ago
Snapshots from this morning still crash with igb hardware NICs.
Updated by Rolf Sommerhalder almost 8 years ago
To be more precise: pfSense does not exactly "crash", as it is still pingable. SSH sessions opened before the "crash" stay connected and still accept typed commands, but nothing is returned.
Only a reset or power cycle gets it out of this state (I have not managed to get the watchdog working yet).
After that, the changes made to the LAGG right before the "crash" take effect.
Updated by Jim Pingle almost 8 years ago
Here, it still panics + dumps + reboots same as it did originally.
Updated by Renato Botelho almost 8 years ago
- Assignee changed from Renato Botelho to Luiz Souza
Updated by Luiz Souza almost 8 years ago
- Status changed from Confirmed to Feedback
Fixed in latest snapshot.
Relevant commits:
https://github.com/pfsense/FreeBSD-src/commit/b5996bd8278c710ce6859cfae2208e175e9b1171
https://github.com/pfsense/FreeBSD-src/commit/a86883d40fbb81454f6e44c6a759c0142408912d
Updated by Jim Pingle almost 8 years ago
Seems better now; it doesn't crash. There is a lot of activity in the log, though:
Jan 27 19:47:40 master snmpd[47102]: SIOCGIFDESCR (lagg0): Device not configured
Jan 27 19:47:40 master kernel: igb4: lagg_port_destroy: lp_ifflags unclean
Jan 27 19:47:40 master kernel: igb5: lagg_port_destroy: lp_ifflags unclean
Jan 27 19:47:40 master kernel: lagg0: promiscuous mode disabled
Jan 27 19:47:40 master check_reload_status: Linkup starting lagg0
Jan 27 19:47:40 master kernel: lagg0: link state changed to DOWN
Jan 27 19:47:40 master check_reload_status: Syncing firewall
Jan 27 19:47:40 master php-fpm[43135]: /interfaces_lagg_edit.php: Beginning https://portal.pfsense.org configuration backup.
Jan 27 19:47:41 master check_reload_status: Reloading filter
Jan 27 19:47:43 master php-fpm[43135]: /interfaces_lagg_edit.php: End of portal.pfsense.org configuration backup (success).
Jan 27 19:47:43 master snmpd[47102]: SIOCGIFDESCR (lagg0_vlan10): Device not configured
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: carp: demoted by -240 to 240 (vhid removed)
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: carp: demoted by -240 to 0 (vhid removed)
Jan 27 19:47:43 master kernel: lagg0_vlan10: promiscuous mode disabled
Jan 27 19:47:43 master kernel: vlan0: changing name to 'lagg0_vlan10'
Jan 27 19:47:43 master snmpd[47102]: SIOCGIFDESCR (lagg0_vlan10): Device not configured
Jan 27 19:47:43 master snmpd[47102]: SIOCGIFDESCR (vlan0): Device not configured
Jan 27 19:47:43 master kernel: lagg0: promiscuous mode enabled
Jan 27 19:47:43 master kernel: lagg0_vlan10: promiscuous mode enabled
Jan 27 19:47:43 master check_reload_status: Restarting ipsec tunnels
Jan 27 19:47:43 master kernel: carp: demoted by 240 to 240 (interface down)
Jan 27 19:47:43 master kernel: carp: demoted by 240 to 480 (interface down)
Jan 27 19:47:45 master check_reload_status: updating dyndns opt2
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: carp: demoted by -240 to 240 (vhid removed)
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: carp: demoted by -240 to 0 (vhid removed)
Jan 27 19:47:45 master kernel: lagg0: promiscuous mode disabled
Jan 27 19:47:45 master kernel: lagg0_vlan10: promiscuous mode disabled
Jan 27 19:47:46 master snmpd[47102]: SIOCGIFDESCR (lagg0_vlan20): Device not configured
Jan 27 19:47:46 master kernel: lagg0: promiscuous mode enabled
Jan 27 19:47:46 master kernel: lagg0_vlan10: promiscuous mode enabled
Jan 27 19:47:46 master kernel: carp: demoted by 240 to 240 (interface down)
Jan 27 19:47:46 master kernel: carp: demoted by 240 to 480 (interface down)
Jan 27 19:47:46 master kernel: vlan1: changing name to 'lagg0_vlan20'
Jan 27 19:47:46 master snmpd[47102]: SIOCGIFDESCR (vlan1): Device not configured
Jan 27 19:47:59 master php-fpm[94047]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Jan 27 19:47:59 master check_reload_status: Reloading filter
If that is normal/expected then we can close this.
Updated by Luiz Souza almost 8 years ago
Yes, the messages do not seem related to the original bug (the crash at ifconfig laggX destroy).
Let's open a new ticket to track these warnings.
Updated by Luiz Souza almost 8 years ago
- Status changed from Feedback to Resolved
Updated by Michael OBrien about 7 years ago
Luiz Souza wrote:
Yes, the messages do not seem related to the original bug (the crash at ifconfig laggX destroy).
Let's open a new ticket to track these warnings.
Was this new ticket opened? When I change LAGG interface settings via the pfSense GUI or a command prompt, my pfSense 2.4.1 box (using igb drivers) cannot ping anything on the LAGG until I completely reboot it.
Nothing interesting in dmesg. Here's what shows up in system.log; you'll see a lot of sync noise, but this happened before HA was configured as well.
Oct 25 11:43:26 fw-lvdc-01 check_reload_status: Syncing firewall
Oct 25 11:43:27 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.0.2:443/xmlrpc.php.
Oct 25 11:43:27 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: XMLRPC reload data success with https://172.16.0.2:443/xmlrpc.php (pfsense.host_firmware_version).
Oct 25 11:43:27 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: XMLRPC versioncheck: 17.3 -- 17.3
Oct 25 11:43:27 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.0.2:443/xmlrpc.php.
Oct 25 11:43:28 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: XMLRPC reload data success with https://172.16.0.2:443/xmlrpc.php (pfsense.restore_config_section).
Oct 25 11:43:28 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.0.2:443/xmlrpc.php.
Oct 25 11:43:28 fw-lvdc-01 check_reload_status: Linkup starting igb2
Oct 25 11:43:28 fw-lvdc-01 kernel: igb2: link state changed to DOWN
Oct 25 11:43:28 fw-lvdc-01 kernel: igb3: link state changed to DOWN
Oct 25 11:43:28 fw-lvdc-01 kernel: lagg0: link state changed to DOWN
Oct 25 11:43:28 fw-lvdc-01 check_reload_status: Restarting ipsec tunnels
Oct 25 11:43:28 fw-lvdc-01 check_reload_status: Linkup starting igb3
Oct 25 11:43:28 fw-lvdc-01 check_reload_status: Linkup starting lagg0
Oct 25 11:43:29 fw-lvdc-01 check_reload_status: Reloading filter
Oct 25 11:43:29 fw-lvdc-01 check_reload_status: Reloading filter
Oct 25 11:43:29 fw-lvdc-01 php-fpm[89611]: /rc.linkup: Hotplug event detected for MGMT(lan) static IP (10.50.1.1 )
Oct 25 11:43:30 fw-lvdc-01 check_reload_status: updating dyndns lan
Oct 25 11:43:31 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: XMLRPC reload data success with https://172.16.0.2:443/xmlrpc.php (pfsense.filter_configure).
Oct 25 11:43:32 fw-lvdc-01 check_reload_status: Linkup starting igb2
Oct 25 11:43:32 fw-lvdc-01 kernel: igb2: link state changed to UP
Oct 25 11:43:32 fw-lvdc-01 kernel: lagg0: link state changed to UP
Oct 25 11:43:32 fw-lvdc-01 check_reload_status: Linkup starting lagg0
Oct 25 11:43:32 fw-lvdc-01 check_reload_status: Linkup starting igb3
Oct 25 11:43:32 fw-lvdc-01 kernel: igb3: link state changed to UP
Oct 25 11:43:32 fw-lvdc-01 check_reload_status: Reloading filter
Oct 25 11:43:32 fw-lvdc-01 php-fpm[87800]: /interfaces.php: Creating rrd update script
Oct 25 11:43:33 fw-lvdc-01 php-fpm[87800]: /rc.linkup: Hotplug event detected for MGMT(lan) static IP (10.50.1.1 )
Oct 25 11:43:33 fw-lvdc-01 check_reload_status: Reloading filter
Oct 25 11:43:33 fw-lvdc-01 check_reload_status: rc.newwanip starting lagg0
Oct 25 11:43:34 fw-lvdc-01 php-fpm[32701]: /rc.newwanip: rc.newwanip: Info: starting on lagg0.
Oct 25 11:43:34 fw-lvdc-01 php-fpm[32701]: /rc.newwanip: rc.newwanip: on (IP address: 10.50.1.1) (interface: MGMT[lan]) (real interface: lagg0).
Oct 25 11:43:34 fw-lvdc-01 check_reload_status: Reloading filter
Updated by Michael OBrien about 7 years ago
I think it's this one; testing a nightly snapshot now: https://redmine.pfsense.org/issues/7928
Updated by Steve Wheeler about 7 years ago
If it didn't actually panic it's probably that MAC address issue. That should be fixed in 2.4.2 snaps now. Please report if you're still able to trigger it there.
Updated by Michael OBrien about 7 years ago
Steve Wheeler wrote:
If it didn't actually panic it's probably that MAC address issue. That should be fixed in 2.4.2 snaps now. Please report if you're still able to trigger it there.
Nope, 2.4.2 snapshots fixed it right up. Thanks!