Bug #7119

closed

Changing LAGG attributes results in a panic/crash

Added by Jim Pingle almost 8 years ago. Updated about 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Interfaces
Target version:
Start date: 01/13/2017
Due date:
% Done: 100%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: 2.4
Affected Architecture: amd64

Description

On 2.4, when changing attributes of an assigned LAGG such as the mode or membership, the firewall panics and reboots.

Tested on an 8860 and 4860, so it may be specific to igb. In this case, the lagg instance contained igb4,igb5 in LACP mode, and I attempted to change the mode to Failover. bjaffe encountered the same crash when changing member interfaces.

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address    = 0x0
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff80e190c0
stack pointer            = 0x28:0xfffffe022c32fa30
frame pointer            = 0x28:0xfffffe022c32fa50
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 12 (swi6: task queue)
db:0:kdb.enter.default>  show pcpu
cpuid        = 2
dynamic pcpu = 0xfffffe02a9c86f00
curthread    = 0xfffff80006250500: pid 12 "swi6: task queue" 
curpcb       = 0xfffffe022c32fcc0
fpcurthread  = none
idlethread   = 0xfffff80006233500: tid 100005 "idle: cpu2" 
curpmap      = 0xffffffff829e5600
tssp         = 0xffffffff82a1dee0
commontssp   = 0xffffffff82a1dee0
rsp0         = 0xfffffe022c32fcc0
gs32p        = 0xffffffff82a24738
ldt          = 0xffffffff82a24778
tss          = 0xffffffff82a24768
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100023 td 0xfffff80006250500
arp_iflladdr() at arp_iflladdr+0x10/frame 0xfffffe022c32fa50
lagg_port_setlladdr() at lagg_port_setlladdr+0x14e/frame 0xfffffe022c32faa0
taskqueue_run_locked() at taskqueue_run_locked+0x14a/frame 0xfffffe022c32fb00
taskqueue_run() at taskqueue_run+0xbf/frame 0xfffffe022c32fb20
intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe022c32fb60
ithread_loop() at ithread_loop+0xc6/frame 0xfffffe022c32fbb0
fork_exit() at fork_exit+0x85/frame 0xfffffe022c32fbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe022c32fbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
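
As a reading aid for the trace above, here is a minimal Python sketch that pulls the function names and offsets out of the DDB `bt` output and prints the call chain, innermost frame first. The regular expression and the helper name `parse_backtrace` are illustrative assumptions, not part of any pfSense or FreeBSD tool. The chain shows the fault being hit in `arp_iflladdr`, called from `lagg_port_setlladdr` on a task queue thread. Together with the fault virtual address of 0x0, that is consistent with a NULL pointer being read there, though the trace alone does not prove the cause.

```python
import re

# Backtrace lines copied from the report above (innermost frame first).
BACKTRACE = """\
arp_iflladdr() at arp_iflladdr+0x10/frame 0xfffffe022c32fa50
lagg_port_setlladdr() at lagg_port_setlladdr+0x14e/frame 0xfffffe022c32faa0
taskqueue_run_locked() at taskqueue_run_locked+0x14a/frame 0xfffffe022c32fb00
taskqueue_run() at taskqueue_run+0xbf/frame 0xfffffe022c32fb20
intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe022c32fb60
ithread_loop() at ithread_loop+0xc6/frame 0xfffffe022c32fbb0
fork_exit() at fork_exit+0x85/frame 0xfffffe022c32fbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe022c32fbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
"""

# Hypothetical pattern for lines of the form "func() at func+0xNN/frame 0xADDR".
FRAME_RE = re.compile(
    r"^(?P<func>\w+)\(\) at (?P=func)\+(?P<offset>0x[0-9a-f]+)/frame 0x[0-9a-f]+$"
)


def parse_backtrace(text):
    """Return a list of (function, offset) tuples, innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:  # lines such as "--- trap 0 ..." are skipped
            frames.append((match.group("func"), int(match.group("offset"), 16)))
    return frames


frames = parse_backtrace(BACKTRACE)
# Print the call chain as "callee <- caller <- ...".
print(" <- ".join(func for func, _ in frames))
```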
Actions #1

Updated by Rolf Sommerhalder almost 8 years ago

Jim Pingle wrote:

On 2.4, when changing attributes of an assigned LAGG such as the mode or membership, the firewall panics and reboots.

Tested on an 8860 and 4860, so it may be specific to igb. In this case, the lagg instance contained igb4,igb5 in LACP mode, and I attempted to change the mode to Failover. bjaffe encountered the same crash when changing member interfaces.

On a 2.4 amd64 snapshot running on Supermicro SuperServer 5018D-FN8T systems with X10SDV-TP8F motherboards, changing the IP address of a VLAN on a LACP LAGG with members igb1,igb2,igb3, for example, also causes a panic, and the kernel subsequently hangs.

Recovery requires a manual reset or power cycle, done remotely via BMC/IPMI. Fortunately the system then restarts, and the changes take effect.

For such situations, it would be helpful to get the hardware watchdog working, which is available in the BIOS...

Actions #2

Updated by Renato Botelho almost 8 years ago

  • Status changed from New to Feedback
  • Assignee set to Renato Botelho
  • % Done changed from 0 to 100
Actions #3

Updated by Jim Pingle almost 8 years ago

  • Status changed from Feedback to Confirmed

Still crashes on the latest factory snapshot: Wed Jan 18 19:49:46 CST 2017

Actions #4

Updated by Renato Botelho almost 8 years ago

I couldn't reproduce it on a VM using the em driver, so it is probably specific to igb, as mentioned.

Actions #5

Updated by Rolf Sommerhalder almost 8 years ago

Snapshots from this morning still crash with igb hardware NICs.

Actions #6

Updated by Rolf Sommerhalder almost 8 years ago

To be more precise: pfSense does not exactly "crash", as it is still pingable. SSH sessions that were open before the "crash" remain connected and still accept typed commands, but return no output.

Only a reset or power cycle gets it out of this state (I did not manage to get the watchdog working yet).
Thereafter, the changes made to LAGG right before the "crash" take effect.

Actions #7

Updated by Jim Pingle almost 8 years ago

Here, it still panics + dumps + reboots same as it did originally.

Actions #8

Updated by Renato Botelho almost 8 years ago

  • Assignee changed from Renato Botelho to Luiz Souza
Actions #10

Updated by Jim Pingle almost 8 years ago

Seems better now, it doesn't crash. Lots of activity in the log, though:

Jan 27 19:47:40 master snmpd[47102]: SIOCGIFDESCR (lagg0): Device not configured
Jan 27 19:47:40 master kernel: igb4: lagg_port_destroy: lp_ifflags unclean
Jan 27 19:47:40 master kernel: igb5: lagg_port_destroy: lp_ifflags unclean
Jan 27 19:47:40 master kernel: lagg0: promiscuous mode disabled
Jan 27 19:47:40 master check_reload_status: Linkup starting lagg0
Jan 27 19:47:40 master kernel: lagg0: link state changed to DOWN
Jan 27 19:47:40 master check_reload_status: Syncing firewall
Jan 27 19:47:40 master php-fpm[43135]: /interfaces_lagg_edit.php: Beginning https://portal.pfsense.org configuration backup.
Jan 27 19:47:41 master check_reload_status: Reloading filter
Jan 27 19:47:43 master php-fpm[43135]: /interfaces_lagg_edit.php: End of portal.pfsense.org configuration backup (success).
Jan 27 19:47:43 master snmpd[47102]: SIOCGIFDESCR (lagg0_vlan10): Device not configured
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: carp: demoted by -240 to 240 (vhid removed)
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: carp: demoted by -240 to 0 (vhid removed)
Jan 27 19:47:43 master kernel: lagg0_vlan10: promiscuous mode disabled
Jan 27 19:47:43 master kernel: vlan0: changing name to 'lagg0_vlan10'
Jan 27 19:47:43 master snmpd[47102]: SIOCGIFDESCR (lagg0_vlan10): Device not configured
Jan 27 19:47:43 master snmpd[47102]: SIOCGIFDESCR (vlan0): Device not configured
Jan 27 19:47:43 master kernel: lagg0: promiscuous mode enabled
Jan 27 19:47:43 master kernel: lagg0_vlan10: promiscuous mode enabled
Jan 27 19:47:43 master check_reload_status: Restarting ipsec tunnels
Jan 27 19:47:43 master kernel: carp: demoted by 240 to 240 (interface down)
Jan 27 19:47:43 master kernel: carp: demoted by 240 to 480 (interface down)
Jan 27 19:47:45 master check_reload_status: updating dyndns opt2
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: carp: demoted by -240 to 240 (vhid removed)
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:45 master kernel: carp: demoted by -240 to 0 (vhid removed)
Jan 27 19:47:45 master kernel: lagg0: promiscuous mode disabled
Jan 27 19:47:45 master kernel: lagg0_vlan10: promiscuous mode disabled
Jan 27 19:47:46 master snmpd[47102]: SIOCGIFDESCR (lagg0_vlan20): Device not configured
Jan 27 19:47:46 master kernel: lagg0: promiscuous mode enabled
Jan 27 19:47:46 master kernel: lagg0_vlan10: promiscuous mode enabled
Jan 27 19:47:46 master kernel: carp: demoted by 240 to 240 (interface down)
Jan 27 19:47:46 master kernel: carp: demoted by 240 to 480 (interface down)
Jan 27 19:47:46 master kernel: vlan1: changing name to 'lagg0_vlan20'
Jan 27 19:47:46 master snmpd[47102]: SIOCGIFDESCR (vlan1): Device not configured
Jan 27 19:47:59 master php-fpm[94047]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Jan 27 19:47:59 master check_reload_status: Reloading filter

If that is normal/expected then we can close this.
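
To judge how much of a log like this is repetition, a minimal Python sketch can strip the timestamp, host, and PID from each syslog line and count identical messages. The pattern `LINE_RE`, the helper `summarize`, and the five-line sample (copied from the log above) are illustrative assumptions, not an existing pfSense utility.

```python
import re
from collections import Counter

# A few lines copied from the log above, used here as sample input.
SAMPLE = """\
Jan 27 19:47:40 master snmpd[47102]: SIOCGIFDESCR (lagg0): Device not configured
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0_vlan10: 3
Jan 27 19:47:43 master kernel: carp: demoted by -240 to 240 (vhid removed)
"""

# Hypothetical pattern: "Mon DD HH:MM:SS host process[pid]: message".
LINE_RE = re.compile(
    r"^\w{3} +\d+ [\d:]{8} \S+ (?P<proc>[^\s:\[]+)(?:\[\d+\])?: (?P<msg>.*)$"
)


def summarize(lines):
    """Count identical (process, message) pairs, ignoring timestamps and PIDs."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("proc"), match.group("msg"))] += 1
    return counts


counts = summarize(SAMPLE.splitlines())
# List the most frequent messages first.
for (proc, msg), n in counts.most_common():
    print(f"{n:3d}  {proc}: {msg}")
```

Run against the full log, a summary like this makes it easier to see which warnings dominate before deciding whether they deserve their own ticket.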

Actions #11

Updated by Luiz Souza almost 8 years ago

Yes, the messages do not seem related to the original bug (crash at ifconfig laggX destroy).

Let's open a new ticket to track these warnings.

Actions #12

Updated by Luiz Souza almost 8 years ago

  • Status changed from Feedback to Resolved
Actions #13

Updated by Michael OBrien about 7 years ago

Luiz Souza wrote:

Yes, the messages do not seem related to the original bug (crash at ifconfig laggX destroy).

Let's open a new ticket to track these warnings.

Was this new ticket opened? When I change LAGG interface settings via the pfSense GUI or a command prompt, my pfSense 2.4.1 box (using igb drivers) cannot ping anything on the LAGG until I completely reboot it.

Nothing interesting in dmesg. Here's what shows up in system.log - you'll see a lot of sync noise, but this happened before HA was configured as well.

Oct 25 11:43:26 fw-lvdc-01 check_reload_status: Syncing firewall
Oct 25 11:43:27 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.0.2:443/xmlrpc.php.
Oct 25 11:43:27 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: XMLRPC reload data success with https://172.16.0.2:443/xmlrpc.php (pfsense.host_firmware_version).
Oct 25 11:43:27 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: XMLRPC versioncheck: 17.3 -- 17.3
Oct 25 11:43:27 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.0.2:443/xmlrpc.php.
Oct 25 11:43:28 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: XMLRPC reload data success with https://172.16.0.2:443/xmlrpc.php (pfsense.restore_config_section).
Oct 25 11:43:28 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.0.2:443/xmlrpc.php.
Oct 25 11:43:28 fw-lvdc-01 check_reload_status: Linkup starting igb2
Oct 25 11:43:28 fw-lvdc-01 kernel: igb2: link state changed to DOWN
Oct 25 11:43:28 fw-lvdc-01 kernel: igb3: link state changed to DOWN
Oct 25 11:43:28 fw-lvdc-01 kernel: lagg0: link state changed to DOWN
Oct 25 11:43:28 fw-lvdc-01 check_reload_status: Restarting ipsec tunnels
Oct 25 11:43:28 fw-lvdc-01 check_reload_status: Linkup starting igb3
Oct 25 11:43:28 fw-lvdc-01 check_reload_status: Linkup starting lagg0
Oct 25 11:43:29 fw-lvdc-01 check_reload_status: Reloading filter
Oct 25 11:43:29 fw-lvdc-01 check_reload_status: Reloading filter
Oct 25 11:43:29 fw-lvdc-01 php-fpm[89611]: /rc.linkup: Hotplug event detected for MGMT(lan) static IP (10.50.1.1 )
Oct 25 11:43:30 fw-lvdc-01 check_reload_status: updating dyndns lan
Oct 25 11:43:31 fw-lvdc-01 php-fpm[52624]: /rc.filter_synchronize: XMLRPC reload data success with https://172.16.0.2:443/xmlrpc.php (pfsense.filter_configure).
Oct 25 11:43:32 fw-lvdc-01 check_reload_status: Linkup starting igb2
Oct 25 11:43:32 fw-lvdc-01 kernel: igb2: link state changed to UP
Oct 25 11:43:32 fw-lvdc-01 kernel: lagg0: link state changed to UP
Oct 25 11:43:32 fw-lvdc-01 check_reload_status: Linkup starting lagg0
Oct 25 11:43:32 fw-lvdc-01 check_reload_status: Linkup starting igb3
Oct 25 11:43:32 fw-lvdc-01 kernel: igb3: link state changed to UP
Oct 25 11:43:32 fw-lvdc-01 check_reload_status: Reloading filter
Oct 25 11:43:32 fw-lvdc-01 php-fpm[87800]: /interfaces.php: Creating rrd update script
Oct 25 11:43:33 fw-lvdc-01 php-fpm[87800]: /rc.linkup: Hotplug event detected for MGMT(lan) static IP (10.50.1.1 )
Oct 25 11:43:33 fw-lvdc-01 check_reload_status: Reloading filter
Oct 25 11:43:33 fw-lvdc-01 check_reload_status: rc.newwanip starting lagg0
Oct 25 11:43:34 fw-lvdc-01 php-fpm[32701]: /rc.newwanip: rc.newwanip: Info: starting on lagg0.
Oct 25 11:43:34 fw-lvdc-01 php-fpm[32701]: /rc.newwanip: rc.newwanip: on (IP address: 10.50.1.1) (interface: MGMT[lan]) (real interface: lagg0).
Oct 25 11:43:34 fw-lvdc-01 check_reload_status: Reloading filter
Actions #14

Updated by Michael OBrien about 7 years ago

Was this new ticket opened? When I change LAGG interface settings via the pfSense GUI or a command prompt, my pfSense 2.4.1 box (using igb drivers) cannot ping anything on the LAGG until I completely reboot it.

I think it's this, testing nightly now: https://redmine.pfsense.org/issues/7928

Actions #15

Updated by Steve Wheeler about 7 years ago

If it didn't actually panic it's probably that MAC address issue. That should be fixed in 2.4.2 snaps now. Please report if you're still able to trigger it there.

Actions #16

Updated by Michael OBrien about 7 years ago

Steve Wheeler wrote:

If it didn't actually panic it's probably that MAC address issue. That should be fixed in 2.4.2 snaps now. Please report if you're still able to trigger it there.

Nope, 2.4.2 snapshots fixed it right up. Thanks!
