Bug #13003

closed

Malicious Driver Detection event on ``ixl(4)`` driver

Added by Marcos M over 2 years ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Category: Hardware / Drivers
Target version:
Start date:
Due date:
% Done: 100%
Estimated time:
Plus Target Version: 23.05
Release Notes: Default
Affected Version:
Affected Architecture:

Description

There have been a handful of reports of Malicious Driver Detection (MDD) events occurring with the Intel X710 NIC. The system logs show the following:

ixl10: Malicious Driver Detection event 2 on TX queue 7, pf number 0
ixl10: MDD TX event is for this function!
ixl10: WARNING: queue 7 appears to be hung!
ixl10: Malicious Driver Detection event 2 on TX queue 4, pf number 0
ixl10: WARNING: queue 4 appears to be hung!

and

Oct 29 09:47:08 kernel ixl1: Malicious Driver Detection event 2 on TX queue 769, pf number 1 (PF-1)
Oct 29 09:37:28 kernel ixl1: Malicious Driver Detection event 2 on TX queue 773, pf number 1 (PF-1)

and

kernel: ixl0: Malicious Driver Detection event 2 on TX queue 0, pf number 0 (PF-0)

and https://forum.netgate.com/topic/158415/issues-with-an-intel-x710-and-pfsense-2-4-5-p1

Some info gathered from various reports and troubleshooting:
  • Occurs on latest NIC firmware version (as of 2022-07-29).
  • Occurs anywhere from once a day to once a month.
  • Occurs on pfSense 2.4.5p1, 22.01, and 22.05.
  • Occurs with PF traffic (SR-IOV not required to be enabled).
  • Occurs with TSO/LRO disabled.
  • Occurs with copper (RJ-45) and optical transceivers.
  • Most of the issue reports have been from those running a bridge interface with ixl0 and ixl1. However, there have been multiple reports without using bridges as well.
    Increasing the buffer size on the bridge reduced the frequency of the events (went from once a day to taking 5 days before it reoccurred).
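
When comparing reports like the ones above, it can help to tally MDD events per interface and queue. The following is a minimal sketch in Python, not part of pfSense or the ixl(4) driver; the regular expression is inferred only from the log lines quoted in this description, the RX alternative is an assumption, and the function name count_mdd_events is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this report, for example:
#   ixl10: Malicious Driver Detection event 2 on TX queue 7, pf number 0
# The "RX" alternative is an assumption; only TX events appear in the report.
MDD_RE = re.compile(
    r"(?P<ifname>ixl\d+): Malicious Driver Detection event (?P<event>\d+) "
    r"on (?P<direction>TX|RX) queue (?P<queue>\d+), pf number (?P<pf>\d+)"
)

def count_mdd_events(lines):
    """Count MDD events per (interface, direction, queue) tuple."""
    counts = Counter()
    for line in lines:
        m = MDD_RE.search(line)
        if m:
            key = (m.group("ifname"), m.group("direction"), int(m.group("queue")))
            counts[key] += 1
    return counts

# Sample input copied from the description above.
sample = [
    "ixl10: Malicious Driver Detection event 2 on TX queue 7, pf number 0",
    "ixl10: MDD TX event is for this function!",
    "ixl10: WARNING: queue 7 appears to be hung!",
    "Oct 29 09:47:08 kernel ixl1: Malicious Driver Detection event 2 on TX queue 769, pf number 1 (PF-1)",
    "kernel: ixl0: Malicious Driver Detection event 2 on TX queue 0, pf number 0 (PF-0)",
]

for key, n in sorted(count_mdd_events(sample).items()):
    print(key, n)
```

Lines that do not match the pattern (such as the "appears to be hung" warnings) are skipped, so the full system log can be fed in unfiltered.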
Actions #1

Updated by Kris Phillips over 2 years ago

I saw this occur on 21.05.2 on a 7100 that had two bridged ixl interfaces on an add-in card, so it potentially affects everything from 2.4.5p1 to 22.01.

Actions #2

Updated by Christoph Vieten over 2 years ago

The same has now happened multiple times on 2.6.0 with an Intel X710-T4.
Updating the NVM from 8.15 to the latest 8.60 didn't fix the issue. Replacing the card with another X710 didn't help either.

sysctl -a | grep dev.ixl.0 | grep fw
dev.ixl.0.fw_lldp: 1
dev.ixl.0.fw_version: fw 8.6.68629 api 1.15 nvm 8.60 etid 8000bd5a oem 1.268.0

sysctl -a | grep dev.ixl.0.%desc
dev.ixl.0.%desc: Intel(R) Ethernet Controller X710/X557-AT 10GBASE-T - 2.3.1-k

It seems to affect only one of the four ports, apparently the one carrying the most traffic.

TSO is disabled by the checkbox and System => Advanced => Tunable => net.inet.tcp.tso set to 0
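
To compare firmware levels across reports, the fw_version line shown above can be split into its fields. This is a rough sketch that assumes the value is always a flat sequence of "key value" pairs, as in the sysctl output quoted in this comment; parse_fw_version is a hypothetical helper, not a driver or pfSense API.

```python
def parse_fw_version(line):
    """Split a dev.ixl.N.fw_version sysctl line into (name, fields).

    Assumes the value is a flat run of "key value" pairs, as in the
    output quoted in this comment (fw ... api ... nvm ... etid ... oem ...).
    """
    name, _, value = line.partition(": ")
    tokens = value.split()
    # Pair every even-indexed token (key) with the token after it (value).
    fields = dict(zip(tokens[0::2], tokens[1::2]))
    return name, fields

name, fields = parse_fw_version(
    "dev.ixl.0.fw_version: fw 8.6.68629 api 1.15 nvm 8.60 etid 8000bd5a oem 1.268.0"
)
print(name)
print(fields["nvm"], fields["fw"])
```

The versions are kept as strings on purpose: a value such as 8.60 should not be read as a decimal number when comparing releases.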

Actions #3

Updated by Kris Phillips over 2 years ago

Christoph,

Were you running a bridge in your configuration? The original bug report seems to suggest that this is the root cause.

Actions #4

Updated by Marcos M over 2 years ago

  • Description updated (diff)
Actions #5

Updated by Christoph Vieten over 2 years ago

Hi Kris,

No, we aren't running a bridge at all, but we are running approximately 20 VLAN interfaces on the affected port.
It looks like when the issue occurs, it is also not possible to switch to physical ports on any of the other adapters (we have three of those X710 quad-port cards in use).
However, the other ports in use (e.g. some 10G ports configured without VLAN assignments or with fewer VLANs) aren't affected by the stuck driver/firmware and can still be used.

The last time the issue occurred, we migrated the highest-traffic VLAN interfaces to separate ports, which resulted in a longer uptime that lasted until yesterday.

Has anyone tried the latest FreeBSD driver yet?
https://pkg.freebsd.org/FreeBSD:12:amd64/latest/All/intel-ix-kmod-3.3.24.pkg

Actions #6

Updated by Kris Phillips over 2 years ago

Hello Christoph,

I don't see any notes that it's been tested for this particular issue. However, the Intel ix driver was updated in 22.05. Have you tested whether this issue is gone in the latest RC? We expect 22.05 to be released very soon, so it might be worth re-testing on the latest build.

Actions #7

Updated by Marcos M over 2 years ago

  • Description updated (diff)
Actions #8

Updated by Daniel Montealvaro over 1 year ago

Good afternoon.

We have the same problem with our 1541.

We are on version 23.01.

The problem is that the ixl3 interface sometimes begins to block all traffic until we restart the firewall or the ixl3 interface.
In the system log we see the following line at the moment this happens:
ixl3: Malicious Driver Detection event 2 on TX queue 1152, pf number 3 (PF-3)

We have already tried disabling TSO (both in "System/Advanced/Networking" and in "System Tunables"), but this has not solved the problem.

The ixl0 and ixl3 interfaces are on a bridge.

Greetings.

Actions #9

Updated by Daniel Montealvaro over 1 year ago

Good afternoon.

We have tried updating the driver, disabling TSO, increasing the queues, and changing the interface, all without results; the crashes continue daily.

Mar 6 15:25:35 pfs01 kernel: ixl3: Malicious Driver Detection event 2 on TX queue 1152, pf number 3 (PF-3)
Mar 7 13:50:24 pfs01 kernel: ixl3: Malicious Driver Detection event 2 on TX queue 1152, pf number 3 (PF-3)
Mar 7 14:48:39 pfs01 kernel: ixl3: Malicious Driver Detection event 2 on TX queue 1152, pf number 3 (PF-3)
Mar 8 08:29:40 pfs01 kernel: ixl3: Malicious Driver Detection event 2 on TX queue 1152, pf number 3 (PF-3)
Mar 8 11:38:56 pfs01 kernel: ixl3: Malicious Driver Detection event 2 on TX queue 1152, pf number 3 (PF-3)
Mar 9 13:48:16 pfs01 kernel: ixl2: Malicious Driver Detection event 2 on TX queue 768, pf number 2 (PF-2)
Mar 10 12:20:29 pfs01 kernel: ixl2: Malicious Driver Detection event 2 on TX queue 768, pf number 2 (PF-2)
Mar 10 16:30:13 pfs01 kernel: ixl2: Malicious Driver Detection event 2 on TX queue 768, pf number 2 (PF-2)
Mar 10 16:39:17 pfs01 kernel: ixl2: Malicious Driver Detection event 2 on TX queue 768, pf number 2 (PF-2)
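
To put a number on how often the events recur, the gaps between consecutive timestamps can be computed from lines like the ones above. This is only a sketch: syslog timestamps carry no year, so ASSUMED_YEAR is an assumption, the timestamp pattern is inferred from the lines in this comment, and event_gaps is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta

ASSUMED_YEAR = 2023  # assumption: the log lines themselves carry no year

# Timestamp layout inferred from the lines in this comment, e.g. "Mar 6 15:25:35".
STAMP_RE = re.compile(
    r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*Malicious Driver Detection"
)

def event_gaps(lines, year=ASSUMED_YEAR):
    """Return the time deltas between consecutive MDD events."""
    times = []
    for line in lines:
        m = STAMP_RE.match(line)
        if m:
            stamp = " ".join(m.group(1).split())  # collapse padded day fields
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    times.sort()
    return [later - earlier for earlier, later in zip(times, times[1:])]

# First three events copied from the log excerpt above.
sample = [
    "Mar 6 15:25:35 pfs01 kernel: ixl3: Malicious Driver Detection event 2 on TX queue 1152, pf number 3 (PF-3)",
    "Mar 7 13:50:24 pfs01 kernel: ixl3: Malicious Driver Detection event 2 on TX queue 1152, pf number 3 (PF-3)",
    "Mar 7 14:48:39 pfs01 kernel: ixl3: Malicious Driver Detection event 2 on TX queue 1152, pf number 3 (PF-3)",
]

gaps = event_gaps(sample)
for gap in gaps:
    print(gap)
```

The sketch does not separate events by interface; filtering the input lines per interface first would give per-port intervals.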

Our NIC details are as follows:
dev.ixl.0.fw_version: fw 6.0.48442 api 1.7 nvm 6.01 etid 8000351b oem 0.0.0
dev.ixl.0.iflib.driver_version: 2.3.2-k
dev.ixl.0.%pnpinfo: vendor=0x8086 device=0x1572 subvendor=0x8086 subdevice=0x0001 class=0x020000
dev.ixl.0.%location: slot=0 function=0 dbsf=pci0:5:0:0 handle=\_SB_.PCI0.BR3A.H000
dev.ixl.0.%driver: ixl
dev.ixl.0.%desc: Intel(R) Ethernet Controller X710 for 10GbE SFP+ - 2.3.2-k

The 1541 is at version 23.01-RELEASE.

I don't understand the following answer from Netgate support, and I find it unfortunate:

Marcos
4 days ago

Hello,

It seems like you are experiencing the issue reported here:
https://redmine.pfsense.org/issues/13003

Unfortunately we do not have a workaround to provide. You may check to see if there are newer NIC firmwares available, or potentially use another vendor / NIC that uses different drivers. In either case, I suggest adding your report as a comment there and clarify that this is still an issue on 23.01 (I assume that's correct given that's what was specified on the ticket).

- - -
Marcos M.
Netgate Global Support

Actions #10

Updated by Daniel Montealvaro over 1 year ago

Today we have had a crash with the "Malicious Driver Detection" event at 10:00:26 Colombia time:
Mar 13 10:00:26 kernel ixl2: Malicious Driver Detection event 2 on TX queue 768, pf number 2 (PF-2)

The system log does not show anything conclusive:
Mar 13 10:05:59 check_reload_status 390 Reloading filter
Mar 13 10:05:59 php-fpm 360 /rc.linkup: Ignoring link event for bridge member without IP address configuration
Mar 13 10:05:58 check_reload_status 390 Reloading filter
Mar 13 10:05:58 php-fpm 360 /rc.linkup: Ignoring link event for bridge member without IP address configuration
Mar 13 10:05:58 kernel ixl2: link state changed to UP
Mar 13 10:05:58 kernel ixl2: Link is up, 10 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: False, Flow Control: None
Mar 13 10:05:58 check_reload_status 390 Linkup starting ixl2
Mar 13 10:05:57 kernel ixl2: link state changed to DOWN
Mar 13 10:05:57 check_reload_status 390 Linkup starting ixl2
Mar 13 10:05:38 php-fpm 6960 /index.php: Successful login for user 'admin' from: 47.62.22.86 (Local Database)
Mar 13 10:05:35 php-fpm 6960 /index.php: Session timed out for user 'admin' from: 47.62.22.86 (Local Database)
Mar 13 10:00:26 kernel ixl2: Malicious Driver Detection event 2 on TX queue 768, pf number 2 (PF-2)
Mar 13 08:29:00 sshguard 47986 Now monitoring attacks.
Mar 13 08:29:00 sshguard 81272 Exiting on signal.

Greetings.

Actions #11

Updated by Kristof Provost over 1 year ago

  • Status changed from New to Ready To Test

As we've not been able to reproduce this issue, the best we can do (and have done) for now is to disable the malicious driver detection mechanism. That change will appear in the next snapshot builds.

Actions #12

Updated by Jim Pingle over 1 year ago

  • Subject changed from Malicious Driver Detection event on ixl driver to Malicious Driver Detection event on ``ixl(4)`` driver
  • Status changed from Ready To Test to Feedback
  • Assignee set to Kristof Provost
  • Target version set to 2.7.0
  • % Done changed from 0 to 100
  • Plus Target Version set to 23.05
Actions #13

Updated by Jim Pingle over 1 year ago

  • Status changed from Feedback to Resolved
Actions #14

Updated by Marcos M over 1 year ago

  • Status changed from Resolved to Closed