Regression #14180
ConnectX-4 LX MCX4121A-ACAT - VT-d passthrough of both ports, virtualized pfSense fails to boot due to mlx5 driver errors
Description
I've been running the following configuration for months now:
Hypervisor:
- Linux kernel 5.15
- libvirt/qemu/kvm
pfSense VM:
- i440fx machine type
- VT-d passthrough of both ports of the MCX4121A-ACAT
IOMMU/ACS grouping is fine on the Supermicro X11SPL-F server mainboard.
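For reference, the IOMMU grouping can be sanity-checked on the Linux host before attempting passthrough. A minimal sketch in Python (assumes sysfs at the usual /sys path; the PCI addresses of the ConnectX-4 ports will of course differ per system):

#!/usr/bin/env python3
# List every IOMMU group and the PCI devices it contains, to confirm the
# two ConnectX-4 ports are isolated well enough for VT-d passthrough.
import os

IOMMU_ROOT = "/sys/kernel/iommu_groups"

for group in sorted(os.listdir(IOMMU_ROOT), key=int):
    devices = sorted(os.listdir(os.path.join(IOMMU_ROOT, group, "devices")))
    print(f"IOMMU group {group}: {' '.join(devices)}")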
After updating from pfSense CE 2.6.0 to Plus 23.01, it no longer works.
Libvirt successfully starts the VM with the PCI devices (both ports of the network adapter) passed through (VT-d).
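To illustrate how the passthrough is wired up (only a sketch using the libvirt Python bindings, not the exact domain definition; the domain name "pfsense" is an assumption), the hostdev entries of the running VM can be listed like this:

# Print the PCI addresses handed to the guest via <hostdev> passthrough.
import libvirt
import xml.etree.ElementTree as ET

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("pfsense")  # assumed domain name
root = ET.fromstring(dom.XMLDesc())
for addr in root.findall("./devices/hostdev/source/address"):
    print("passthrough PCI address:", addr.get("domain"), addr.get("bus"),
          addr.get("slot"), addr.get("function"))
conn.close()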
At kernel bootup I see these error messages:
mlx5_core1: WARN: wait_func:967:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: give_pages:354:(pid 0): func_id 0x0, npages 1241, err -60
mlx5_core1: WARN: wait_func:967:(pid 0): CREATE_EQ(0x301) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: wait_func:967:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: mlx5_destroy_unmap_eq:523:(pid 0): failed to destroy a previously created eq: eqn 7
mlx5_core1: WARN: wait_func:967:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: give_pages:375:(pid 0): page notify failed
mlx5_core1: WARN: free_comp_eqs:671:(pid 0): failed to destroy EQ 0x7
mlx5_core1: WARN: pages_work_handler:475:(pid 0): give fail -60
mlx5_core1: WARN: wait_func:967:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
mlx5_core1: ERR: reclaim_pages:444:(pid 0): failed reclaiming pages
mlx5_core1: WARN: pages_work_handler:475:(pid 0): reclaim fail -60
mlx5_core1: WARN: wait_func:967:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: mlx5_destroy_unmap_eq:523:(pid 0): failed to destroy a previously created eq: eqn 8
mlx5_core1: WARN: free_comp_eqs:671:(pid 0): failed to destroy EQ 0x8
mlx5_core1: WARN: wait_func:967:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
mlx5_core1: ERR: reclaim_pages:444:(pid 0): failed reclaiming pages
mlx5_core1: WARN: pages_work_handler:475:(pid 0): reclaim fail -60
mlx5_core1: WARN: wait_func:967:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: mlx5_destroy_unmap_eq:523:(pid 0): failed to destroy a previously created eq: eqn 9
mlx5_core1: WARN: free_comp_eqs:671:(pid 0): failed to destroy EQ 0x9
mlx5_core1: WARN: wait_func:967:(pid 0): DEALLOC_UAR(0x803) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: up_rel_func:89:(pid 0): failed to free uar index 16
Sometimes it boots fine; sometimes the error messages appear and the boot never progresses to the point where the OS actually starts; and sometimes pfSense does start, but the network interfaces aren't available and it asks me to manually reassign the configuration interfaces.
This makes it unusable for me at the moment.
Updated by Jim Pingle over 1 year ago
- Status changed from New to Feedback
The error messages are different, so this may not be the same issue, but over on the TNSR side we have seen MLX behavior change depending on the specific OS/driver version and the firmware on the MLX cards. For example, upgrading to a new OS version may mean either upgrading or downgrading the firmware on the MLX card.
It may be that the way FreeBSD 14 talks to the card requires a different MLX firmware version, and there may not be anything actionable in the OS that can fix it.
If you do find that is the case, we can note it in the docs somewhere, as we did for TNSR.
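One way to see which firmware a ConnectX port currently runs is ethtool's driver info, read on the Linux host while mlx5_core still owns the port (i.e. before it is detached for passthrough). A minimal sketch in Python; the interface name enp1s0f0 is only an example:

# Print the firmware version reported by "ethtool -i <interface>" for an mlx5 port.
import subprocess

iface = "enp1s0f0"  # example name; substitute the actual ConnectX-4 port
info = subprocess.run(["ethtool", "-i", iface],
                      capture_output=True, text=True, check=True).stdout
for line in info.splitlines():
    if line.startswith("firmware-version:"):
        print(iface, line)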
Updated by Jordan G about 1 year ago
See if booting EFI makes any difference with your setup - https://docs.netgate.com/pfsense/en/latest/recipes/virtualize-proxmox-ve.html#booting-uefi
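A quick way to confirm whether the domain is already set up for UEFI (OVMF) rather than SeaBIOS - again only a sketch with the libvirt Python bindings, and the domain name "pfsense" is assumed:

# Report the firmware loader configured for a libvirt domain.
import libvirt
import xml.etree.ElementTree as ET

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("pfsense")  # assumed domain name
loader = ET.fromstring(dom.XMLDesc()).find("./os/loader")
if loader is not None and loader.text:
    print("firmware loader in use:", loader.text)
else:
    print("no <loader> element found - domain likely boots legacy SeaBIOS")
conn.close()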
Updated by name name about 1 year ago
Hi, thanks for looking into it.
My setup was already EFI-based. I've long since abandoned the Mellanox card and am using an X710-DA2/4, depending on the system. The newer Intel E810 models also had problems under FreeBSD 14.
As I have neither the time nor the will to mess with my production system at the moment, I can't really help much here. Perhaps the problem went away with newer pfSense Plus versions, perhaps not; I haven't tested it.