Regression #14180

ConnectX-4 LX MCX4121A-ACAT - VT-d passthrough of both ports, virtualized pfSense fails to boot due to mlx5 driver errors

Added by name name about 1 year ago. Updated 5 months ago.

Status:
Feedback
Priority:
Normal
Assignee:
-
Category:
Hardware / Drivers
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Release Notes:
Default
Affected Plus Version:
23.01
Affected Architecture:
amd64

Description

I've been running the following configuration for months now:

Hypervisor:

Linux Kernel 5.15
libvirt/qemu/kvm

pfSense VM:
i440fx
VT-d passthrough of both ports of MCX4121A-ACAT
IOMMU/ACS are all fine on the Supermicro server mainboard X11SPL-F
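
For reference, the passthrough portion of such a libvirt domain definition looks roughly like the sketch below. The `04:00.0`/`04:00.1` source addresses are placeholders for the two ports of the ConnectX-4 LX, not the reporter's actual PCI addresses:

```xml
<!-- Sketch: one <hostdev> element per port for VT-d passthrough.
     The 0x04/0x00 bus/slot values are assumptions; use the addresses
     shown by `lspci` for your card. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
  </source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
  </source>
</hostdev>
```

With `managed='yes'`, libvirt detaches the devices from the host's mlx5_core driver and binds them to vfio-pci automatically when the VM starts.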

After updating from CE 2.6.0 to Plus 23.01, the VM no longer boots reliably.

Libvirt successfully starts the VM with the PCI devices (both ports of the network adapter) passed through (VT-d).

At kernel bootup I see these error messages:

mlx5_core1: WARN: wait_func:967:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: give_pages:354:(pid 0): func_id 0x0, npages 1241, err -60
mlx5_core1: WARN: wait_func:967:(pid 0): CREATE_EQ(0x301) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: wait_func:967:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: mlx5_destroy_unmap_eq:523:(pid 0): failed to destroy a previously created eq: eqn 7
mlx5_core1: WARN: wait_func:967:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: give_pages:375:(pid 0): page notify failed
mlx5_core1: WARN: free_comp_eqs:671:(pid 0): failed to destroy EQ 0x7
mlx5_core1: WARN: pages_work_handler:475:(pid 0): give fail -60
mlx5_core1: WARN: wait_func:967:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
mlx5_core1: ERR: reclaim_pages:444:(pid 0): failed reclaiming pages
mlx5_core1: WARN: pages_work_handler:475:(pid 0): reclaim fail -60
mlx5_core1: WARN: wait_func:967:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: mlx5_destroy_unmap_eq:523:(pid 0): failed to destroy a previously created eq: eqn 8
mlx5_core1: WARN: free_comp_eqs:671:(pid 0): failed to destroy EQ 0x8
mlx5_core1: WARN: wait_func:967:(pid 0): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
mlx5_core1: ERR: reclaim_pages:444:(pid 0): failed reclaiming pages
mlx5_core1: WARN: pages_work_handler:475:(pid 0): reclaim fail -60
mlx5_core1: WARN: wait_func:967:(pid 0): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: mlx5_destroy_unmap_eq:523:(pid 0): failed to destroy a previously created eq: eqn 9
mlx5_core1: WARN: free_comp_eqs:671:(pid 0): failed to destroy EQ 0x9
mlx5_core1: WARN: wait_func:967:(pid 0): DEALLOC_UAR(0x803) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: up_rel_func:89:(pid 0): failed to free uar index 16

The behavior varies from boot to boot: sometimes the VM boots fine; sometimes the error messages above appear and the boot never progresses to the point where the OS actually starts; and sometimes pfSense does start, but the network interfaces aren't available and it asks me to manually reassign the configuration interfaces.

This makes it unusable for me at the moment.

Actions #1

Updated by Jim Pingle about 1 year ago

  • Status changed from New to Feedback

The error messages are different, so this may not be the case, but on the TNSR side we have seen MLX behavior change depending on the specific OS/driver version and the firmware on the MLX cards. For example, upgrading to a new OS version may mean either upgrading or downgrading the firmware on the MLX card.

It might be that the way FreeBSD 14 talks to the card requires a different MLX firmware version, and there may not be anything actionable in the OS that can solve it.

If you do find that is the case, we can note it in the docs somewhere, as we did for TNSR.
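
If checking the card's firmware version is useful, Mellanox's `mstflint` tool can query it from the host before the device is handed to the guest. This is a diagnostic sketch, not part of the report; the `04:00.0` PCI address is an assumption:

```shell
# Sketch: query the ConnectX-4 LX firmware version from the hypervisor.
# Find the card's PCI address first, then query it with mstflint.
# (04:00.0 below is a placeholder -- substitute your own address.)
lspci -D | grep -i mellanox
mstflint -d 04:00.0 query
```

The `query` output includes the running firmware version and the card's PSID, which is what you would compare against Mellanox's release notes when deciding whether to flash a different firmware.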

Actions #2

Updated by Jordan G 6 months ago

See if it makes any difference booting EFI with your setup: https://docs.netgate.com/pfsense/en/latest/recipes/virtualize-proxmox-ve.html#booting-uefi
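
For a libvirt guest, booting via UEFI means pointing the domain at an OVMF loader, roughly as sketched below. The firmware and NVRAM paths are distro-dependent assumptions (these are common on Debian/Ubuntu hosts):

```xml
<!-- Sketch: UEFI boot via OVMF in a libvirt domain.
     Loader/NVRAM paths vary by distribution and are assumptions here. -->
<os>
  <type arch='x86_64' machine='q35'>hvm</type>
  <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
  <nvram>/var/lib/libvirt/qemu/nvram/pfsense_VARS.fd</nvram>
</os>
```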

Actions #3

Updated by name name 5 months ago

Hi, thanks for looking into it.

My setup was already EFI-based. I've long since abandoned the Mellanox card and am using an X710-DA2/4, depending on the system. The newer Intel E810 models had problems under FreeBSD 14 as well.

As I have neither the time nor the inclination to experiment with my production system at the moment, I can't really help further here. Perhaps the problem went away with newer pfSense Plus versions, perhaps not; I haven't tested it.
