Bug #15601: Routes with IPv6 Address as Next Hop for IPv4 Destination Causes Kernel Panic - pfSense - pfSense bugtracker

Bug #15601

closed

Routes with IPv6 Address as Next Hop for IPv4 Destination Causes Kernel Panic

Added by Kris Phillips 10 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee: Mateusz Guzik
Category: Routing
Target version: 2.8.0
% Done: 100%
Plus Target Version: 24.11
Release Notes: Default
Affected Version: All
Affected Architecture: All

Description

If a route is created that sends IPv4 traffic to an IPv6 next hop, it can cause a page fault kernel panic and crash the system.

Updated by Jim Pingle 9 months ago

Project changed from pfSense Plus to pfSense
Category changed from SNMP to Routing
Status changed from New to Feedback
Affected Plus Version deleted (24.03)

How exactly is someone making that sort of entry? It can't be made in the GUI via static routes, input validation rejects it. It can't be made at the CLI, the route command rejects it.

Updated by Kristof Provost 9 months ago

The relevant part of the (private) crash dump is this:

db:0:kdb.enter.default>  run pfs
db:1:pfs> bt
Tracing pid 12 tid 100120 td 0xfffff80005b91000
kdb_enter() at kdb_enter+0x33/frame 0xfffffe0106720800
panic() at panic+0x43/frame 0xfffffe0106720860
trap_fatal() at trap_fatal+0x40f/frame 0xfffffe01067208c0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0106720920
calltrap() at calltrap+0x8/frame 0xfffffe0106720920
--- trap 0xc, rip = 0xffffffff80d5ab70, rsp = 0xfffffe01067209f0, rbp = 0xfffffe0106720a00 ---
turnstile_broadcast() at turnstile_broadcast+0x40/frame 0xfffffe0106720a00
__rw_wunlock_hard() at __rw_wunlock_hard+0x9e/frame 0xfffffe0106720a30
nd6_resolve_slow() at nd6_resolve_slow+0x2d7/frame 0xfffffe0106720aa0
nd6_resolve() at nd6_resolve+0x125/frame 0xfffffe0106720b10
ether_output() at ether_output+0x4e7/frame 0xfffffe0106720ba0
ip_output_send() at ip_output_send+0xdc/frame 0xfffffe0106720be0
ip_output() at ip_output+0x1295/frame 0xfffffe0106720ce0
ip_forward() at ip_forward+0x3c2/frame 0xfffffe0106720d90
ip_input() at ip_input+0x705/frame 0xfffffe0106720df0
swi_net() at swi_net+0x138/frame 0xfffffe0106720e60
ithread_loop() at ithread_loop+0x257/frame 0xfffffe0106720ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe0106720f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0106720f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

addr2line points us to this code section in nd6_resolve_slow():

2455   │     /* If we have child lle, switch to the parent to send NS */
2456   │     if (lle->la_flags & LLE_CHILD) {
2457   │         struct llentry *lle_parent = lle->lle_parent;
2458   │         LLE_WUNLOCK(lle);
2459   │         lle = lle_parent;
2460   │         LLE_WLOCK(lle);
2461   │     }

The crash happens on the lock of the parent lle on line 2460. The most probable reason for this is a race between this code and the unlinking of the child/parent lle. I believe we should be acquiring the parent lock before we release the child lock.
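The ordering suggested above (take the parent lock, then release the child lock) can be illustrated with a small userspace sketch. This is not the kernel code or the proposed patch: `struct toy_lle`, the spin lock built on `atomic_flag`, and every `toy_*` function are hypothetical stand-ins invented for illustration, and the sketch says nothing about whether this ordering is compatible with the lock order used elsewhere in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct llentry; not the kernel type. */
struct toy_lle {
    atomic_flag lock;            /* set while the entry is locked */
    struct toy_lle *parent;      /* NULL unless this is a child entry */
};

static void toy_lle_init(struct toy_lle *e, struct toy_lle *parent)
{
    atomic_flag_clear(&e->lock);
    e->parent = parent;
}

static void toy_lock(struct toy_lle *e)
{
    while (atomic_flag_test_and_set(&e->lock))
        ;                        /* spin until the flag was previously clear */
}

/* Returns true if the lock was free and is now held by the caller. */
static bool toy_trylock(struct toy_lle *e)
{
    return !atomic_flag_test_and_set(&e->lock);
}

static void toy_unlock(struct toy_lle *e)
{
    atomic_flag_clear(&e->lock);
}

/*
 * Called with e locked. Returns the entry that is locked on return.
 * The parent is locked before the child is released, so there is no
 * window in which neither entry is held.
 */
static struct toy_lle *toy_switch_to_parent(struct toy_lle *e)
{
    if (e->parent != NULL) {
        struct toy_lle *parent = e->parent;

        assert(parent != e);     /* an entry must not be its own parent */
        toy_lock(parent);        /* take the parent first ... */
        toy_unlock(e);           /* ... then drop the child */
        e = parent;
    }
    return e;
}
```

Holding both locks briefly closes the gap in which another thread could unlink the entries, at the cost of requiring a consistent child-then-parent lock order everywhere the two locks are taken together.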

Updated by Kristof Provost 9 months ago

Jim Pingle wrote in #note-2:

How exactly is someone making that sort of entry? It can't be made in the GUI via static routes, input validation rejects it. It can't be made at the CLI, the route command rejects it.

I have this in my test case to at least run the relevant code path:

route add -6 -net -inet 0.0.0.0/0 -inet6 2001:db8::1

The customer's routing table also has entries like this:

10.0.0.0/24        2001:db8:42::3 UG1     21   1500   lagg0.10
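For spotting entries like the one above when scanning a routing table dump, a minimal sketch of a family-mismatch check follows. The helper names `addr_family` and `is_v4_via_v6` are hypothetical and not part of pfSense or FreeBSD; the sketch only classifies textual addresses with `inet_pton()` and does not touch the kernel routing table. Non-address destinations such as `default` are reported as `AF_UNSPEC` and therefore never match.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Return AF_INET, AF_INET6, or AF_UNSPEC for an address with optional "/len". */
static int addr_family(const char *text)
{
    char host[64];
    unsigned char buf[sizeof(struct in6_addr)];
    size_t len;

    assert(text != NULL);
    len = strcspn(text, "/");             /* ignore any prefix length */
    if (len >= sizeof(host))
        return AF_UNSPEC;
    memcpy(host, text, len);
    host[len] = '\0';

    if (inet_pton(AF_INET, host, buf) == 1)
        return AF_INET;
    if (inet_pton(AF_INET6, host, buf) == 1)
        return AF_INET6;
    return AF_UNSPEC;
}

/* True when an IPv4 destination is routed via an IPv6 gateway. */
static int is_v4_via_v6(const char *dest, const char *gateway)
{
    return addr_family(dest) == AF_INET && addr_family(gateway) == AF_INET6;
}
```

Called as `is_v4_via_v6("10.0.0.0/24", "2001:db8:42::3")`, the check flags the customer's entry quoted above.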

Updated by Kristof Provost 9 months ago

I've proposed this upstream: https://reviews.freebsd.org/D45913 and copied the original author of the relevant code.

Updated by Jim Pingle 9 months ago

Status changed from Feedback to In Progress
Assignee set to Kristof Provost

Updated by Jim Pingle 9 months ago

Target version set to 2.8.0
Plus Target Version set to 24.08
Affected Version set to All

Updated by Kris Phillips 9 months ago

Jim Pingle wrote in #note-2:

How exactly is someone making that sort of entry? It can't be made in the GUI via static routes, input validation rejects it. It can't be made at the CLI, the route command rejects it.

This route was added by FRR BGP learning a route.

Updated by Mateusz Guzik 9 months ago

Note that the instruction pointers in a backtrace tend to be one instruction off. The __rw_wunlock_hard call comes just before line 2460 and operates on the child; the parent has not been touched yet. Therefore it is the child that failed to unlock.

Normally a panic like this means the value of the lock itself is corrupted -- the fast path fails and the fallback expects there are blocked threads waiting to be woken up. The crash stems from failing to find any.

For the buggy state to occur something had to damage the lock or there is a bug in locking primitives (I'm ruling out the latter though).
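The fast path and fallback described above can be modelled with a toy write lock. This is a sketch under stated assumptions, not FreeBSD's rwlock implementation: `struct toy_rwlock`, `struct toy_waitq`, the `TOY_*` constants, and the `toy_*` functions are all hypothetical. The lock word holds the owner's id with an optional "waiters" bit; the fast path releases with a single compare-and-swap, and the fallback assumes blocked threads exist. A corrupted word makes the fast path fail even though nobody is waiting, and the toy reports that case where the real fallback would go on to use a wait queue that is not there.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_UNLOCKED ((uintptr_t)0)
#define TOY_WAITERS  ((uintptr_t)1)   /* low bit: threads are blocked */

/* Hypothetical stand-in for the queue of blocked threads. */
struct toy_waitq {
    int sleepers;
};

struct toy_rwlock {
    _Atomic uintptr_t word;           /* owner id, optionally | TOY_WAITERS */
    struct toy_waitq *waitq;          /* non-NULL only while threads block */
};

enum toy_result { TOY_FAST, TOY_SLOW_WOKE, TOY_SLOW_NO_QUEUE };

/* Put the toy lock into a chosen state (illustration and testing only). */
static void toy_rwlock_set(struct toy_rwlock *l, uintptr_t word,
    struct toy_waitq *waitq)
{
    atomic_store(&l->word, word);
    l->waitq = waitq;
}

static enum toy_result toy_wunlock(struct toy_rwlock *l, uintptr_t owner)
{
    uintptr_t expected = owner;

    assert((owner & TOY_WAITERS) == 0);   /* owner ids keep the low bit clear */

    /* Fast path: the word is exactly our id, so nobody is waiting. */
    if (atomic_compare_exchange_strong(&l->word, &expected, TOY_UNLOCKED))
        return TOY_FAST;

    /*
     * Fallback: the word was not exactly "owner", so waiters are assumed.
     * With a corrupted word there may be no queue at all; the toy reports
     * that instead of dereferencing a missing queue.
     */
    if (l->waitq == NULL)
        return TOY_SLOW_NO_QUEUE;

    l->waitq->sleepers = 0;               /* wake everyone */
    atomic_store(&l->word, TOY_UNLOCKED);
    return TOY_SLOW_WOKE;
}
```

With the word set to an owner id such as `0x1000`, the unlock returns `TOY_FAST`; with the waiters bit set and a queue attached it returns `TOY_SLOW_WOKE`; with an arbitrary damaged word and no queue it returns `TOY_SLOW_NO_QUEUE`.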

Would the customer be willing to run a kernel with certain debug facilities added? Performance should be about the same, and it should shed light on what's going on here.

I can prep everything tomorrow. It is very easy to drop a new kernel in, but I don't know whether there is a blessed way to do that here or whether the customer will need some hand-holding. I'm counting on the support team here.

#10

Updated by Mateusz Guzik 9 months ago

Assignee changed from Kristof Provost to Mateusz Guzik

#12

Updated by Jim Pingle 6 months ago

Plus Target Version changed from 24.08 to 24.11

#13

Updated by Mateusz Guzik 6 months ago

The customer was shipped two kernels. The first added some debugging and the second added a workaround for the suspected issue.

The customer claims the crashes stopped and it was confirmed they are running the kernel variant which was expected to fix the issue.

However, for a period of time they ran the debug kernel, which was expected to crash, and it did not (though it did crash eventually).

This means we don't know for sure whether the problem is mitigated. The good news is that the mitigation is harmless, so it has landed for the time being: https://gitlab.netgate.com/pfSense/FreeBSD-src/-/commit/5b6ba89cd18f370f42c72e09c750e6ae5bc9a0a6 . It reports in dmesg whenever it had to be used.

#14

Updated by Jim Pingle 6 months ago

Status changed from In Progress to Feedback
% Done changed from 0 to 100

#15

Updated by Jim Pingle 5 months ago

Status changed from Feedback to Resolved
