Bug #15601
Routes with IPv6 Address as Next Hop for IPv4 Destination Causes Kernel Panic
100%
Description
If a route can be added that sends IPv4 traffic to an IPv6 next hop, forwarding traffic over that route can cause a page fault kernel panic and crash.
Updated by Jim Pingle 5 months ago
- Project changed from pfSense Plus to pfSense
- Category changed from SNMP to Routing
- Status changed from New to Feedback
- Affected Plus Version deleted (24.03)
How exactly is someone making that sort of entry? It can't be made in the GUI via static routes, input validation rejects it. It can't be made at the CLI, the route command rejects it.
Updated by Kristof Provost 5 months ago
The relevant bits from the (private) crash dump are these:
db:0:kdb.enter.default> run pfs
db:1:pfs> bt
Tracing pid 12 tid 100120 td 0xfffff80005b91000
kdb_enter() at kdb_enter+0x33/frame 0xfffffe0106720800
panic() at panic+0x43/frame 0xfffffe0106720860
trap_fatal() at trap_fatal+0x40f/frame 0xfffffe01067208c0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0106720920
calltrap() at calltrap+0x8/frame 0xfffffe0106720920
--- trap 0xc, rip = 0xffffffff80d5ab70, rsp = 0xfffffe01067209f0, rbp = 0xfffffe0106720a00 ---
turnstile_broadcast() at turnstile_broadcast+0x40/frame 0xfffffe0106720a00
__rw_wunlock_hard() at __rw_wunlock_hard+0x9e/frame 0xfffffe0106720a30
nd6_resolve_slow() at nd6_resolve_slow+0x2d7/frame 0xfffffe0106720aa0
nd6_resolve() at nd6_resolve+0x125/frame 0xfffffe0106720b10
ether_output() at ether_output+0x4e7/frame 0xfffffe0106720ba0
ip_output_send() at ip_output_send+0xdc/frame 0xfffffe0106720be0
ip_output() at ip_output+0x1295/frame 0xfffffe0106720ce0
ip_forward() at ip_forward+0x3c2/frame 0xfffffe0106720d90
ip_input() at ip_input+0x705/frame 0xfffffe0106720df0
swi_net() at swi_net+0x138/frame 0xfffffe0106720e60
ithread_loop() at ithread_loop+0x257/frame 0xfffffe0106720ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe0106720f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0106720f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
addr2line points us to this code section in nd6_resolve_slow():
2455 │ /* If we have child lle, switch to the parent to send NS */
2456 │ if (lle->la_flags & LLE_CHILD) {
2457 │     struct llentry *lle_parent = lle->lle_parent;
2458 │     LLE_WUNLOCK(lle);
2459 │     lle = lle_parent;
2460 │     LLE_WLOCK(lle);
2461 │ }
The crash happens on the lock of the parent lle on line 2460. The most probable reason for this is a race between this code and the unlinking of the child/parent lle. I believe we should be acquiring the parent lock before we release the child lock.
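The ordering proposed above can be sketched in user space. This is only an illustration under stated assumptions: `struct entry`, `switch_to_parent()` and `handoff_demo()` are hypothetical names, and pthread mutexes stand in for the kernel's lle write locks. It is not the patch that was proposed upstream, and it only shows the ordering; it does not establish that this race is the cause of the panic.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct llentry; not the kernel definition. */
struct entry {
	pthread_mutex_t lock;
	struct entry *parent;	/* NULL for a parent entry */
	int is_child;
};

/*
 * Switch from a locked child entry to its parent. The parent lock is
 * taken before the child lock is dropped, so there is no window in
 * which neither entry is held. Returns the entry that is now locked.
 */
static struct entry *
switch_to_parent(struct entry *e)
{
	if (e->is_child && e->parent != NULL) {
		struct entry *parent = e->parent;

		pthread_mutex_lock(&parent->lock);	/* acquire parent first */
		pthread_mutex_unlock(&e->lock);		/* then release child */
		return (parent);
	}
	return (e);
}

/* Small self-check: returns 0 if the handoff left the expected lock state. */
int
handoff_demo(void)
{
	struct entry parent, child;
	struct entry *cur;
	int parent_busy, child_free;

	pthread_mutex_init(&parent.lock, NULL);
	parent.parent = NULL;
	parent.is_child = 0;
	pthread_mutex_init(&child.lock, NULL);
	child.parent = &parent;
	child.is_child = 1;

	pthread_mutex_lock(&child.lock);
	cur = switch_to_parent(&child);

	/* The parent is held now, so a trylock must report EBUSY. */
	parent_busy = (pthread_mutex_trylock(&parent.lock) == EBUSY);
	/* The child was released, so a trylock must succeed. */
	child_free = (pthread_mutex_trylock(&child.lock) == 0);

	if (child_free)
		pthread_mutex_unlock(&child.lock);
	if (parent_busy)
		pthread_mutex_unlock(&parent.lock);
	pthread_mutex_destroy(&child.lock);
	pthread_mutex_destroy(&parent.lock);

	return ((cur == &parent && parent_busy && child_free) ? 0 : 1);
}
```

Holding both locks briefly means the lock order between child and parent has to be consistent everywhere else they are taken together; the sketch does not address that.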
Updated by Kristof Provost 5 months ago
Jim Pingle wrote in #note-2:
How exactly is someone making that sort of entry? It can't be made in the GUI via static routes, input validation rejects it. It can't be made at the CLI, the route command rejects it.
I have this in my test case to at least run the relevant code path:
route add -6 -net -inet 0.0.0.0/0 -inet6 2001:db8::1
The customer's routing table also has entries like this:
10.0.0.0/24 2001:db8:42::3 UG1 21 1500 lagg0.10
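As a rough illustration of what distinguishes these entries, the following sketch flags a destination/gateway pair in which the destination is IPv4 and the gateway is IPv6. The helper names `addr_family()` and `is_v4_via_v6()` are hypothetical and this is not part of pfSense or FreeBSD; it relies only on the standard `inet_pton()` call and assumes the destination is written as an address with an optional `/prefix` (so a keyword such as `default` is not recognized).

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

/*
 * Return AF_INET or AF_INET6 for the address part of "addr" or
 * "addr/prefixlen", or -1 if it parses as neither.
 */
static int
addr_family(const char *text)
{
	char buf[INET6_ADDRSTRLEN];
	unsigned char bin[sizeof(struct in6_addr)];
	size_t n = strcspn(text, "/");	/* length up to any prefix length */

	if (n >= sizeof(buf))
		return (-1);
	memcpy(buf, text, n);
	buf[n] = '\0';

	if (inet_pton(AF_INET, buf, bin) == 1)
		return (AF_INET);
	if (inet_pton(AF_INET6, buf, bin) == 1)
		return (AF_INET6);
	return (-1);
}

/* Return 1 if the destination is IPv4 and the gateway is IPv6, else 0. */
int
is_v4_via_v6(const char *dest, const char *gateway)
{
	return (addr_family(dest) == AF_INET &&
	    addr_family(gateway) == AF_INET6);
}
```

With the entry above, `is_v4_via_v6("10.0.0.0/24", "2001:db8:42::3")` returns 1, while a pair with an IPv4 gateway returns 0.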
Updated by Kristof Provost 5 months ago
I've proposed this upstream: https://reviews.freebsd.org/D45913 and copied the original author of the relevant code.
Updated by Jim Pingle 5 months ago
- Status changed from Feedback to In Progress
- Assignee set to Kristof Provost
Updated by Jim Pingle 4 months ago
- Target version set to 2.8.0
- Plus Target Version set to 24.08
- Affected Version set to All
Updated by Kris Phillips 4 months ago
Jim Pingle wrote in #note-2:
How exactly is someone making that sort of entry? It can't be made in the GUI via static routes, input validation rejects it. It can't be made at the CLI, the route command rejects it.
This route was added by FRR BGP learning a route.
Updated by Mateusz Guzik 4 months ago
Note that the instruction pointers in these backtraces tend to be one instruction off. The __rw_wunlock_hard call comes just before the reported line and operates on the child -- the parent has not been touched yet. Therefore it is the child that failed to unlock.
Normally a panic like this means the value of the lock itself is corrupted -- the fast path fails and the fallback expects there are blocked threads waiting to be woken up. The crash stems from failing to find any.
For the buggy state to occur something had to damage the lock or there is a bug in locking primitives (I'm ruling out the latter though).
Would the customer be willing to run a kernel with certain debug facilities added? Performance should be about the same, and it should shed light on what is going on here.
I can prep everything tomorrow. It is very easy to drop a new kernel in, but I don't know whether there is a blessed way to do it here or whether the customer will need some hand-holding. I'm counting on the support team for this.
Updated by Mateusz Guzik 4 months ago
- Assignee changed from Kristof Provost to Mateusz Guzik
Updated by Jim Pingle about 1 month ago
- Plus Target Version changed from 24.08 to 24.11
Updated by Mateusz Guzik 29 days ago
The customer was shipped two kernels. The first added some debugging and the second added a workaround for the suspected issue.
The customer claims the crashes stopped and it was confirmed they are running the kernel variant which was expected to fix the issue.
However, there was a period during which they were running the debug kernel, which was expected to crash, and it did not (it did crash eventually).
This means we don't know for sure whether the problem is mitigated. The good news is that the mitigation is harmless, so it has landed for the time being: https://gitlab.netgate.com/pfSense/FreeBSD-src/-/commit/5b6ba89cd18f370f42c72e09c750e6ae5bc9a0a6 . It will report in dmesg whenever it had to be used.
Updated by Jim Pingle 29 days ago
- Status changed from In Progress to Feedback
- % Done changed from 0 to 100