Bug #5428

Frequent IPv6 panic on 2.3 - May be log-related

Added by Jim Pingle almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Very High
Assignee:
Category: Operating System
Target version:
Start date: 11/12/2015
Due date:
% Done: 0%
Estimated time:
Affected Version: 2.3
Affected Architecture: amd64

Description

Hit a panic on 2.3 twice now, ping me for the full trace.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x28
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80ba8a8b
stack pointer           = 0x28:0xfffffe003f3d3f40
frame pointer           = 0x28:0xfffffe003f3d3f50
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (irq256: igb0:que 0)
[ thread pid 12 tid 100028 ]
db:0:kdb.enter.default>  show pcpu
cpuid        = 0
dynamic pcpu = 0x502400
curthread    = 0xfffff800034c9940: pid 12 "irq256: igb0:que 0" 
curpcb       = 0xfffffe003f3d4cc0
fpcurthread  = none
idlethread   = 0xfffff80003290000: tid 100003 "idle: cpu0" 
curpmap      = 0xffffffff822b4928
tssp         = 0xffffffff822cf790
commontssp   = 0xffffffff822cf790
rsp0         = 0xfffffe003f3d4cc0
gs32p        = 0xffffffff822d11e8
ldt          = 0xffffffff822d1228
tss          = 0xffffffff822d1218
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100028 td 0xfffff800034c9940
strlen() at strlen+0xb/frame 0xfffffe003f3d3f50
kvprintf() at kvprintf+0xf9c/frame 0xfffffe003f3d4060
_vprintf() at _vprintf+0x8d/frame 0xfffffe003f3d4140
log() at log+0x5c/frame 0xfffffe003f3d41a0
ip6_forward() at ip6_forward+0x119/frame 0xfffffe003f3d42f0
pf_refragment6() at pf_refragment6+0x16e/frame 0xfffffe003f3d43b0
pf_test6() at pf_test6+0x1448/frame 0xfffffe003f3d46e0
pf_check6_out() at pf_check6_out+0x1d/frame 0xfffffe003f3d4700
pfil_run_hooks() at pfil_run_hooks+0x8d/frame 0xfffffe003f3d4790
bridge_pfil() at bridge_pfil+0x25b/frame 0xfffffe003f3d4810
bridge_broadcast() at bridge_broadcast+0x22c/frame 0xfffffe003f3d4880
bridge_forward() at bridge_forward+0x245/frame 0xfffffe003f3d48e0
bridge_input() at bridge_input+0x2a0/frame 0xfffffe003f3d4950
ether_nh_input() at ether_nh_input+0x2a5/frame 0xfffffe003f3d49b0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe003f3d4a20
igb_rxeof() at igb_rxeof+0x63b/frame 0xfffffe003f3d4ad0
igb_msix_que() at igb_msix_que+0x16d/frame 0xfffffe003f3d4b20
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe003f3d4b60
ithread_loop() at ithread_loop+0x96/frame 0xfffffe003f3d4bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe003f3d4bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003f3d4bf0
--- trap 0, rip = 0, rsp = 0xfffffe003f3d4cb0, rbp = 0 ---

I may have a means to reproduce it, but as it is my prod edge firewall I'm not too keen on testing that theory frequently.

Luiz said "I'll take a look, seems that something is wrong with log (a possible null string is passed to strlen)"

History

#1 Updated by Jim Thompson almost 4 years ago

There are only two calls to log() in ip6_forward():

https://github.com/pfsense/FreeBSD-src/blob/13010d6b0da4d97e56243edbea0a585b8285cd3e/sys/netinet6/ip6_forward.c#L391


        if (V_ip6_log_time + V_ip6_log_interval < time_uptime) {
            V_ip6_log_time = time_uptime;
            log(LOG_DEBUG,
                "cannot forward " 
                "src %s, dst %s, nxt %d, rcvif %s, outif %s\n",
                ip6_sprintf(ip6bufs, &ip6->ip6_src),
                ip6_sprintf(ip6bufd, &ip6->ip6_dst),
                ip6->ip6_nxt,
                if_name(m->m_pkthdr.rcvif), if_name(rt->rt_ifp));
        } 

This occurs if a packet can't be delivered to its destination because the destination is beyond the scope of the source address.

and

https://github.com/pfsense/FreeBSD-src/blob/13010d6b0da4d97e56243edbea0a585b8285cd3e/sys/netinet6/ip6_forward.c#L126

        if (V_ip6_log_time + V_ip6_log_interval < time_uptime) {
            V_ip6_log_time = time_uptime;
            log(LOG_DEBUG,
                "cannot forward " 
                "from %s to %s nxt %d received on %s\n",
                ip6_sprintf(ip6bufs, &ip6->ip6_src),
                ip6_sprintf(ip6bufd, &ip6->ip6_dst),
                ip6->ip6_nxt,
                if_name(m->m_pkthdr.rcvif));
        }

This occurs if the destination address is broadcast or multicast, or if the source address is "unspecified" (the IPv6 counterpart of INADDR_ANY).

Neither should cause a crash, but... are you sure you don't have something in pf that is rewriting to crap?
The presence of 'bridge' makes me itch.

Having a better stack trace would be good.

And your config, or at least your pf.conf.

#2 Updated by Luiz Souza almost 4 years ago

I think m->m_pkthdr.rcvif is undefined here. The interface name can't be NULL (because that case is correctly handled in sys/kern/subr_prf.c); it is probably pointing to a random address:

        case 's':
            p = va_arg(ap, char *);
            if (p == NULL)
                p = "(null)";
            if (!dot)
                n = strlen (p);
            else
                for (n = 0; n < dwidth && p[n]; n++)
                    continue;

https://git.pfmechanics.com/luiz/freebsd-src/commit/85293302a79ecc6eeb30adeb236a8678f9287e16

This is my first try at this problem, but I'm not sure (yet) if it is correct or complete. I'll arrange for JimP to get a snapshot with this change.
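
One possible reading of the trace above, offered as a sketch and not as the content of the linked commit: if if_name() expands to a plain member access on the ifnet pointer, then a NULL rcvif yields a small non-NULL address (the offset of the name field). That address slips past the p == NULL check in kvprintf() and faults in strlen(). The struct layout below is an assumption modeled on FreeBSD 10's struct ifnet, and every sketch_* name is hypothetical. On an LP64 platform the name field in this layout sits at offset 0x28, the same value as the fault virtual address in the report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_IFNAMSIZ 16

/*
 * Hypothetical stand-in for struct ifnet. The field order is an assumption
 * (three pointers, a two-pointer list entry, then the name), not a copy of
 * the kernel header.
 */
struct sketch_ifnet {
    void *if_softc;
    void *if_l2com;
    void *if_vnet;
    struct { void *tqe_next; void *tqe_prev; } if_link;
    char if_xname[SKETCH_IFNAMSIZ];
};

/*
 * Address that an unguarded (ifp)->if_xname would produce when ifp is NULL:
 * simply the offset of the name field. It is non-zero, so a "p == NULL"
 * test on the resulting string pointer does not catch it.
 */
static size_t
sketch_null_ifp_name_addr(void)
{
    return offsetof(struct sketch_ifnet, if_xname);
}

/* Guarded accessor: check the interface pointer itself, not the string. */
static const char *
sketch_if_name(const struct sketch_ifnet *ifp)
{
    return (ifp != NULL) ? ifp->if_xname : "(null)";
}
```

The design point is that the NULL test has to happen on the interface pointer before the member access; once the name pointer has been computed, the information that the interface was NULL is already lost.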

#3 Updated by Jim Pingle almost 4 years ago

Not sure why yet but it appears to be tied to Bonjour traffic on the local network. Disabling or enabling the Bonjour plugin in Pidgin crashes the firewall instantly for me and for Renato, though we have not as yet replicated that in a test structure. More info coming as we dig deeper.

#4 Updated by Jim Pingle almost 4 years ago

I checked a copy of a packet capture containing traffic that crashes the firewall into the ESFprojects repo under redmine-5428/

#5 Updated by Jim Pingle almost 4 years ago

A few more details, getting closer to the heart of it:

  • The crash appears to require the presence of a bridge on the receiving interface (e.g. LAN and LAN2 bridged, packet comes in LAN)
  • The WAN type does not matter (dhcp6, HE.net, etc)
  • Replaying the capture file to a system configured with the above scenario will cause it to crash, so there is no need to fiddle with clients to replicate the problem (e.g. from a Linux client, run "sudo tcpreplay --intf1=eth0 ipv6-crash-redmine-#5428.pcapng")
  • Judging by the contents of the packet capture, it's a fragmented v6 multicast packet causing it
  • The traffic must be passed by IPv6 firewall rules (make sure the rule allows all from any/to any) -- if the traffic is blocked it does not crash.

Renato applied a patch from Luiz and tested it and received the following log message (and no crash):

Nov 18 13:02:05 pfgarga kernel: cannot forward from fe80::3e15:c2ff:fec8:1414 to ff02::fb nxt 44 received on (null)
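
The logged destination, ff02::fb, is a multicast address, which fits the early "cannot forward" branch described in note #1 (destination broadcast or multicast, or source unspecified). As a rough userland illustration of the address part of that condition only, the sketch below uses the standard IN6_IS_ADDR_MULTICAST and IN6_IS_ADDR_UNSPECIFIED macros. The helper name is hypothetical, and the sketch omits the kernel's mbuf flag check for link-layer broadcast and multicast.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true when the textual source/destination pair
 * would match the address conditions described for the early "cannot
 * forward" log call. Returns false if either address fails to parse.
 */
static bool
sketch_cannot_forward(const char *src_text, const char *dst_text)
{
    struct in6_addr src, dst;

    if (inet_pton(AF_INET6, src_text, &src) != 1 ||
        inet_pton(AF_INET6, dst_text, &dst) != 1)
        return false;

    /* Multicast destination, or unspecified (::) source. */
    return IN6_IS_ADDR_MULTICAST(&dst) || IN6_IS_ADDR_UNSPECIFIED(&src);
}
```

Calling sketch_cannot_forward("fe80::3e15:c2ff:fec8:1414", "ff02::fb") with the addresses from the log line above takes the multicast path.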

#6 Updated by Renato Botelho almost 4 years ago

  • Status changed from New to Feedback

#7 Updated by Jim Pingle almost 4 years ago

  • Status changed from Feedback to Resolved

This is working OK now. I updated my edge firewall that originally showed the problem and replayed the traffic against it: no crash. The system log showed the expected log message with the interface filled in. Looks like we can close it.
