Bug #6223

IPsec + OpenBGPD fails with "PF_KEY socket: No buffer space available"

Added by sht head 10 months ago. Updated 4 days ago.

Status: Confirmed
Priority: Normal
Assignee:
Category: IPsec
Target version:
Start date: 04/21/2016
Due date:
% Done: 0%
Affected version: 2.3.x
Affected Architecture:

Description

When using pfSense 2.3 with the OpenBGPD package, IPsec tunnels drop out and never reconnect until the server is rebooted. Killing the IPsec processes and starting the service again does not fix the issue; the only fix I have found is a reboot. Removing the OpenBGPD package fixes the problem.

There is discussion on the forum about this issue here: https://forum.pfsense.org/index.php?topic=109908

charon-pfkey-event-buffer.patch (783 Bytes) Jim Pingle, 09/06/2016 03:17 PM

ipsecmon.sh - PFSense ipsec bug monitoring + bounce script (1.63 KB) Firstname Surname, 11/28/2016 08:41 AM

ipsecmon-jc.diff (4.39 KB) James Cornman, 12/13/2016 07:51 AM

ipsecmon.sh (2.98 KB) James Cornman, 12/13/2016 07:52 AM

Associated revisions

Revision ef885517
Added by Chris Buechler 10 months ago

Bump net.raw.recvspace and sendspace defaults. Ticket #6223

Revision c81678f4
Added by Chris Buechler 10 months ago

Bump net.raw.recvspace and sendspace defaults. Ticket #6223

Revision 48a8235e
Added by Chris Buechler 10 months ago

Bump net.inet.raw.recvspace and net.inet.raw.maxdgram by default. Ticket #6223

Revision 9fabe2c7
Added by Chris Buechler 10 months ago

Bump net.inet.raw.recvspace and net.inet.raw.maxdgram by default. Ticket #6223

History

#1 Updated by Chris Buechler 10 months ago

  • Subject changed from IPSEC with OpenBGPD Package to IPsec + OpenBGPD fails with "PF_KEY socket: No buffer space available"
  • Status changed from New to Confirmed
  • Assignee set to Chris Buechler

I'm looking into a good repeatable test case for this.

#2 Updated by Michael van der Weg 10 months ago

Chris Buechler wrote:

I'm looking into a good repeatable test case for this.

Hi, I'm affected by this bug too. I can provide a test IPsec endpoint using an AWS VPC and the required configuration for the pfSense side. If you wish, I can also provide a configured pfSense box with root access that is affected by the issue.

#3 Updated by Chris Buechler 10 months ago

  • Target version set to 2.3.1

setting net.raw.recvspace=16384, twice the default, has been confirmed to fix this and one other unrelated IPsec failure where there was the same PF_KEY socket error.

#4 Updated by sht head 10 months ago

Chris Buechler wrote:

setting net.raw.recvspace=16384, twice the default, has been confirmed to fix this and one other unrelated IPsec failure where there was the same PF_KEY socket error.

I upgraded my standby server to 2.3 again and set this in System Tunables, then rebooted the server.

It's been about an hour and it looks like my tunnels have all started to drop off again one by one, so this has not fixed it for me.

#5 Updated by Chris Buechler 10 months ago

  • Status changed from Confirmed to Feedback

I set it as committed on "shthead"'s system and it seems to be fine.

#6 Updated by Michael OBrien 10 months ago

Chris Buechler wrote:

I set it as committed on "shthead"'s system and it seems to be fine.

Still having this issue (running OpenBGPd + IPSec - transport phase 2 with GRE tunnels) after changing tunable to 16384, then 131072 per another recommendation online.

More:
https://forum.pfsense.org/index.php?topic=109908.30

#7 Updated by Chris Buechler 10 months ago

Michael OBrien wrote:

Still having this issue (running OpenBGPd + IPSec - transport phase 2 with GRE tunnels) after changing tunable to 16384, then 131072 per another recommendation online.

That's not as committed here. Set all 4 as done in the commits here, or upgrade to 2.3.1.

#8 Updated by Chris Buechler 10 months ago

  • Status changed from Feedback to Confirmed
  • Target version changed from 2.3.1 to 2.3.2

Those changes helped in some instances of this, but definitely don't fix the problem for all.

#9 Updated by Michael OBrien 8 months ago

Chris Buechler wrote:

Michael OBrien wrote:

Still having this issue (running OpenBGPd + IPSec - transport phase 2 with GRE tunnels) after changing tunable to 16384, then 131072 per another recommendation online.

That's not as committed here. Set all 4 as done in the commits here, or upgrade to 2.3.1.

Same issue after upgrading to 2.3.1_5. Any idea whether this will be resolved in 2.3.2 or 2.4.x (FreeBSD 11, right?)

#10 Updated by Chris Buechler 8 months ago

  • Assignee deleted (Chris Buechler)
  • Target version changed from 2.3.2 to 2.4.0
  • Affected version changed from 2.3 to 2.3.x

Bumping net.inet.raw.maxdgram, net.inet.raw.recvspace, net.raw.recvspace and net.raw.sendspace even further seems to at least make it work longer before encountering this issue:
https://forum.pfsense.org/index.php?topic=109908.msg623827#msg623827
It still failed with the same error after a month, though:
https://forum.pfsense.org/index.php?topic=109908.msg629807#msg629807
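
A minimal sketch for checking the four tunables named above on a running system. The function name `report_tunables` and its `minimum` argument are illustrative assumptions, not part of pfSense; the function only reads `name: value` lines in the format that `sysctl` prints and does not change any setting.

```shell
#!/bin/sh
# Hypothetical helper: flag any of the four tunables from this ticket whose
# value is below a caller-supplied minimum. Reads "name: value" lines on stdin
# and ignores every other sysctl name.
report_tunables() {
    minimum=$1
    while IFS=': ' read -r name value; do
        case $name in
            net.raw.recvspace|net.raw.sendspace|net.inet.raw.recvspace|net.inet.raw.maxdgram)
                if [ "$value" -lt "$minimum" ]; then
                    echo "LOW $name=$value"
                else
                    echo "OK $name=$value"
                fi
                ;;
        esac
    done
}

# Example use on the firewall (16384 is the value mentioned in note #3):
#   sysctl net.raw.recvspace net.raw.sendspace \
#          net.inet.raw.recvspace net.inet.raw.maxdgram | report_tunables 16384
```

The threshold is left to the caller because this ticket does not establish a value that reliably avoids the failure.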

#11 Updated by Michael OBrien 6 months ago

Looks like there may be some progress here:
https://forum.pfsense.org/index.php?topic=109908.45

#12 Updated by Aaron Marks 6 months ago

I recommend changing this to a high-priority bug, as it impacts anyone using IPsec and BGP together, which are two ubiquitous protocols. I've worked with pfSense support and this issue is confirmed. The ticket says 2.4 is currently the target for patching this, but I'd also advise doing whatever it takes to fix this in 2.3.3.

#13 Updated by Jim Pingle 6 months ago

Anyone who can reproduce this: Try feeding the attached patch into the system patches package, which will add in the charon change mentioned on the forum post. Set path strip = 2 in the system patches package.

The patch will change the strongSwan config so it will either use $config['ipsec']['kernel_pfkey_events_buffer'] (no GUI knob, but you can set it by hand in the config) or the value of the net.inet.raw.recvspace sysctl oid.
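
The selection described above can be sketched roughly as follows. This is not the attached patch: `pick_events_buffer` is a hypothetical name, its two arguments stand in for the config value and the sysctl value, and the precedence (a hand-set config value wins when present) is an assumption based on the description.

```shell
#!/bin/sh
# Hypothetical sketch of the fallback order described in this note.
#   $1: value of $config['ipsec']['kernel_pfkey_events_buffer'] (may be empty)
#   $2: value of the net.inet.raw.recvspace sysctl
pick_events_buffer() {
    config_value=$1
    sysctl_value=$2
    if [ -n "$config_value" ]; then
        # Assumed: a value set by hand in the config takes precedence.
        echo "$config_value"
    else
        echo "$sysctl_value"
    fi
}

# Example use:
#   pick_events_buffer "" "$(sysctl -n net.inet.raw.recvspace)"
```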

#14 Updated by Per Hodneland 6 months ago

Applied the attached patch, but that only pushes the problem into the near future. It still fails after some number of days or hours. It would be nice to see a proper fix for this, as GRE + IPsec with BGP is the core of what we currently use pfSense for.

#15 Updated by Jon Hayward 5 months ago

Hey all,

Do we know exactly what causes this yet?

The reason I ask is that I have just had a 2.2.6 machine hit this (it has been kept at 2.2.6 because of this exact issue on 2.3.x).

Sep 22 19:41:44 bgpd[64711]: dispatch_imsg in main: pipe closed
Sep 22 19:41:44 bgpd[65089]: session engine exiting
Sep 22 19:41:44 bgpd[64810]: route decision engine exiting
Sep 22 19:41:44 bgpd[65089]: writev (6/80): No buffer space available
Sep 22 19:39:54 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:37:04 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:34:36 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:32:08 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:29:39 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:28:34 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:27:30 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:26:26 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:25:50 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:25:18 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:25:12 bgpd[65089]: neighbor 10.255.255.2 (Yodafone): pfkey setup failed
Sep 22 19:25:12 bgpd[65089]: writev (8/104): No buffer space available
Sep 22 19:25:11 bgpd[65089]: writev (6/80): No buffer space available
Sep 22 19:24:41 bgpd[65089]: neighbor 10.255.255.2 (Yodafone): state change Established -> Idle, reason: HoldTimer expired
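
For anyone wanting to spot this condition in their own logs, a small sketch that counts lines carrying the error string shown above. The function name `count_buffer_errors` is an illustrative assumption; it reads log text on stdin, so the log source is up to the caller.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that report the buffer-space error
# seen in this ticket. Prints 0 when there are none.
count_buffer_errors() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps the
    # function usable under "set -e" while still printing the count.
    grep -c 'No buffer space available' || true
}
```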

#16 Updated by Martin Hansen 4 months ago

I can weigh in on this: major issue.

#17 Updated by Michael OBrien 3 months ago

Has anyone attempted this with 2.4 beta? I've already burned my downtime allowance testing with 2.3.x versions and various patches, and don't have a test setup with a busy enough BGP + GRE/IPSec link to reliably repro this.

#18 Updated by Firstname Surname 3 months ago

To all having this problem - while there is no fix yet, I have put together a workaround I have been using successfully with 2.3.2 for a few months now with no issues. While it does not provide an uninterrupted service, it recovers every time. If you have redundant tunnels, chances are you will survive without issues.

The solution is:

a) increasing the pfkey buffer size as per the patch attached to this issue
b) a cron job to run the attached script (I run it every 2 minutes, but could well be every minute).

Do not ask me why this works, but I have found that it becomes possible to recover from this condition by restarting IPSec and OpenBGPd once the pfkey buffer size is increased. The script picks up any IPSec sessions with phase 1 or phase 2 down and bounces them accordingly, and restarts both openbgpd and IPSec if either all sessions are down or the buffer space error has appeared. The script is obviously very simple, so modify it accordingly, but it works for me.
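
The decision rule described above can be sketched as a small function. This is not the attached ipsecmon.sh: `decide_action`, its arguments, and the action names it prints are hypothetical, and the actual bounce/restart commands are deliberately left out.

```shell
#!/bin/sh
# Hypothetical sketch of the decision rule only.
#   $1: total number of IPsec sessions
#   $2: number of sessions with phase 1 or phase 2 down
#   $3: "yes" if the buffer space error has appeared in the logs, else "no"
decide_action() {
    total=$1
    down=$2
    buffer_error=$3
    if [ "$buffer_error" = "yes" ]; then
        echo "restart-all"            # restart both OpenBGPD and IPsec
    elif [ "$total" -gt 0 ] && [ "$down" -eq "$total" ]; then
        echo "restart-all"            # every session is down
    elif [ "$down" -gt 0 ]; then
        echo "bounce-down-sessions"   # bounce only the affected sessions
    else
        echo "none"
    fi
}
```

Keeping the decision separate from the commands that act on it makes the rule easy to check by hand before wiring it into a cron job.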

#19 Updated by James Cornman 2 months ago

I've created a small patch to the ipsecmon.sh file to actually log the output using logger, and made it a little easier to read ;)

It only logs a subset of the output displayed by the CLI command, so it doesn't clutter the log with diagnostic output.

Lastly, as a comment, I installed the cron package via the pfSense Package Manager in lieu of just using crontab via the CLI. Hopefully this will persist through minor updates until the developers get the 2.4 fix out for this problem.

#20 Updated by Jim Pingle 2 months ago

As long as you're logging things, dump the output from /usr/bin/netstat -s -ppfkey as well to see if the errors in the logs correlate to any counters there.
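
One way to do this from a monitoring script, as a rough sketch: pipe the netstat output through a small wrapper that sends each line to syslog. The function name `log_pfkey_counters`, the `ipsecmon` tag, and the `LOG_CMD` override are illustrative assumptions, not part of the attached scripts.

```shell
#!/bin/sh
# Hypothetical wrapper: send each line of stdin to syslog with a prefix, so
# counter snapshots sit next to the other monitoring entries in the log.
# LOG_CMD can be overridden to try the function without writing to syslog.
log_pfkey_counters() {
    while IFS= read -r line; do
        # Intentionally unquoted so the default splits into command + args.
        ${LOG_CMD:-logger -t ipsecmon} "pfkey: $line"
    done
}

# Example use on the firewall:
#   /usr/bin/netstat -s -ppfkey | log_pfkey_counters
```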

#21 Updated by Jim Thompson about 2 months ago

  • Assignee set to Matthew Smith

#22 Updated by Frans Gidlöf 12 days ago

In 2.4 it flaps constantly: roughly every 40 seconds, though the interval varies.

startup
rereading config
route decision engine ready
new ktable rdomain_0 for rtableid 0
RDE reconfigured
session engine ready
listening on 169.254.41.78
SE reconfigured
neighbor 169.254.41.77 (VPC): state change None -> Idle, reason: None
neighbor 169.254.41.77 (VPC): state change Idle -> Connect, reason: Start
neighbor 169.254.41.77 (VPC): state change Connect -> OpenSent, reason: Connection opened
neighbor 169.254.41.77 (VPC): state change OpenSent -> OpenConfirm, reason: OPEN message received
neighbor 169.254.41.77 (VPC): state change OpenConfirm -> Established, reason: KEEPALIVE message received
nexthop 169.254.41.77 now valid: via XXX.XXX.XXX.XXX
Traffic stops here
neighbor 169.254.41.77 (VPC): write error: Permission denied
neighbor 169.254.41.77 (VPC): state change Established -> Idle, reason: Fatal error
neighbor 169.254.41.77 (VPC): state change Idle -> Connect, reason: Start
neighbor 169.254.41.77 (VPC): state change Connect -> OpenSent, reason: Connection opened
neighbor 169.254.41.77 (VPC): state change OpenSent -> OpenConfirm, reason: OPEN message received
neighbor 169.254.41.77 (VPC): state change OpenConfirm -> Established, reason: KEEPALIVE message received
nexthop 169.254.41.77 now valid: via XXX.XXX.XXX.XXX

This is against an AWS VPC with dynamic routing via OpenBGPD. It is more stable in 2.3, but it is the same issue as in the rest of this thread/bug.

Can provide a testing environment if needed.

#23 Updated by Wade Blackwell 4 days ago

I'm also seeing this issue over OpenVPN site-to-site tunnels with static keys on 2.3.2-RELEASE-p1 (i386). The remote sites are running 2.3.2-RELEASE-p1 (amd64), but the issue appears to originate at the core site, which is running the i386 version. I can provide debug output and configs if needed.

Feb 20 20:02:21 bgpd 72490 neighbor 172.39.0.14 (Bonney Lake Campus): state change Idle -> Connect, reason: Start
Feb 20 20:02:21 bgpd 72490 neighbor 172.39.0.14 (Bonney Lake Campus): state change Established -> Idle, reason: Fatal error
Feb 20 20:02:21 bgpd 72490 neighbor 172.39.0.14 (Bonney Lake Campus): graceful restart of IPv4 unicast, keeping routes
Feb 20 20:02:21 bgpd 72490 neighbor 172.39.0.14 (Bonney Lake Campus): write error: Permission denied
