Bug #6223

closed

IPsec + OpenBGPD fails with "PF_KEY socket: No buffer space available"

Added by sht head over 8 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
IPsec
Target version:
-
Start date:
04/21/2016
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Release Notes:
Affected Version:
2.3.x
Affected Architecture:

Description

When using pfSense 2.3 with the OpenBGPD package, IPsec tunnels drop out and never reconnect until the server is rebooted. Killing the IPsec processes and starting the service again does not fix the issue; the only fix I have found is a reboot. Removing the OpenBGPD package fixes the problem.

There is discussion on the forum about this issue here: https://forum.pfsense.org/index.php?topic=109908


Files

charon-pfkey-event-buffer.patch (783 Bytes), Jim Pingle, 09/06/2016 03:17 PM
ipsecmon.sh (1.63 KB), PFSense ipsec bug monitoring + bounce script, Firstname Surname, 11/28/2016 08:41 AM
ipsecmon-jc.diff (4.39 KB), James Cornman, 12/13/2016 07:51 AM
ipsecmon.sh (2.98 KB), James Cornman, 12/13/2016 07:52 AM
Actions #1

Updated by Chris Buechler over 8 years ago

  • Subject changed from IPSEC with OpenBGPD Package to IPsec + OpenBGPD fails with "PF_KEY socket: No buffer space available"
  • Status changed from New to Confirmed
  • Assignee set to Chris Buechler

I'm looking into a good repeatable test case for this.

Actions #2

Updated by Michael van der Weg over 8 years ago

Chris Buechler wrote:

I'm looking into a good repeatable test case for this.

Hi, I'm affected by this bug too. I can provide a test IPsec endpoint using an AWS VPC and the required configuration for the pfSense side. If you wish, I can also provide a configured pfSense box with root access that is affected by the issue.

Actions #3

Updated by Chris Buechler over 8 years ago

  • Target version set to 2.3.1

setting net.raw.recvspace=16384, twice the default, has been confirmed to fix this and one other unrelated IPsec failure where there was the same PF_KEY socket error.

Actions #4

Updated by sht head over 8 years ago

Chris Buechler wrote:

setting net.raw.recvspace=16384, twice the default, has been confirmed to fix this and one other unrelated IPsec failure where there was the same PF_KEY socket error.

I upgraded my standby server to 2.3 again, set this in System Tunables, and then rebooted the server.

It's been about an hour and it looks like my tunnels have all started to drop off again one by one, so this has not fixed it for me.

Actions #5

Updated by Chris Buechler over 8 years ago

  • Status changed from Confirmed to Feedback

I set it as committed on "shthead"'s system and it seems to be fine.

Actions #6

Updated by Michael OBrien over 8 years ago

Chris Buechler wrote:

I set it as committed on "shthead"'s system and it seems to be fine.

Still having this issue (running OpenBGPd + IPSec - transport phase 2 with GRE tunnels) after changing tunable to 16384, then 131072 per another recommendation online.

More:
https://forum.pfsense.org/index.php?topic=109908.30

Actions #7

Updated by Chris Buechler over 8 years ago

Michael OBrien wrote:

Still having this issue (running OpenBGPd + IPSec - transport phase 2 with GRE tunnels) after changing tunable to 16384, then 131072 per another recommendation online.

That's not as committed here. Set all 4 as done in the commits here, or upgrade to 2.3.1.

Actions #8

Updated by Chris Buechler over 8 years ago

  • Status changed from Feedback to Confirmed
  • Target version changed from 2.3.1 to 2.3.2

Those changes helped in some instances of this, but they definitely don't fix the problem for everyone.

Actions #9

Updated by Michael OBrien about 8 years ago

Chris Buechler wrote:

Michael OBrien wrote:

Still having this issue (running OpenBGPd + IPSec - transport phase 2 with GRE tunnels) after changing tunable to 16384, then 131072 per another recommendation online.

That's not as committed here. Set all 4 as done in the commits here, or upgrade to 2.3.1.

Same issue after upgrading to 2.3.1_5. Any idea whether this will be resolved in 2.3.2 or 2.4.x (FreeBSD 11, right?)?

Actions #10

Updated by Chris Buechler about 8 years ago

  • Assignee deleted (Chris Buechler)
  • Target version changed from 2.3.2 to 2.4.0
  • Affected Version changed from 2.3 to 2.3.x

Bumping net.inet.raw.maxdgram, net.inet.raw.recvspace, net.raw.recvspace and net.raw.sendspace even further seems to at least make it work longer before encountering this issue:
https://forum.pfsense.org/index.php?topic=109908.msg623827#msg623827
But it still failed with the same error after a month:
https://forum.pfsense.org/index.php?topic=109908.msg629807#msg629807
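
For anyone comparing their own settings against the four tunables named in this comment, here is a minimal Python sketch. It assumes the usual FreeBSD `sysctl` output format of one `name: value` pair per line. The function name, the `target` threshold, and the sample values are illustrative only and are not taken from the commits.

```python
# The four OIDs named in this comment.
PFKEY_TUNABLES = (
    "net.inet.raw.maxdgram",
    "net.inet.raw.recvspace",
    "net.raw.recvspace",
    "net.raw.sendspace",
)


def tunables_below(sysctl_output, target):
    """Return {oid: value} for the listed OIDs whose value is below target.

    sysctl_output is text in "name: value" form, such as the output of
    running sysctl with the four OID names as arguments.
    OIDs missing from the output are reported with value None.
    """
    seen = {}
    for line in sysctl_output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in PFKEY_TUNABLES:
            seen[name.strip()] = int(value.strip())
    low = {}
    for oid in PFKEY_TUNABLES:
        value = seen.get(oid)
        if value is None or value < target:
            low[oid] = value
    return low


# Hypothetical system state, for illustration only.
sample = """net.inet.raw.maxdgram: 9216
net.inet.raw.recvspace: 9216
net.raw.recvspace: 16384
net.raw.sendspace: 8192
"""
print(tunables_below(sample, target=16384))
```

As the reports in this thread show, raising these values delayed the failure but did not prevent it, so a check like this is only useful for confirming that a setting was actually applied.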

Actions #11

Updated by Michael OBrien about 8 years ago

Looks like there may be some progress here:
https://forum.pfsense.org/index.php?topic=109908.45

Actions #12

Updated by Aaron Marks about 8 years ago

I recommend changing this to a high-priority bug, as it impacts anyone using IPsec and BGP together, which are two ubiquitous protocols. I've worked with pfSense support and this issue is confirmed. It says that 2.4 is the current target for patching this, but I'd also advise doing whatever it takes to fix this in 2.3.3.

Actions #13

Updated by Jim Pingle about 8 years ago

Anyone who can reproduce this: Try feeding the attached patch into the system patches package, which will add in the charon change mentioned on the forum post. Set path strip = 2 in the system patches package.

The patch will change the strongSwan config so it will either use $config['ipsec']['kernel_pfkey_events_buffer'] (no GUI knob, but you can set it by hand in the config) or the value of the net.inet.raw.recvspace sysctl oid.
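
The fallback described above can be sketched roughly as follows. This is Python for illustration only: the attached patch itself is not reproduced here, and `pick_events_buffer` and its arguments are hypothetical names.

```python
def pick_events_buffer(config, recvspace_sysctl):
    """Choose the PF_KEY events buffer size.

    Prefer config['ipsec']['kernel_pfkey_events_buffer'] when it is set
    to a non-empty value; otherwise fall back to the current value of
    the net.inet.raw.recvspace sysctl, which the caller passes in.
    """
    configured = config.get("ipsec", {}).get("kernel_pfkey_events_buffer")
    if configured not in (None, ""):
        return int(configured)
    return int(recvspace_sysctl)


# Hand-set config value wins (the number here is made up).
print(pick_events_buffer({"ipsec": {"kernel_pfkey_events_buffer": "262144"}}, 9216))
# No config value: the sysctl value is used.
print(pick_events_buffer({"ipsec": {}}, 9216))
```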

Actions #14

Updated by Per Hodneland almost 8 years ago

Applied the attached patch, but that only postpones the problem. It still fails after some number of hours or days. It would be nice to see a proper fix for this, as GRE + IPsec with BGP is the core of what we currently use pfSense for.

Actions #15

Updated by Jon Hayward almost 8 years ago

Hey all,

Do we know exactly what causes this yet?

The reason I ask is that I have just had a 2.2.6 machine hit this (it has been kept at 2.2.6 because of this exact issue on 2.3.x):

Sep 22 19:41:44 bgpd[64711]: dispatch_imsg in main: pipe closed
Sep 22 19:41:44 bgpd[65089]: session engine exiting
Sep 22 19:41:44 bgpd[64810]: route decision engine exiting
Sep 22 19:41:44 bgpd[65089]: writev (6/80): No buffer space available
Sep 22 19:39:54 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:37:04 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:34:36 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:32:08 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:29:39 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:28:34 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:27:30 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:26:26 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:25:50 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:25:18 bgpd[65089]: Connection attempt from neighbor 10.255.255.2 (Yodafone) while session is in state Idle
Sep 22 19:25:12 bgpd[65089]: neighbor 10.255.255.2 (Yodafone): pfkey setup failed
Sep 22 19:25:12 bgpd[65089]: writev (8/104): No buffer space available
Sep 22 19:25:11 bgpd[65089]: writev (6/80): No buffer space available
Sep 22 19:24:41 bgpd[65089]: neighbor 10.255.255.2 (Yodafone): state change Established -> Idle, reason: HoldTimer expired

Actions #16

Updated by Martin Hansen almost 8 years ago

I can weigh in on this: it is a major issue.

Actions #17

Updated by Michael OBrien almost 8 years ago

Has anyone attempted this with 2.4 beta? I've already burned my downtime allowance testing with 2.3.x versions and various patches, and don't have a test setup with a busy enough BGP + GRE/IPSec link to reliably repro this.

Actions #18

Updated by Firstname Surname almost 8 years ago

To all having this problem - while there is no fix yet, I have put together a workaround I have been using successfully with 2.3.2 for a few months now with no issues. While it does not provide an uninterrupted service, it recovers every time. If you have redundant tunnels, chances are you will survive without issues.

The solution is:

a) increasing the pfkey buffer size as per the patch attached to this issue
b) a cron job to run the attached script (I run it every 2 minutes, but could well be every minute).

Do not ask me why this works, but I have found that it becomes possible to recover from this condition by restarting IPsec and OpenBGPD once the pfkey buffer size is increased. The script picks up any IPsec sessions with phase 1 or phase 2 down and bounces them accordingly, and restarts both OpenBGPD and IPsec if either all sessions are down or the buffer space error has appeared. The script is obviously very simple, so modify it as needed, but it works for me.
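
The decision logic described in this comment can be sketched as below. This is not the attached ipsecmon.sh, which is a shell script and is not reproduced here. It is a Python illustration with hypothetical names, and it assumes that a session counts as down when either phase is down.

```python
BUFFER_ERROR = "No buffer space available"


def plan_recovery(sessions, recent_log_lines):
    """Decide what to restart.

    sessions maps a connection name to (phase1_up, phase2_up).
    Returns ("restart_all", []) when every session is down or the
    buffer-space error appears in the log lines; otherwise returns
    ("bounce", names) listing the sessions with either phase down.
    """
    error_seen = any(BUFFER_ERROR in line for line in recent_log_lines)
    down = sorted(name for name, (p1, p2) in sessions.items()
                  if not (p1 and p2))
    all_down = bool(sessions) and len(down) == len(sessions)
    if error_seen or all_down:
        return ("restart_all", [])
    return ("bounce", down)


# One of two sessions has phase 2 down: bounce only that session.
print(plan_recovery({"con1000": (True, True), "con2000": (True, False)}, []))
```

Keeping the decision separate from the commands that carry it out makes it possible to dry-run the logic against saved status output before wiring it into cron.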

Actions #19

Updated by James Cornman over 7 years ago

I've created a little patch to the ipsecmon.sh file to actually log the output using logger, and made it a little easier to read ;)

It will only log a subset of the output that is displayed by the CLI command, so it doesn't clutter the log with diagnostic output.

Lastly, as a comment, I installed the cron package via the pfSense Package Manager in lieu of just using crontab via the CLI. Hopefully this will persist through minor updates until the developers get the 2.4 fix out for this problem.

Actions #20

Updated by Jim Pingle over 7 years ago

As long as you're logging things, dump the output from /usr/bin/netstat -s -ppfkey as well to see if the errors in the logs correlate to any counters there.

Actions #21

Updated by Jim Thompson over 7 years ago

  • Assignee set to Matthew Smith
Actions #22

Updated by Frans Gidlöf over 7 years ago

In 2.4 it flaps constantly... I mean every 40 seconds or so, but it varies

startup
rereading config
route decision engine ready
new ktable rdomain_0 for rtableid 0
RDE reconfigured
session engine ready
listening on 169.254.41.78
SE reconfigured
neighbor 169.254.41.77 (VPC): state change None -> Idle, reason: None
neighbor 169.254.41.77 (VPC): state change Idle -> Connect, reason: Start
neighbor 169.254.41.77 (VPC): state change Connect -> OpenSent, reason: Connection opened
neighbor 169.254.41.77 (VPC): state change OpenSent -> OpenConfirm, reason: OPEN message received
neighbor 169.254.41.77 (VPC): state change OpenConfirm -> Established, reason: KEEPALIVE message received
nexthop 169.254.41.77 now valid: via XXX.XXX.XXX.XXX
Traffic stops here
neighbor 169.254.41.77 (VPC): write error: Permission denied
neighbor 169.254.41.77 (VPC): state change Established -> Idle, reason: Fatal error
neighbor 169.254.41.77 (VPC): state change Idle -> Connect, reason: Start
neighbor 169.254.41.77 (VPC): state change Connect -> OpenSent, reason: Connection opened
neighbor 169.254.41.77 (VPC): state change OpenSent -> OpenConfirm, reason: OPEN message received
neighbor 169.254.41.77 (VPC): state change OpenConfirm -> Established, reason: KEEPALIVE message received
nexthop 169.254.41.77 now valid: via XXX.XXX.XXX.XXX

Against AWS VPC with dynamic routing via OpenBGPD, more stable in 2.3 but the same issue as the rest of this thread/bug.

Can provide a testing environment if needed.

Actions #23

Updated by Wade Blackwell over 7 years ago

I'm also seeing this issue over OpenVPN site-to-site tunnels with static keys on 2.3.2-RELEASE-p1 (i386). The remote sites are running 2.3.2-RELEASE-p1 (amd64), but the issue appears to originate on the core site, which is running the i386 version. I can provide debugs and configs if needed.

Feb 20 20:02:21 bgpd 72490 neighbor 172.39.0.14 (Bonney Lake Campus): state change Idle -> Connect, reason: Start
Feb 20 20:02:21 bgpd 72490 neighbor 172.39.0.14 (Bonney Lake Campus): state change Established -> Idle, reason: Fatal error
Feb 20 20:02:21 bgpd 72490 neighbor 172.39.0.14 (Bonney Lake Campus): graceful restart of IPv4 unicast, keeping routes
Feb 20 20:02:21 bgpd 72490 neighbor 172.39.0.14 (Bonney Lake Campus): write error: Permission denied

Actions #24

Updated by Jim Thompson over 7 years ago

  • Assignee changed from Matthew Smith to Luiz Souza
Actions #25

Updated by Michael OBrien over 7 years ago

Has anyone been able to test this with 2.4? Unfortunately I don't have a good test environment with IPSEC + BGP.

Actions #26

Updated by josue escalante about 7 years ago

Any progress on this?

Actions #27

Updated by Jim Pingle about 7 years ago

Only in that we're making progress on replacing OpenBGPD with FRR, which hopefully will not suffer from the same issue(s).

It is worth testing on 2.4 as well to see if the newer base OS helps.

Actions #28

Updated by Michael OBrien about 7 years ago

Jim Pingle wrote:

Only in that we're making progress on replacing OpenBGPD with FRR

Well that's exciting! I assume this is a super long-term thing?

Actions #29

Updated by Luiz Souza about 7 years ago

  • Target version changed from 2.4.0 to 2.4.1
Actions #30

Updated by Jim Pingle almost 7 years ago

FYI- FRR is now available for 2.4, 2.3.5 (snapshots), and 2.3.4 users. Internal tests show that it does not suffer from this problem.

If the problem is specific to OpenBGPD then replacing OpenBGPD with FRR seems to be the better path forward at the moment.

Actions #31

Updated by Jim Pingle almost 7 years ago

  • Target version changed from 2.4.1 to 2.4.2
Actions #32

Updated by Michael OBrien almost 7 years ago

Jim Pingle wrote:

FYI- FRR is now available for 2.4, 2.3.5 (snapshots), and 2.3.4 users. Internal tests show that it does not suffer from this problem.

If the problem is specific to OpenBGPD then replacing OpenBGPD with FRR seems to be the better path forward at the moment.

I suspect I'll have a test case for this in the next few weeks, implementing BGP with a big carrier for private mobile network using 2.4.0 with frr. Is there a reason you're moving this to 2.4.2, or you just need confirmation that it's good to go?

Actions #33

Updated by Jim Pingle almost 7 years ago

Michael OBrien wrote:

Is there a reason you're moving this to 2.4.2, or you just need confirmation that it's good to go?

We would like to see the original problem fixed as well, if possible. The workaround (FRR) is better, but we don't necessarily want to consider the matter closed entirely yet. Confirmation also helps. As long as there is a viable workaround, it doesn't hurt for this issue to remain open so we can keep an eye on it.

In addition to FRR, FreeBSD 11.1 has some significant changes to the IPsec stack, so it's worth re-testing the original bug there.

Actions #34

Updated by Jim Pingle almost 7 years ago

  • Target version changed from 2.4.2 to 2.4.3
Actions #35

Updated by Andrew Wasilczuk almost 7 years ago

I can confirm that this is still an issue on 2.4.0

Switching to FRR solved this for me.

Actions #36

Updated by Mitch Claborn almost 7 years ago

What is the process for switching to FRR? Do I just install the FRR package or is there more to it?

Actions #37

Updated by Jim Pingle almost 7 years ago

Mitch Claborn wrote:

What is the process for switching to FRR? Do I just install the FRR package or is there more to it?

That's more of a topic for the forum. tl;dr is that it's a completely separate package. Remove OpenBGPD, install FRR, configure FRR for BGP. If you need more help, follow up on the forum.

Actions #38

Updated by Jim Pingle over 6 years ago

  • Status changed from Confirmed to Closed

It's still broken with FreeBSD 11.x and OpenBGPD and it's unclear if that combination will be fixed upstream.

If you need BGP with IPsec, remove the OpenBGPD packages and install FRR instead. FRR+IPsec has been confirmed to work fine by multiple sources.

Actions #39

Updated by Jim Pingle over 6 years ago

  • Target version deleted (2.4.3)
Actions #40

Updated by xavier Lemaire over 6 years ago

Just upgraded to 2.4.3-RELEASE (amd64) built on Mon Mar 26 18:02:04 CDT 2018.

I have OpenBGPD (OK, I'll move to FRR one of these nights) and I have CARP.

After about 14 hours, IPsec failed:
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> error sending to PF_KEY socket: No buffer space available
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> unable to delete SAD entry with SPI c3fe38a9
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> deleting SPI allocation SA failed
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> error sending to PF_KEY socket: No buffer space available
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> unable to add SAD entry with SPI c3fe38a9
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> error sending to PF_KEY socket: No buffer space available
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> unable to add SAD entry with SPI c87d469d
Apr 3 04:31:22 46.28.168.123 charon: 07[IKE] <con1000|6> unable to install inbound and outbound IPsec SA (SAD) in kernel
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> error sending to PF_KEY socket: No buffer space available
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> unable to delete SAD entry with SPI c3fe38a9
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> error sending to PF_KEY socket: No buffer space available
Apr 3 04:31:22 46.28.168.123 charon: 07[KNL] <con1000|6> unable to delete SAD entry with SPI c87d469d
Apr 3 04:31:22 46.28.168.123 charon: 07[IKE] <con1000|6> sending DELETE for ESP CHILD_SA with SPI c87d469d
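
For anyone sifting through logs like the one above, a small Python sketch can count the PF_KEY buffer errors and collect the SPIs named in the SAD failures. The sample lines are shortened copies of the charon messages in this comment. The function name is hypothetical, and the regular expression assumes the "SAD entry with SPI <hex>" wording shown here.

```python
import re

PFKEY_ERROR = "error sending to PF_KEY socket: No buffer space available"
SPI_RE = re.compile(r"SAD entry with SPI ([0-9a-f]+)")


def summarize(lines):
    """Count PF_KEY buffer errors and collect SPIs from SAD failure lines."""
    errors = sum(1 for line in lines if PFKEY_ERROR in line)
    spis = set()
    for line in lines:
        match = SPI_RE.search(line)
        if match:
            spis.add(match.group(1))
    return errors, sorted(spis)


log = [
    "charon: 07[KNL] <con1000|6> error sending to PF_KEY socket: No buffer space available",
    "charon: 07[KNL] <con1000|6> unable to delete SAD entry with SPI c3fe38a9",
    "charon: 07[KNL] <con1000|6> error sending to PF_KEY socket: No buffer space available",
    "charon: 07[KNL] <con1000|6> unable to add SAD entry with SPI c87d469d",
]
print(summarize(log))
```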

Actions #41

Updated by xavier Lemaire over 6 years ago

Just finished migrating to FRRouting.

IPv4 is OK, but IPv6 was a bad dream... fortunately there is a great thing called vtysh.
For those who come this way, I advise you to look at the neighbor 2001 option: neighbor xxx:xxx::xxx next-hop-self force

Second piece of advice: be patient. With 22 peers, the routing tables take an astronomical amount of time to update, with a bgpd process that burns a core for very, very long minutes... It is clear that my CPU must be a fucking wheelbarrow: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz 16 CPUs: 1 package(s) x 8 core(s) x 2 hardware threads ...

nice job thx

Actions #42

Updated by Roman H about 6 years ago

Bump.
The issue still persists.
I installed OpenBGPD to connect pfSense to AWS via BGP, and I also have an IPsec IKEv2 tunnel to my home site. It is losing P2 connections after ~24 hours.
I began to search and found this bug in the tracker.
I tried increasing net.raw.recvspace to 16384 and 32768: no help, it still drops after some time.

I will also migrate to FRR, but this definitely should be fixed.
