Bug #13569

open

Restarting an OpenVPN server running on a CARP VIP in an HA cluster can disrupt unrelated TCP states

Added by Azamat Khakimyanov about 2 years ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
CARP
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Release Notes:
Default
Affected Plus Version:
22.05
Affected Architecture:

Description

Our customer (Ticket #1161128024) pointed out a possible problem with HA clusters and TCP streams. During troubleshooting, the customer found that running an OpenVPN server on a VIP (the WAN CARP VIP, or an IP Alias bundled with the WAN CARP VIP) causes this issue: during failover, all TCP streams break.

I was able to reproduce this issue:
- HA cluster
- OpenVPN Server with WAN CARP VIP as an Interface
- a download in progress (FreeBSD image) and a TCP stream (VLC with Network Stream: http://webcam.rhein-taunus-krematorium.de/mjpg/video.mjpg) running on an internal host
- active RA OpenVPN connection from an external host

Putting the Primary into Persistent CARP Maintenance mode broke both the download and the TCP stream:
- System log on Primary
Oct 17 08:08:24 check_reload_status 389 Reloading filter
Oct 17 08:08:24 kernel ovpns1: link state changed to DOWN
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: XMLRPC reload data success with https://10.10.99.2:443/xmlrpc.php (pfsense.restore_config_section).
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: Beginning XMLRPC sync data to https://10.10.99.2:443/xmlrpc.php.
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: XMLRPC versioncheck: 22.7 -- 22.7
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: XMLRPC reload data success with https://10.10.99.2:443/xmlrpc.php (pfsense.host_firmware_version).
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: Beginning XMLRPC sync data to https://10.10.99.2:443/xmlrpc.php.
Oct 17 08:08:21 check_reload_status 389 Carp backup event
Oct 17 08:08:21 check_reload_status 389 Carp backup event
Oct 17 08:08:21 kernel carp: 3@vtnet1: MASTER -> BACKUP (more frequent advertisement received)
Oct 17 08:08:21 kernel carp: 2@vtnet0: MASTER -> BACKUP (more frequent advertisement received)
Oct 17 08:08:21 kernel carp: 4@vtnet2: MASTER -> BACKUP (more frequent advertisement received)
Oct 17 08:08:21 check_reload_status 389 Carp backup event
Oct 17 08:08:20 check_reload_status 389 Syncing firewall
Oct 17 08:08:20 php-fpm 5328 /status_carp.php: Configuration Change: (Local Database): Enter CARP maintenance mode

- System log on Secondary node
Oct 17 08:08:25 php-fpm 359 /rc.start_packages: Restarting/Starting all packages.
Oct 17 08:08:24 check_reload_status 389 Starting packages
Oct 17 08:08:24 check_reload_status 389 Reloading filter
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - -> 172.27.240.1 - Restarting packages.
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: rc.newwanip called with empty interface.
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: rc.newwanip: on (IP address: 172.27.240.1) (interface: []) (real interface: ovpns1).
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: rc.newwanip: Info: starting on ovpns1.
Oct 17 08:08:23 check_reload_status 389 rc.newwanip starting ovpns1
Oct 17 08:08:23 check_reload_status 389 Reloading filter
Oct 17 08:08:23 kernel ovpns1: link state changed to UP
Oct 17 08:08:21 check_reload_status 389 Reloading filter
Oct 17 08:08:21 check_reload_status 389 Syncing firewall
Oct 17 08:08:21 php-fpm 62370 /xmlrpc.php: Configuration Change: (system)@10.10.99.1: Merged in config (staticroutes, gateways, virtualip, system, hasync, aliases, ca, cert, crl, dhcpd, dnshaper, filter, ipsec, nat, openvpn, schedules, shaper, unbound, wol sections) from XMLRPC client.
Oct 17 08:08:21 check_reload_status 389 Carp master event
Oct 17 08:08:21 check_reload_status 389 Carp master event
Oct 17 08:08:21 kernel arp: 10.10.130.1 moved from 00:00:5e:00:01:04 to 52:54:00:33:21:c0 on vtnet2
Oct 17 08:08:21 kernel carp: 2@vtnet0: BACKUP -> MASTER (preempting a slower master)
Oct 17 08:08:21 kernel carp: 3@vtnet1: BACKUP -> MASTER (preempting a slower master)
Oct 17 08:08:21 kernel carp: 4@vtnet2: BACKUP -> MASTER (preempting a slower master)
Oct 17 08:08:21 check_reload_status 389 Carp master event

Actions #1

Updated by Azamat Khakimyanov about 2 years ago

Forgot to add: without OpenVPN running on a VIP, or even with OpenVPN running on WAN, there is no problem with TCP streams during the HA failover test.

Actions #2

Updated by Chris Linstruth about 2 years ago

Verified. Running OpenVPN server bound to Localhost and port forwarding an IP Alias/CARP VIP to it looks like a reasonable workaround for now.

The state vanishes from the cluster member being failed to.

It remains active on the other node so there’s some sort of breakdown in pfsync there too.

The state will hang around on the original node since the app will die and it’ll just sit there in ESTABLISHED:ESTABLISHED for 24 hours.

At least in this test case. VLC gives up pretty quickly.

So: 1. The state shouldn’t be killed. 2. If it is killed, it should be deleted on the original node by pfsync.

Actions #3

Updated by Chris Linstruth about 2 years ago

  • Subject changed from OpenVPN running on VIP on HA cluster causes duscruption of TCP streams during failover test to OpenVPN running on VIP on HA cluster causes disruption of TCP streams during failover
Actions #4

Updated by Jim Pingle about 2 years ago

  • Subject changed from OpenVPN running on VIP on HA cluster causes disruption of TCP streams during failover to Restarting an OpenVPN server running on a CARP VIP in an HA cluster can disrupt unrelated TCP states

A few points here after working with cjl a bit trying to narrow it down:

  • The states that disappear are not directly related to the VPN (not traffic to/from the VPN or going through the VPN)
  • There are no log messages on either node indicating that states are being cleared, so none of the obvious code paths are being hit where it might kill or flush states
  • It also happens in certain cases when saving the VPN server, not just during a CARP transition
  • None of the options are enabled to kill states on gateway failure or IP address change
  • There are no logged auth failures / neither system is in the sshlockout table

Adjusted the subject to more closely align with the observed behavior

Actions #5

Updated by Marcos M about 2 years ago

Additional notes while working with cjl:
Commenting out the line /sbin/pfctl -i $1 -Fs in /usr/local/sbin/ovpn-linkdown "fixes" the issue. However, the script should not be getting called unless the service has already started and stopped; in the case of a failover to the secondary, the service should not be running while in BACKUP status, hence the script shouldn't be executed.
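Since no log messages show states being cleared, one way to confirm when this flush actually runs is to log it. The following is only a sketch: the function name `flush_states_logged` and the `DRY_RUN` variable are hypothetical and not part of the shipped ovpn-linkdown script; only the `/sbin/pfctl -i $1 -Fs` command comes from the note above.

```shell
#!/bin/sh
# Hypothetical wrapper for the flush in ovpn-linkdown (assumption, not pfSense code).
# It records each flush in syslog so it can be correlated with CARP events.
# With DRY_RUN=1 it only prints the command it would run.
flush_states_logged() {
    iface="$1"
    cmd="/sbin/pfctl -i ${iface} -Fs"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"
        return 0
    fi
    # logger -t tags the syslog entry so it is easy to grep for later.
    logger -t ovpn-linkdown "flushing states on ${iface}: ${cmd}"
    $cmd
}
```

Calling `DRY_RUN=1 flush_states_logged ovpns1` prints the command without touching the state table, which is safer when experimenting on a production node.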

Actions #6

Updated by Florian Apolloner over 1 year ago

Hi there, I think I am seeing the same issue (on 23.05). I also do have OpenVPN on CARP IPs as of now (though openvpn might not be the only factor causing this). I'll add as much detail here as possible, let me know what you think.

My setup is like this:

client(192.168.11.43) -- (192.168.11.2 on vtnet2.511) pfsense1 (10.7.200.2 on vtnet2.192) -- target(10.7.200.12)
                       | (192.168.11.3 on vtnet2.511) pfsense2 (10.7.200.3 on vtnet2.192) |

So a pretty normal HA-Cluster. Client & Target are both in local networks, OpenVPN is not involved in the connections but is running. During a failover I see no errors anywhere but an SSH session from client to target hangs. Associated with that I see the following blocks in the filter log:

Jun  9 14:28:57 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.511,match,block,in,4,0x48,,64,41324,0,DF,6,tcp,124,192.168.11.43,10.7.200.12,51568,22,72,PA,1544295413:1544295485,73422640,501,,nop;nop;TS
Jun  9 14:30:07 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44921,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:08 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44922,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:08 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44923,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:08 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44924,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:09 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44925,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:11 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44926,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS

This indicates to me that pfSense2 doesn't have the required states or is not able to match them up properly.
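To make block entries like the ones above easier to scan, they can be reduced to interface, direction, endpoints and TCP flags. This is a minimal sketch: the function name `summarize_blocked_tcp` is made up for illustration, and the field positions are an assumption based on the IPv4/TCP layout of the lines shown here.

```shell
#!/bin/sh
# Summarize blocked IPv4 TCP packets from filterlog lines read on stdin.
# Field numbers are assumed from the log excerpt above:
#   5=interface 7=action 8=direction 9=IP version 17=protocol
#   19=source 20=destination 21=source port 22=destination port 24=TCP flags
summarize_blocked_tcp() {
    # Strip the syslog prefix, keeping only the comma-separated payload.
    sed -n 's/^.*filterlog\[[0-9]*]: //p' |
    awk -F, '$7 == "block" && $9 == "4" && $17 == "tcp" {
        printf "%s %s %s:%s -> %s:%s flags=%s\n", $5, $8, $19, $21, $20, $22, $24
    }'
}
```

Piping the result through `sort | uniq -c` would then show how often each flow was blocked.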

On the other hand, grepping for those states properly shows them on both firewalls:

pfctl -s states | grep '10.7.200.12:22'
all tcp 10.7.200.12:22 <- 10.7.22.120:51714       ESTABLISHED:ESTABLISHED
all tcp 10.7.22.120:51714 -> 10.7.200.12:22       ESTABLISHED:ESTABLISHED

Actions #7

Updated by Florian Apolloner over 1 year ago

Marcos M wrote in #note-5:

Additional notes while working with cjl:
Commenting out the line /sbin/pfctl -i $1 -Fs in /usr/local/sbin/ovpn-linkdown "fixes" the issue. However, the script should not be getting called unless the service has already started and stopped; in the case of a failover to the secondary, the service should not be running while in BACKUP status, hence the script shouldn't be executed.

But this would be executed on the master during failover, no? And whatever that does to states would get synced to the secondary?

As an extra note: if I fail back to the primary again, the SSH connection resumes working…

Actions #8

Updated by Florian Apolloner over 1 year ago

I am able to reproduce the issue and I can also confirm that the issue is gone if I comment out /sbin/pfctl -i $1 -Fs. I can also confirm that the states actually vanish from the second firewall during the failover (so something must be actively deleting them?). Interestingly enough if I manually execute /sbin/pfctl -i ovpns1 -Fs it says 0 states cleared and nothing happens on the secondary firewall :/ I am somewhat out of ideas.

Actions #9

Updated by Florian Apolloner over 1 year ago

Debugging even further, this seems to be timing sensitive. If I run pfctl -i ovpns1 -Fs && pfSctl -c 'filter reload all' on the primary, then for a short time the state is missing on the standby. So maybe something weird happens between the CARP events, the filter reload and the flush on ovpns1. Is there a way to monitor state table changes?
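One crude way to watch for vanishing states is to save `pfctl -s states` output before and after the test and compare the two snapshots. The sketch below assumes that approach; the function name `missing_states` and the snapshot file paths are hypothetical. A state whose state column changes between snapshots would also be reported, so the output needs a manual look.

```shell
#!/bin/sh
# Print state lines present in the first snapshot but absent from the second.
# Each argument is a file holding saved `pfctl -s states` output, e.g.:
#   pfctl -s states > /tmp/states.before
#   ... trigger the failover or the flush ...
#   pfctl -s states > /tmp/states.after
#   missing_states /tmp/states.before /tmp/states.after
missing_states() {
    before="$1"
    after="$2"
    # comm needs sorted input, so sort both snapshots into temporary copies.
    sort "$before" > "${before}.sorted"
    sort "$after" > "${after}.sorted"
    # -23 keeps only the lines unique to the first file.
    comm -23 "${before}.sorted" "${after}.sorted"
    rm -f "${before}.sorted" "${after}.sorted"
}
```

Because the disappearance appears to be short-lived, the second snapshot would have to be taken on the standby right after the flush and reload, or in a tight loop.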

Actions #10

Updated by Marcos M over 1 year ago

Actions #11

Updated by Florian Apolloner over 1 year ago

I don't think those two are related.

Actions #12

Updated by Sebastiano Degan about 1 year ago

Same issue on pfSense 2.7. I confirm that commenting out the line /sbin/pfctl -i $1 -Fs works.
