Bug #13569: Restarting an OpenVPN server running on a CARP VIP in an HA cluster can disrupt unrelated TCP states - pfSense Plus - pfSense bugtracker

Bug #13569

open

Restarting an OpenVPN server running on a CARP VIP in an HA cluster can disrupt unrelated TCP states

Added by Azamat Khakimyanov about 3 years ago. Updated 3 months ago.

Status:

New

Priority:

Normal

Assignee:

Category:

CARP

Target version:

Start date:

Due date:

% Done:

Estimated time:

Release Notes:

Default

Affected Plus Version:

22.05

Affected Architecture:

Description

Our customer (Ticket #1161128024) pointed out a possible problem with HA clusters and TCP streams. During troubleshooting, the customer found that running an OpenVPN server on a VIP (the WAN CARP VIP, or an IP Alias bundled with the WAN CARP VIP) causes this issue: during failover, all TCP streams break.

I was able to reproduce this issue:
- HA cluster
- OpenVPN Server with WAN CARP VIP as an Interface
- downloading process (FreeBSD image) and TCP stream (VLC with Network Stream: http://webcam.rhein-taunus-krematorium.de/mjpg/video.mjpg) running on internal host
- active RA OpenVPN connection from external host

Putting the Primary into Persistent CARP Maintenance mode broke both the download and the TCP stream:
- System log on Primary
Oct 17 08:08:24 check_reload_status 389 Reloading filter
Oct 17 08:08:24 kernel ovpns1: link state changed to DOWN
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: XMLRPC reload data success with https://10.10.99.2:443/xmlrpc.php (pfsense.restore_config_section).
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: Beginning XMLRPC sync data to https://10.10.99.2:443/xmlrpc.php.
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: XMLRPC versioncheck: 22.7 -- 22.7
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: XMLRPC reload data success with https://10.10.99.2:443/xmlrpc.php (pfsense.host_firmware_version).
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: Beginning XMLRPC sync data to https://10.10.99.2:443/xmlrpc.php.
Oct 17 08:08:21 check_reload_status 389 Carp backup event
Oct 17 08:08:21 check_reload_status 389 Carp backup event
Oct 17 08:08:21 kernel carp: 3@vtnet1: MASTER -> BACKUP (more frequent advertisement received)
Oct 17 08:08:21 kernel carp: 2@vtnet0: MASTER -> BACKUP (more frequent advertisement received)
Oct 17 08:08:21 kernel carp: 4@vtnet2: MASTER -> BACKUP (more frequent advertisement received)
Oct 17 08:08:21 check_reload_status 389 Carp backup event
Oct 17 08:08:20 check_reload_status 389 Syncing firewall
Oct 17 08:08:20 php-fpm 5328 /status_carp.php: Configuration Change: admin@192.168.122.1 (Local Database): Enter CARP maintenance mode

- System log on Secondary node
Oct 17 08:08:25 php-fpm 359 /rc.start_packages: Restarting/Starting all packages.
Oct 17 08:08:24 check_reload_status 389 Starting packages
Oct 17 08:08:24 check_reload_status 389 Reloading filter
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - -> 172.27.240.1 - Restarting packages.
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: rc.newwanip called with empty interface.
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: rc.newwanip: on (IP address: 172.27.240.1) (interface: []) (real interface: ovpns1).
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: rc.newwanip: Info: starting on ovpns1.
Oct 17 08:08:23 check_reload_status 389 rc.newwanip starting ovpns1
Oct 17 08:08:23 check_reload_status 389 Reloading filter
Oct 17 08:08:23 kernel ovpns1: link state changed to UP
Oct 17 08:08:21 check_reload_status 389 Reloading filter
Oct 17 08:08:21 check_reload_status 389 Syncing firewall
Oct 17 08:08:21 php-fpm 62370 /xmlrpc.php: Configuration Change: (system)@10.10.99.1: Merged in config (staticroutes, gateways, virtualip, system, hasync, aliases, ca, cert, crl, dhcpd, dnshaper, filter, ipsec, nat, openvpn, schedules, shaper, unbound, wol sections) from XMLRPC client.
Oct 17 08:08:21 check_reload_status 389 Carp master event
Oct 17 08:08:21 check_reload_status 389 Carp master event
Oct 17 08:08:21 kernel arp: 10.10.130.1 moved from 00:00:5e:00:01:04 to 52:54:00:33:21:c0 on vtnet2
Oct 17 08:08:21 kernel carp: 2@vtnet0: BACKUP -> MASTER (preempting a slower master)
Oct 17 08:08:21 kernel carp: 3@vtnet1: BACKUP -> MASTER (preempting a slower master)
Oct 17 08:08:21 kernel carp: 4@vtnet2: BACKUP -> MASTER (preempting a slower master)
Oct 17 08:08:21 check_reload_status 389 Carp master event

Updated by Azamat Khakimyanov about 3 years ago

Forgot to add: without OpenVPN running on a VIP, or even with OpenVPN running on WAN, there is no problem with TCP streams during the HA failover test.

Updated by Chris Linstruth about 3 years ago

Verified. Running OpenVPN server bound to Localhost and port forwarding an IP Alias/CARP VIP to it looks like a reasonable workaround for now.

The state vanishes from the cluster member being failed to.

It remains active on the other node so there’s some sort of breakdown in pfsync there too.

The state will hang around on the original node since the app will die and it’ll just sit there in ESTABLISHED:ESTABLISHED for 24 hours.

At least in this test case. VLC gives up pretty quickly.

So: 1. The state shouldn’t be killed. 2. If it is killed, it should be deleted on the original node by pfsync.

Updated by Chris Linstruth about 3 years ago

Subject changed from OpenVPN running on VIP on HA cluster causes duscruption of TCP streams during failover test to OpenVPN running on VIP on HA cluster causes disruption of TCP streams during failover

Updated by Jim Pingle about 3 years ago

Subject changed from OpenVPN running on VIP on HA cluster causes disruption of TCP streams during failover to Restarting an OpenVPN server running on a CARP VIP in an HA cluster can disrupt unrelated TCP states

A few points here after working with cjl a bit trying to narrow it down:

- The states that disappear are not directly related to the VPN (not traffic to/from the VPN or going through the VPN)
- There are no log messages on either node indicating that states are being cleared, so none of the obvious code paths are being hit where it might kill or flush states
- It also happens in certain cases when saving the VPN server, not just during a CARP transition
- None of the options are enabled to kill states on gateway failure or IP address change
- There are no logged auth failures / neither system is in the sshlockout table

Adjusted the subject to more closely align with the observed behavior

Updated by Marcos M about 3 years ago

Additional notes while working with cjl:
Commenting out the line /sbin/pfctl -i $1 -Fs in /usr/local/sbin/ovpn-linkdown "fixes" the issue. However, the script should not be getting called unless the service has already started and stopped; in the case of a failover to the secondary, the service should not be running while in BACKUP status, hence the script shouldn't be executed.
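
For anyone who wants to repeat that test, the edit can be sketched as a small shell helper. This is only a diagnostic sketch, not a fix: the function name disable_state_flush is hypothetical, and it assumes the flush line appears in the script exactly as quoted above (optionally indented). It keeps a .bak copy so the change can be reverted.

```shell
#!/bin/sh
# Hypothetical helper: comment out the "pfctl -i $1 -Fs" flush in a script.
# Assumption: the line reads exactly "/sbin/pfctl -i $1 -Fs", possibly indented.
disable_state_flush() {
  script="$1"
  # Keep a backup, then rewrite the script with the flush line commented out.
  cp "$script" "$script.bak" &&
  sed 's|^\([[:space:]]*\)\(/sbin/pfctl -i \$1 -Fs\)|\1# \2|' "$script.bak" > "$script"
}

# Example (run as root on the firewall, at your own risk):
#   disable_state_flush /usr/local/sbin/ovpn-linkdown
# Revert with:
#   cp /usr/local/sbin/ovpn-linkdown.bak /usr/local/sbin/ovpn-linkdown
```

Since the flush presumably exists for a reason, this is best treated as a way to confirm the cause rather than something to leave in place.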

Updated by Florian Apolloner over 2 years ago

Hi there, I think I am seeing the same issue (on 23.05). I also do have OpenVPN on CARP IPs as of now (though openvpn might not be the only factor causing this). I'll add as much detail here as possible, let me know what you think.

My setup is like this:

client(192.168.11.43) -- (192.168.11.2 on vtnet2.511) pfsense1 (10.7.200.2 on vtnet2.192) -- target(10.7.200.12)
                       | (192.168.11.3 on vtnet2.511) pfsense2 (10.7.200.3 on vtnet2.192) |

So a pretty normal HA-Cluster. Client & Target are both in local networks, OpenVPN is not involved in the connections but is running. During a failover I see no errors anywhere but an SSH session from client to target hangs. Associated with that I see the following blocks in the filter log:

Jun  9 14:28:57 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.511,match,block,in,4,0x48,,64,41324,0,DF,6,tcp,124,192.168.11.43,10.7.200.12,51568,22,72,PA,1544295413:1544295485,73422640,501,,nop;nop;TS
Jun  9 14:30:07 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44921,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:08 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44922,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:08 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44923,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:08 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44924,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:09 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44925,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS
Jun  9 14:30:11 pfSense2 filterlog[94533]: 4,,,1000000103,vtnet2.192,match,block,in,4,0x48,,64,44926,0,DF,6,tcp,104,10.7.200.12,192.168.11.43,22,46976,52,PA,716626969:716627021,2743249057,501,,nop;nop;TS

This indicates to me that pfSense2 doesn't have the required states or is not able to match them up properly.
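
To make blocks like these easier to scan, the raw filterlog lines can be reduced to interface, endpoints and TCP flags. The following is a rough sketch: the function name summarize_blocks is made up, and the field positions are inferred from the IPv4 TCP lines quoted above, so other protocols or IPv6 entries are simply skipped.

```shell
#!/bin/sh
# Hypothetical helper: summarize blocked IPv4 TCP packets from filterlog lines.
# Field positions (5=interface, 7=action, 9=IP version, 17=protocol,
# 19/20=src/dst address, 21/22=src/dst port, 24=TCP flags) are assumptions
# read off the sample lines in this comment.
summarize_blocks() {
  # Strip the syslog prefix up to and including "filterlog[PID]: ".
  sed 's/^.*filterlog\[[0-9]*]: //' |
  awk -F, '$7 == "block" && $9 == "4" && $17 == "tcp" {
    print $5, $19 ":" $21, "->", $20 ":" $22, $24
  }'
}

# Example: feed it a saved copy of the filter log.
#   summarize_blocks < filter.log | sort | uniq -c
```

Piping the result through sort and uniq -c groups the retransmissions, which shows at a glance which connections lost their state.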

On the other hand, grepping for those states properly shows them on both firewalls:

pfctl -s states | grep '10.7.200.12:22'
all tcp 10.7.200.12:22 <- 10.7.22.120:51714       ESTABLISHED:ESTABLISHED
all tcp 10.7.22.120:51714 -> 10.7.200.12:22       ESTABLISHED:ESTABLISHED

Updated by Florian Apolloner over 2 years ago

Marcos M wrote in #note-5:

Additional notes while working with cjl:
Commenting out the line /sbin/pfctl -i $1 -Fs in /usr/local/sbin/ovpn-linkdown "fixes" the issue. However, the script should not be getting called unless the service has already started and stopped; in the case of a failover to the secondary, the service should not be running while in BACKUP status, hence the script shouldn't be executed.

But this would be executed on the master during failover, no? And whatever that does to states would get synced to the secondary?

As an extra note: If I fail back to the primary again, the SSH connection resumes working…

Updated by Florian Apolloner over 2 years ago

I am able to reproduce the issue and I can also confirm that the issue is gone if I comment out /sbin/pfctl -i $1 -Fs. I can also confirm that the states actually vanish from the second firewall during the failover (so something must be actively deleting them?). Interestingly enough if I manually execute /sbin/pfctl -i ovpns1 -Fs it says 0 states cleared and nothing happens on the secondary firewall :/ I am somewhat out of ideas.

Updated by Florian Apolloner over 2 years ago

Debugging even further, this seems to be timing sensitive. If I run pfctl -i ovpns1 -Fs && pfSctl -c 'filter reload all' on the primary, then for a short time the state is missing on the standby. So maybe something weird happens between the CARP events, the filter reload and the flush on ovpns1. Is there a way to monitor state table changes?
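
One crude way to watch for vanishing states, absent a proper event stream, is to take repeated snapshots of pfctl -s states and compare them. A minimal sketch follows; the function name missing_states and the snapshot paths are hypothetical, and states that expire normally between snapshots will show up too.

```shell
#!/bin/sh
# Hypothetical helper: print states present in the first snapshot file
# but absent from the second. Snapshots are plain "pfctl -s states" output.
missing_states() {
  before="$1"
  after="$2"
  tmp_a=$(mktemp) || return 1
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
  # comm needs sorted input; -23 keeps only lines unique to the first file.
  sort -u "$before" > "$tmp_a"
  sort -u "$after" > "$tmp_b"
  comm -23 "$tmp_a" "$tmp_b"
  rm -f "$tmp_a" "$tmp_b"
}

# Example loop on the standby node (run as root; paths are illustrative):
#   pfctl -s states > /tmp/states.prev
#   while sleep 1; do
#     pfctl -s states > /tmp/states.cur
#     missing_states /tmp/states.prev /tmp/states.cur | sed "s/^/$(date +%T) gone: /"
#     mv /tmp/states.cur /tmp/states.prev
#   done
```

A one-second polling interval may still miss a state that disappears and is re-synced in between, so a shorter interval may be needed for a timing-sensitive case like this one.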

#10

Updated by Marcos M over 2 years ago

Potentially related to https://redmine.pfsense.org/issues/11556

#11

Updated by Florian Apolloner over 2 years ago

I don't think those two are related.

#12

Updated by Sebastiano Degan about 2 years ago

Same issue on pfSense 2.7. I confirm that commenting out the line /sbin/pfctl -i $1 -Fs works.

#13

Updated by Azamat Khakimyanov 9 months ago

Tested on 24.11 and 25.03-RC

This issue hasn't been fixed yet. I still see TCP traffic disruption during HA failover if an OpenVPN server is configured with the WAN CARP VIP as its Interface.

And yes, commenting out the line /sbin/pfctl -i $1 -Fs in /usr/local/sbin/ovpn-linkdown does fix this issue
(https://redmine.pfsense.org/issues/13569#note-5)

#14

Updated by Bernhard Schmidt 5 months ago

Also a problem on pfSense CE 2.8.0

#15

Updated by Azamat Khakimyanov 3 months ago

Retested on 25.07.1; the issue hasn't been solved yet.
