Bug #13569

open

Restarting an OpenVPN server running on a CARP VIP in an HA cluster can disrupt unrelated TCP states

Added by Azamat Khakimyanov 4 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
CARP
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Release Notes:
Default
Affected Plus Version:
22.05
Affected Architecture:

Description

Our customer (Ticket #1161128024) pointed out a possible problem with HA clusters and TCP streams. During troubleshooting, the customer found that running an OpenVPN server on a VIP (the WAN CARP VIP or an IP Alias bundled with the WAN CARP VIP) causes this issue: during failover, all TCP streams break.

I was able to reproduce this issue with the following setup:
- HA cluster
- OpenVPN server with the WAN CARP VIP as its Interface
- a download (FreeBSD image) and a TCP stream (VLC with Network Stream: http://webcam.rhein-taunus-krematorium.de/mjpg/video.mjpg) running on an internal host
- an active remote access (RA) OpenVPN connection from an external host

Putting the Primary node into Persistent CARP Maintenance Mode broke both the download and the TCP stream:
- System log on Primary node (newest entries first)
Oct 17 08:08:24 check_reload_status 389 Reloading filter
Oct 17 08:08:24 kernel ovpns1: link state changed to DOWN
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: XMLRPC reload data success with https://10.10.99.2:443/xmlrpc.php (pfsense.restore_config_section).
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: Beginning XMLRPC sync data to https://10.10.99.2:443/xmlrpc.php.
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: XMLRPC versioncheck: 22.7 -- 22.7
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: XMLRPC reload data success with https://10.10.99.2:443/xmlrpc.php (pfsense.host_firmware_version).
Oct 17 08:08:21 php-fpm 5328 /rc.filter_synchronize: Beginning XMLRPC sync data to https://10.10.99.2:443/xmlrpc.php.
Oct 17 08:08:21 check_reload_status 389 Carp backup event
Oct 17 08:08:21 check_reload_status 389 Carp backup event
Oct 17 08:08:21 kernel carp: 3@vtnet1: MASTER -> BACKUP (more frequent advertisement received)
Oct 17 08:08:21 kernel carp: 2@vtnet0: MASTER -> BACKUP (more frequent advertisement received)
Oct 17 08:08:21 kernel carp: 4@vtnet2: MASTER -> BACKUP (more frequent advertisement received)
Oct 17 08:08:21 check_reload_status 389 Carp backup event
Oct 17 08:08:20 check_reload_status 389 Syncing firewall
Oct 17 08:08:20 php-fpm 5328 /status_carp.php: Configuration Change: (Local Database): Enter CARP maintenance mode

- System log on Secondary node (newest entries first)
Oct 17 08:08:25 php-fpm 359 /rc.start_packages: Restarting/Starting all packages.
Oct 17 08:08:24 check_reload_status 389 Starting packages
Oct 17 08:08:24 check_reload_status 389 Reloading filter
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - -> 172.27.240.1 - Restarting packages.
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: rc.newwanip called with empty interface.
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: rc.newwanip: on (IP address: 172.27.240.1) (interface: []) (real interface: ovpns1).
Oct 17 08:08:24 php-fpm 359 /rc.newwanip: rc.newwanip: Info: starting on ovpns1.
Oct 17 08:08:23 check_reload_status 389 rc.newwanip starting ovpns1
Oct 17 08:08:23 check_reload_status 389 Reloading filter
Oct 17 08:08:23 kernel ovpns1: link state changed to UP
Oct 17 08:08:21 check_reload_status 389 Reloading filter
Oct 17 08:08:21 check_reload_status 389 Syncing firewall
Oct 17 08:08:21 php-fpm 62370 /xmlrpc.php: Configuration Change: (system)@10.10.99.1: Merged in config (staticroutes, gateways, virtualip, system, hasync, aliases, ca, cert, crl, dhcpd, dnshaper, filter, ipsec, nat, openvpn, schedules, shaper, unbound, wol sections) from XMLRPC client.
Oct 17 08:08:21 check_reload_status 389 Carp master event
Oct 17 08:08:21 check_reload_status 389 Carp master event
Oct 17 08:08:21 kernel arp: 10.10.130.1 moved from 00:00:5e:00:01:04 to 52:54:00:33:21:c0 on vtnet2
Oct 17 08:08:21 kernel carp: 2@vtnet0: BACKUP -> MASTER (preempting a slower master)
Oct 17 08:08:21 kernel carp: 3@vtnet1: BACKUP -> MASTER (preempting a slower master)
Oct 17 08:08:21 kernel carp: 4@vtnet2: BACKUP -> MASTER (preempting a slower master)
Oct 17 08:08:21 check_reload_status 389 Carp master event

Actions #1

Updated by Azamat Khakimyanov 4 months ago

Forgot to add: without OpenVPN running on a VIP, or even with OpenVPN running on WAN, there is no problem with TCP streams during the HA failover test.

Actions #2

Updated by Chris Linstruth 4 months ago

Verified. Binding the OpenVPN server to Localhost and port forwarding an IP Alias/CARP VIP to it looks like a reasonable workaround for now.

The state vanishes from the cluster member being failed to.

It remains active on the other node so there’s some sort of breakdown in pfsync there too.

The state will hang around on the original node since the app will die and it’ll just sit there in ESTABLISHED:ESTABLISHED for 24 hours.

At least in this test case. VLC gives up pretty quickly.

So: 1. The state shouldn’t be killed. 2. If it is killed, it should also be deleted on the original node by pfsync.
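
To see which node still holds a given state during a failover test, the output of `pfctl -ss` can be filtered on each node. This is only a minimal sketch: `count_states` is a hypothetical helper name, 192.0.2.10 is a placeholder for the internal host's address, and the match is a plain substring match on the text of each state line.

```shell
#!/bin/sh
# count_states: hypothetical helper, not part of pfSense.
# Reads `pfctl -ss` style output on stdin and prints how many lines
# mention the given address and are in ESTABLISHED:ESTABLISHED.
count_states() {
    # -F treats the address as a fixed string, so dots are not wildcards.
    # Note: this is a substring match, so 192.0.2.1 also matches 192.0.2.10.
    grep -F -- "$1" | grep -c 'ESTABLISHED:ESTABLISHED'
}

# Example usage on each node (as root), before and after the failover:
#   pfctl -ss | count_states 192.0.2.10
```

Comparing the counts on both nodes before and after entering CARP maintenance mode should show whether the state is missing on the node being failed to while still present on the original node.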

Actions #3

Updated by Chris Linstruth 4 months ago

  • Subject changed from OpenVPN running on VIP on HA cluster causes duscruption of TCP streams during failover test to OpenVPN running on VIP on HA cluster causes disruption of TCP streams during failover
Actions #4

Updated by Jim Pingle 4 months ago

  • Subject changed from OpenVPN running on VIP on HA cluster causes disruption of TCP streams during failover to Restarting an OpenVPN server running on a CARP VIP in an HA cluster can disrupt unrelated TCP states

A few points here after working with cjl a bit trying to narrow it down:

  • The states that disappear are not directly related to the VPN (not traffic to/from the VPN or going through the VPN)
  • There are no log messages on either node indicating that states are being cleared, so none of the obvious code paths are being hit where it might kill or flush states
  • It also happens in certain cases when saving the VPN server, not just during a CARP transition
  • None of the options are enabled to kill states on gateway failure or IP address change
  • There are no logged auth failures / neither system is in the sshlockout table

Adjusted the subject to more closely align with the observed behavior

Actions #5

Updated by Marcos M 4 months ago

Additional notes while working with cjl:
Commenting out the line /sbin/pfctl -i $1 -Fs in /usr/local/sbin/ovpn-linkdown "fixes" the issue. However, the script should not be called unless the service has already started and then stopped; on a failover to the secondary, the service should not be running while the node is in BACKUP status, so the script should not be executed at all.
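
One way to confirm whether the script actually runs on a node is to log each invocation temporarily. The following is a sketch only, assuming it is added near the top of /usr/local/sbin/ovpn-linkdown on a test system and that `$1` carries the interface name, as in the pfctl line above. `linkdown_msg` is a hypothetical helper, not part of the shipped script.

```shell
# Hypothetical debug helper for a test system.
# Builds the log message for one invocation of the script.
linkdown_msg() {
    # Fall back to "unknown" if no interface name was passed.
    printf 'ovpn-linkdown invoked for interface %s' "${1:-unknown}"
}

# Uncomment to write the message to the system log on each invocation:
# logger -t ovpn-linkdown "$(linkdown_msg "$1")"
```

If the message shows up in the system log during a CARP transition, the script is being called on that node and the state flush is being reached.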
