Bug #1235
closedpfsense 2.0 load balancing with a https monitor seems to default timeout 200ms causing constant timeouts
0%
Description
Hi,
I've been failing with load balancing between two https web servers on pfsense 2.0.
I have two https servers in a pool, port 443, with a custom monitor. The custom monitor is configured like so:
Type: HTTPS
Path: /some/url
Host: some.host.tld
HTTP Code: 200
Once I bring my webservers up, I see requests from the pfsense boxes and I see both my web servers responding with 200's... all good.
However, my load balancers are mostly marked as down, occassionally they will become active for a few seconds, but then they go down again.
A snippet from the logs suggest:
Jan 27 14:39:43 relayd[40896]: host 10.105.255.22, check http code use ssl (180ms), state down -> up, availability 0.43% Jan 27 14:39:43 relayd[40618]: pfe_dispatch_imsg: state 1 for host 4 10.105.255.22 Jan 27 14:39:43 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:39:43 relayd[40896]: host 10.105.255.21, check http code use ssl (210ms), state up -> down, availability 0.46% Jan 27 14:39:43 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:39:43 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:39:43 relayd[40618]: pfe_dispatch_imsg: state -1 for host 3 10.105.255.21 Jan 27 14:39:53 relayd[40618]: table TIM_LB2_443: 1 added, 1 deleted, 0 changed, 0 killed Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.22 (http code ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.22 (http code ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.21 (http code ok) Jan 27 14:40:03 relayd[40896]: host 10.105.255.21, check http code use ssl (186ms), state down -> up, availability 0.42% Jan 27 14:40:03 relayd[40618]: pfe_dispatch_imsg: state 1 for host 1 10.105.255.21 Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:40:13 relayd[40618]: table TIM_LB1_443: 1 added, 0 deleted, 0 changed, 0 killed Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:40:13 relayd[40618]: pfe_sync: enabling ruleset Jan 27 14:40:13 relayd[40618]: sync_ruleset: rule added to rdr-anchor "relayd/TIM_LB1_443" Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:40:13 relayd[40896]: host 10.105.255.21, check http code use ssl (205ms), state up -> down, availability 0.42% Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:40:13 relayd[40896]: host 10.105.255.22, check http code use ssl (205ms), state up -> down, availability 0.44% Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:40:13 relayd[40618]: pfe_dispatch_imsg: state -1 for host 1 10.105.255.21 Jan 27 14:40:13 relayd[40618]: pfe_dispatch_imsg: state -1 for host 4 10.105.255.22
Ok... this almost looks like less than 200ms response relayd takes the server up, greater than 200ms it takes it down again.
Further investigation suggests that relayd checks default to 200ms.... which is slightly cunning, because if I look at the /var/etc/relayd.conf file:
log updates table <WEB_443> { 10.105.255.21, 10.105.255.22 } redirect "LB1_443" { listen on 10.105.255.23 port 443 forward to <WEB_443> port 443 check https '/some/url' host some.host.tld code 200 timeout 1000 }
Which suggest that a timeout of 1000ms is being set... unfortunately it doesn't seem to have the desired effect?! At least, it dosen't seem to be having the desired effect as my servers keep dropping out of the pool after > 200ms response
For what it's worth a second load balancer that simply does tcp check on a port, that returns in less than 200ms every single time works perfectly.