Bug #1235
closedpfsense 2.0 load balancing with a https monitor seems to default timeout 200ms causing constant timeouts
0%
Description
Hi,
I've been failing with load balancing between two https web servers on pfsense 2.0.
I have two https servers in a pool, port 443, with a custom monitor. The custom monitor is configured like so:
Type: HTTPS
Path: /some/url
Host: some.host.tld
HTTP Code: 200
Once I bring my webservers up, I see requests from the pfsense boxes and I see both my web servers responding with 200's... all good.
However, my load balancers are mostly marked as down, occassionally they will become active for a few seconds, but then they go down again.
A snippet from the logs suggest:
Jan 27 14:39:43 relayd[40896]: host 10.105.255.22, check http code use ssl (180ms), state down -> up, availability 0.43% Jan 27 14:39:43 relayd[40618]: pfe_dispatch_imsg: state 1 for host 4 10.105.255.22 Jan 27 14:39:43 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:39:43 relayd[40896]: host 10.105.255.21, check http code use ssl (210ms), state up -> down, availability 0.46% Jan 27 14:39:43 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:39:43 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:39:43 relayd[40618]: pfe_dispatch_imsg: state -1 for host 3 10.105.255.21 Jan 27 14:39:53 relayd[40618]: table TIM_LB2_443: 1 added, 1 deleted, 0 changed, 0 killed Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.22 (http code ok) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:39:53 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.22 (http code ok) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.21 (http code ok) Jan 27 14:40:03 relayd[40896]: host 10.105.255.21, check http code use ssl (186ms), state down -> up, availability 0.42% Jan 27 14:40:03 relayd[40618]: pfe_dispatch_imsg: state 1 for host 1 10.105.255.21 Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:40:03 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:40:13 relayd[40618]: table TIM_LB1_443: 1 added, 0 deleted, 0 changed, 0 killed Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:40:13 relayd[40618]: pfe_sync: enabling ruleset Jan 27 14:40:13 relayd[40618]: sync_ruleset: rule added to rdr-anchor "relayd/TIM_LB1_443" Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.21 (tcp connect ok) Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.22 (tcp connect ok) Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:40:13 relayd[40896]: host 10.105.255.21, check http code use ssl (205ms), state up -> down, availability 0.42% Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.22 (ssl read timeout) Jan 27 14:40:13 relayd[40896]: host 10.105.255.22, check http code use ssl (205ms), state up -> down, availability 0.44% Jan 27 14:40:13 relayd[40896]: hce_notify_done: 10.105.255.21 (ssl read timeout) Jan 27 14:40:13 relayd[40618]: pfe_dispatch_imsg: state -1 for host 1 10.105.255.21 Jan 27 14:40:13 relayd[40618]: pfe_dispatch_imsg: state -1 for host 4 10.105.255.22
Ok... this almost looks like less than 200ms response relayd takes the server up, greater than 200ms it takes it down again.
Further investigation suggests that relayd checks default to 200ms.... which is slightly cunning, because if I look at the /var/etc/relayd.conf file:
log updates table <WEB_443> { 10.105.255.21, 10.105.255.22 } redirect "LB1_443" { listen on 10.105.255.23 port 443 forward to <WEB_443> port 443 check https '/some/url' host some.host.tld code 200 timeout 1000 }
Which suggest that a timeout of 1000ms is being set... unfortunately it doesn't seem to have the desired effect?! At least, it dosen't seem to be having the desired effect as my servers keep dropping out of the pool after > 200ms response
For what it's worth a second load balancer that simply does tcp check on a port, that returns in less than 200ms every single time works perfectly.
Updated by John Welter over 13 years ago
I have confirmed this issue. The default relayd timeout of 200ms is being used even with a timeout specified inside the redirect stanza. Adding timeout 1000 right after the 'log updates' (the global section of the config) works to increase the timeout for all load balance pools in the configuration.
Updated by John Welter over 13 years ago
Did some quick tests: The timeout option is only valid as a global config or in the table stanza, not in a redirect or relay. If it is not in the global section or in a table stanza it is ignored, silently.
The following works:
log updates
table <WEB_443> { 10.105.255.21, 10.105.255.22 timeout 1000 }
redirect "LB1_443" {
listen on 10.105.255.23 port 443
forward to <WEB_443> port 443 check https '/some/url' host some.host.tld code 200 timeout 1000
}
The following does not:
log updates
table <WEB_443> { 10.105.255.21, 10.105.255.22 }
redirect "LB1_443" {
listen on 10.105.255.23 port 443
forward to <WEB_443> port 443 check https '/some/url' host some.host.tld code 200 timeout 1000
}
Hope this helps. When a patch is available I can easily test it.
Updated by Chris Buechler over 13 years ago
- Priority changed from High to Normal
- Target version set to 2.0
Updated by Warren Baker over 13 years ago
- Status changed from New to Feedback
I have added a patch for this but for now made this a global option. We should, at a later stage add the option for configurable timeouts.
Changeset applied in revision 4a916dc89b36b92ba1a7076af04b157ca88990d7
Updated by Chris Buechler over 13 years ago
- Status changed from Feedback to Resolved