Project

General

Profile

Actions

Regression #15011

closed

ISC DHCP responds from a random port

Added by Mathias Ringhof 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
DHCP (IPv4)
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
23.09.1
Release Notes:
Default
Affected Version:
2.7.1
Affected Architecture:
All

Description

After upgrading from 23.05.1 to 23.09 I'm seeing a significant jump in log messages from the ISC DHCP server, mostly constant DHCPREQUEST/DHCPACK chains.
I've been looking for other users feedback on the Netgate forums and reddit as well:
https://forum.netgate.com/topic/184104/dhcp-weirdness-after-23-09-upgrade/5
https://www.reddit.com/r/PFSENSE/comments/17xi9ut/isc_dhcp_server_in_2309/?sort=new

Some examples, Windows Server 2019, happens every couple of hours:
https://gist.github.com/mathiasringhof/0558d7e704b2c791be02c7ac4f7bfd16

Others like this IoT device generate less but still hundreds of messages, sometimes ending in DHCPRELEASE (not found):
https://gist.github.com/mathiasringhof/a64104ef73ae1a5c15a3c35283d83266

I've checked the generated DHCPd config and it's the same between 23.05 and 23.09. All settings are on defaults except:
- custom DHCP option 43 for one VLAN
- ranges from .100 to .200 for all 3 VLANs
- static DHCP leases for all 3 VLANs

Reverting back to 23.05 through boot environment change immediately fixes the problem.

Actions #1

Updated by Mathias Ringhof 7 months ago

EDIT: here is a full 24h of the same Windows server DHCP request with 23.05:
https://gist.github.com/mathiasringhof/ce98c51358768695b5016a1b8527a7ac

Actions #2

Updated by Douglas Hoffman 7 months ago

This appears to be related to a change in behavior with the source port being used by isc dhcpd in 23.09/2.7.1.

Comparing packet captures before and after, dhcpd uses source port 67 for ACKs when it works and a random port when it doesn't. It then appears some clients will ignore the response from a different source port, leading to the repeated attempts to renew. Affected clients will continue to unicast REQUESTs until the lease expires, give up, then broadcast a DISCOVER or REQUEST without a client IP, which succeeds. This can be disruptive since affected clients release the IP briefly each time the lease expires.

I also found a report of someone experiencing issues using DHCP relay on 23.09 to an external Kea server, and they're seeing the same source port behavior: https://lists.isc.org/pipermail/kea-users/2023-November/004408.html. Kea running on pfSense is not affected.

Actions #3

Updated by Mathias Ringhof 7 months ago

I cross-checked the binaries MD5sums since configuration etc seems to be the same and indeed there's a difference between 23.09 and 23.05.1 even though both are version isc-dhcpd-4.4.3-P1:
23.09: fd03d9fc52657188b8e44e954438d7b6 /usr/local/sbin/dhcpd
23.05: 1c5b7e8f3d202198186306d9effa32b5 /usr/local/sbin/dhcpd

I've checked their configuration documentation and there seems to be no configuration option to change that behaviour so my assumption is that this is something that can be configured at build time.

For testing I've replaced the 23.09 binary with the 23.05.1 binary in /usr/local/sbin

Actions #4

Updated by Mathias Ringhof 7 months ago

Confirmed that using the 23.05.1 binary in 23.09 I'm seeing the "old" behaviour where the Windows Server doesn't spam the logs and the IoT device doesn't lose its IP address. I haven't experienced any negative side effects so far either, keeping in mind this is only running for less than 12h.

I've had a look at the "configure" file of the ISC DHCP but couldn't find any option that would explain the different behaviour.

Actions #5

Updated by Douglas Hoffman 7 months ago

It's possible to specify the local port using the -p command line argument and the DHCPD_PORT env variable. Neither work around this problem.

[2.7.1-RELEASE][admin@pfSense.home.arpa]/root: /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid em1 -f -d -p 67
binding to user-specified port 67
Internet Systems Consortium DHCP Server 4.4.3-P1
Copyright 2004-2022 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Config file: /etc/dhcpd.conf
Database file: /var/db/dhcpd.leases
PID file: /var/run/dhcpd.pid
Wrote 0 class decls to leases file.
Wrote 2 leases to leases file.
Listening on BPF/em1/00:1c:42:73:d5:df/192.168.1.0/24
Sending on   BPF/em1/00:1c:42:73:d5:df/192.168.1.0/24
Sending on   Socket/fallback/fallback-net
Server starting service.
[2.7.1-RELEASE][admin@pfSense.home.arpa]/root: sockstat -l | grep ^dhcpd
dhcpd    dhcpd      62241 8   udp4   *:51098               *:*
[2.7.1-RELEASE][admin@pfSense.home.arpa]/root: setenv DHCPD_PORT 67
[2.7.1-RELEASE][admin@pfSense.home.arpa]/root: /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid em1 -f -d
Internet Systems Consortium DHCP Server 4.4.3-P1
Copyright 2004-2022 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
binding to environment-specified port 67
Config file: /etc/dhcpd.conf
Database file: /var/db/dhcpd.leases
PID file: /var/run/dhcpd.pid
Wrote 0 class decls to leases file.
Wrote 2 leases to leases file.
Listening on BPF/em1/00:1c:42:73:d5:df/192.168.1.0/24
Sending on   BPF/em1/00:1c:42:73:d5:df/192.168.1.0/24
Sending on   Socket/fallback/fallback-net
Server starting service.
[2.7.1-RELEASE][admin@pfSense.home.arpa]/root: sockstat -l | grep ^dhcpd
dhcpd    dhcpd      89859 7   udp4   *:17260               *:*

Nothing else is using the port:

[2.7.1-RELEASE][admin@pfSense.home.arpa]/root: sockstat -P udp -p 67
USER     COMMAND    PID   FD  PROTO  LOCAL ADDRESS         FOREIGN ADDRESS
[2.7.1-RELEASE][admin@pfSense.home.arpa]/root:

I tried removing the user argument so it runs as root and setting net.inet.ip.portrange.reservedhigh to 0, no change.

Running dhcpd under truss, I see that it's not calling bind() in 2.7.1:

2.7.0:

socket(PF_INET,SOCK_DGRAM,IPPROTO_UDP)         = 7 (0x7)
setsockopt(7,SOL_SOCKET,SO_REUSEADDR,0x820520f0c,4) = 0 (0x0)
bind(7,{ AF_INET 0.0.0.0:67 },16)         = 0 (0x0)

2.7.1:

socket(PF_INET,SOCK_DGRAM,IPPROTO_UDP)         = 7 (0x7)
setsockopt(7,SOL_SOCKET,SO_REUSEADDR,0x82091288c,4) = 0 (0x0)

The bind() call appears to be dependent on some USE flags in the unmodified code, but socket() and setsockopt() are, as well, and they get called. If bind() is failing, we should get some log output indicating that failure. It seems it's just not getting called.

common/socket.c:

#if defined (USE_SOCKET_SEND) || \
    defined (USE_SOCKET_RECEIVE) || \
        defined (USE_SOCKET_FALLBACK)

[...snip...]

    /* Make a socket... */
    sock = socket(domain, SOCK_DGRAM, IPPROTO_UDP);
    if (sock < 0) {
        log_fatal("Can't create dhcp socket for %s: %m", info->name);
    }

    /* Set the REUSEADDR option so that we don't fail to start if
       we're being restarted. */
    flag = 1;
    if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
            (char *)&flag, sizeof(flag)) < 0) {
        log_fatal("Can't set SO_REUSEADDR on dhcp socket for" 
              " %s: %m", info->name);
    }

[...snip...]

    /* Bind the socket to this interface's IP address. */
    if (bind(sock, (struct sockaddr *)&name, name_len) < 0) {
        log_error("Can't bind to dhcp address: %m");
        log_error("Please make sure there is no other dhcp server");
        log_error("running and that there's no entry for dhcp or");
        log_error("bootp in /etc/inetd.conf.   Also make sure you");
        log_error("are not running HP JetAdmin software, which");
        log_fatal("includes a bootp server.");
    }

I'm also seeing different checksums between my 2.7.0 and 2.7.1 test environments:

[2.7.0-RELEASE][admin@pfSense.home.arpa]/root: md5sum /usr/local/sbin/dhcpd
0f548aeb4f409f63894f03f93129f7b6  /usr/local/sbin/dhcpd
[2.7.1-RELEASE][admin@pfSense.home.arpa]/root: md5sum /usr/local/sbin/dhcpd
fd03d9fc52657188b8e44e954438d7b6  /usr/local/sbin/dhcpd

Replacing the binary with the one from 2.7.0 fixes this problem:

[2.7.1-RELEASE][admin@pfSense.home.arpa]/root: md5sum /usr/local/sbin/dhcpd
0f548aeb4f409f63894f03f93129f7b6  /usr/local/sbin/dhcpd
[2.7.1-RELEASE][admin@pfSense.home.arpa]/root: sockstat -P udp -p 67
USER     COMMAND    PID   FD  PROTO  LOCAL ADDRESS         FOREIGN ADDRESS
dhcpd    dhcpd      38621 10  udp4   *:67                  *:*

It seems some change was made affecting the bind() call in 23.09/2.7.1.

Actions #6

Updated by Douglas Hoffman 7 months ago

I think I found the problem...

https://github.com/pfsense/FreeBSD-ports/commit/3b827dc6cfe3aebec16332b6494cb2742757ac71

+/*
+ * We only need to bind the socket for the DHCPv6 case.
+ * DHCPv4 receives exlusively using the packet filter and only
+ * uses this UDP socket for sending routed (with ARP) unicast.
+ * This is the so-called, "fallback" interface. There is no
+ * need in binding INADDR_ANY for DHCPv4.
+ */
+#if defined(DHCPv6)
     /* Bind the socket to this interface's IP address. */
-    if (bind(sock, (struct sockaddr *)&name, name_len) < 0) {
+    if ((local_family == AF_INET6) &&
+        bind(sock, (struct sockaddr *)&name, name_len) < 0) {
         log_error("Can't bind to dhcp address: %m");
         log_error("Please make sure there is no other dhcp server");
         log_error("running and that there's no entry for dhcp or");
@@ -263,6 +272,7 @@ if_register_socket(struct interface_info *info, int fa
         log_error("are not running HP JetAdmin software, which");
         log_fatal("includes a bootp server.");
     }
+#endif

It looks like common/socket.c is intentionally patched to only call bind() for IPv6, so that an IPv4 server and IPv4 relay can run concurrently.

Actions #7

Updated by Jim Pingle 7 months ago

  • Project changed from pfSense Plus to pfSense
  • Category changed from DHCP Server (IPv4) to DHCP (IPv4)
  • Assignee set to Christian McDonald
  • Target version set to 2.8.0
  • Affected Plus Version deleted (23.09)
  • Plus Target Version set to 24.03
Actions #8

Updated by name name 7 months ago

I don't know if it is related, but my HA setup, where the backup pfSense is offline due to a hardware defect, didn't give out any DHCP leases after upgrading to 23.09 until I removed the Failover peer ip from each DHCP VLAN/Interface configuration. It basically just stopped working, I didn't get an IP using my iphone, laptop, workstation, ... The only thing that had internet connection were the systems that had statically assigned addresses.

EDIT: The state of the master pfSense DHCP leases overview showed "recover" (if I remember correctly) for each instance. And I believe the peer state was shown as unknown.

Actions #9

Updated by Marcos M 7 months ago

  • Subject changed from ISC DHCP server generates tons of DHCPREQUEST/DHCPACK messages in 23.09 vs 23.05 causing some devices to lose IP address to ISC DHCP responds from a random port
  • Priority changed from Normal to High
  • Affected Version set to 2.7.1

Looks like the standard leaves it up to the client:

The time over which the client collects messages and the mechanism used to select one DHCPOFFER are implementation dependent.

https://datatracker.ietf.org/doc/rfc2131/

From what I've seen, many clients assume the communication will happen on the standard ports. With that in mind, the previous behavior should be the default and the new behavior (if kept) should be an optional flag when starting the service. Considering that ISC DHCP is deprecated and there seem to be other issues with this (see #14620), we may simply wish to restore the old behavior.

Actions #10

Updated by Ansley Barnes 7 months ago

Marcos M wrote in #note-9:

Looks like the standard leaves it up to the client:

The time over which the client collects messages and the mechanism used to select one DHCPOFFER are implementation dependent.

https://datatracker.ietf.org/doc/rfc2131/

From what I've seen, many clients assume the communication will happen on the standard ports. With that in mind, the previous behavior should be the default and the new behavior (if kept) should be an optional flag when starting the service. Considering that ISC DHCP is deprecated and there seem to be other issues with this (see #14620), we may simply wish to restore the old behavior.

Is there a way to restore the old behavior via a configuration option or patch? From the thread https://forum.netgate.com/topic/184104/dhcp-weirdness-after-23-09-upgrade it seems like people are solving it by reverting the binary but if it could be accomplished via something a little cleaner that'd be ideal. I can't switch to Kea.

Actions #11

Updated by Douglas Hoffman 7 months ago

Although RFC2131 doesn't specifically dictate the source ports, RFC8357 was proposed to address the requirement to use fixed source ports in relays:

RFC2131:

DHCP uses UDP as its transport protocol. DHCP messages from a client to a server are sent to the 'DHCP server' port (67), and DHCP messages from a server to a client are sent to the 'DHCP client' port (68).

RFC8357:

This document defines an extension to relax the fixed UDP source port requirement for the DHCP relay agents. This extension requires a DHCP server to remember the inbound packet's UDP port number along with the IPv4/IPv6 address. When sending back replies, the DHCP server MUST use the UDP port number that the incoming relay agent uses instead of the fixed DHCP port number.

This seems to imply the intent of RFC2131 was that a fixed source port be used, at least for relays. The new behavior doesn't comply with this.

It's also generally expected that, although UDP is connectionless, exchanges between client and server will continue to use the same port pair. This obviously isn't a hard requirement, only dictated by the protocol itself, but this new behavior broke a number of clients that were apparently designed with that expectation in mind. This isn't just one poorly designed client – right now, ISC DHCP on pfSense 23.09 is unusable for me, as most of the devices on my network were experiencing problems of some sort.

If pfSense wants to support mixed use of DHCP servers and relays, it seems it should do so by binding to specific IPs on the RFC-defined port, rather than not binding to the correct port at all.

Actions #12

Updated by Marcos M 7 months ago

  • Status changed from New to Pull Request Review
  • Assignee changed from Christian McDonald to Marcos M
Actions #13

Updated by Marcos M 7 months ago

  • Status changed from Pull Request Review to Feedback
Actions #14

Updated by Marcos M 7 months ago

Before the fix, DHCP logs show the following when a client tries to renew (logs reversed):

Nov 27 11:51:33     dhcpd     86033     DHCPACK on 10.0.5.50 to a8:a1:59:xx:xx:xx via vmx0.5
Nov 27 11:51:33     dhcpd     86033     DHCPREQUEST for 10.0.5.50 from a8:a1:59:xx:xx:xx via vmx0.5
Nov 27 11:51:29     dhcpd     86033     DHCPACK on 10.0.5.50 to a8:a1:59:xx:xx:xx via vmx0.5
Nov 27 11:51:29     dhcpd     86033     DHCPREQUEST for 10.0.5.50 from a8:a1:59:xx:xx:xx via vmx0.5
Nov 27 11:51:27     dhcpd     86033     DHCPACK on 10.0.5.50 to a8:a1:59:xx:xx:xx via vmx0.5
Nov 27 11:51:27     dhcpd     86033     DHCPREQUEST for 10.0.5.50 from a8:a1:59:xx:xx:xx via vmx0.5
Nov 27 11:51:24     dhcpd     86033     DHCPACK on 10.0.5.50 to a8:a1:59:xx:xx:xx via vmx0.5
Nov 27 11:51:24     dhcpd     86033     DHCPREQUEST for 10.0.5.50 from a8:a1:59:xx:xx:xx via vmx0.5 

... and after the fix:

Nov 27 12:16:01     dhcpd     24850     DHCPACK on 10.0.5.50 to a8:a1:59:xx:xx:xx via vmx0.5
Nov 27 12:16:01     dhcpd     24850     DHCPREQUEST for 10.0.5.50 from a8:a1:59:xx:xx:xx via vmx0.5

The fix will be included in 23.09/2.7.1 in the next ports build, after which running pfSense-repoc; pkg upgrade will pick it up (after updating, restart the service or reboot).

Actions #15

Updated by Marcos M 7 months ago

  • Status changed from Feedback to Resolved
Actions #16

Updated by Jim Pingle 7 months ago

  • Target version changed from 2.8.0 to 2.7.2
  • Plus Target Version changed from 24.03 to 23.09.1
Actions

Also available in: Atom PDF