WAN doesn't reconnect on dropped PPPoE session
I've been troubleshooting this issue for about a month solidly now, and am certain it's a bug after swapping out everything else. Scenario is:
Dell PowerEdge T320 with onboard dual Broadcom 5720 NICs, and dual Intel i350 NICs on a PCI card. Running VMware ESXi 6.0U2.
pfSense is the only VM on this server, and is given 2 GB RAM, a 10 GB disk and 1 vCPU. It has 1 x vNIC associated to "LAN" and 1 x vNIC associated to "WAN".
LAN is a vSwitch with nothing else attached to it on the ESXi box itself, just a few physical devices on the LAN (VoIP handsets, Ruckus AP, Wyse term, etc), and it uses the private IP address range 192.168.31.0/24.
WAN is a vSwitch with a single drop cable into a Draytek Vigor 130 ADSL/VDSL modem
ISP is ICUK, a medium size British ISP, and the line is an ADSL line with 5 Mbps/1 Mbps provided by BT.
There is also an IPSec site to site link configured against another pfSense instance (2.2.6) at head office.
pfSense started out as being version 2.3, this exhibited the issue, then was updated to version 2.3_1, this also exhibited the issue, and the whole VM was then rebuilt with version 2.3.1 which ALSO exhibits the issue. I'm not using any packages other than OpenVMTools (which I didn't start out using until 2.3.1 and it's not affected the issue at all).
Essentially, the problem is that when started, pfSense successfully negotiates a PPPoE connection with the ISP's RADIUS servers. This connection works fine for a while, but within a day of the connection being live it will be brought down and will not renegotiate/redial to become live again.
Things I have tried so far:
- Changing modem
- Changing NICs from Broadcom to Intel
- Rebuilding new VM
- Changing cabling
- Adding a periodic reset (this was successful possibly 1 or 2 times in total but wholly unsuccessful otherwise)
- Changing the WAN to dial-on-demand mode
With almost every change I make to the WAN config, the line goes down and won't come back up. The only ways I have found to get the WAN back up are as follows:
- Restart pfSense
- Unplug the ethernet cable from the modem, wait 10 seconds, plug the ethernet cable back into the modem
- Restart the modem (this only appears to work maybe 25% of the time)
As far as the logs go, I had a Reddit post here about the issue which includes the logs: https://www.reddit.com/r/PFSENSE/comments/4kn981/dpinger_behaviour/
Also, the ISP have confirmed that when the PPPoE is down the modem is in sync with the line, and can be for several days without pfSense redialling (or at least the redials getting through to the modem/ISP).
#1 Updated by Chris Buechler over 3 years ago
- Subject changed from Dpinger doesn't fully restart WAN interface on dropped PPPoE session to WAN doesn't reconnect on dropped PPPoE session
- Category changed from Gateway Monitoring to Interfaces
- Status changed from New to Feedback
- Priority changed from Very High to Normal
Gateway monitoring has no relation to PPPoE reconnection.
mpd is retrying over and over to connect in your logs. As soon as it has the first indication that something is live and replying to you, it connects successfully. Doubt this is a bug (in anything we have any control over). The fact that just unplugging and replugging the modem resolves it, which has no impact whatsoever on the VM, strongly suggests the problem exists elsewhere.
You'll need to capture traffic from multiple reference points to see what's happening. A SPAN port between the ESX host and the modem, and something on the same vswitch as the modem on a promiscuous port group would be the two places to start.
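When comparing the captures from those two points, the main question is whether the PPPoE discovery frames (PADI from the client, PADO back from the access concentrator) appear at each of them. A minimal Python sketch for labelling raw Ethernet frames pulled out of a capture is below. The EtherTypes and discovery codes are the ones defined in RFC 2516; the function name, the assumption of untagged (no VLAN) frames, and the hand-built sample frame with its made-up source MAC are illustrative only and not taken from this setup.

```python
import struct

# EtherTypes and discovery codes as defined in RFC 2516.
ETH_PPPOE_DISCOVERY = 0x8863
ETH_PPPOE_SESSION = 0x8864
DISCOVERY_CODES = {
    0x09: "PADI",  # client broadcast: looking for an access concentrator
    0x07: "PADO",  # access concentrator offer
    0x19: "PADR",  # client request
    0x65: "PADS",  # session confirmed
    0xA7: "PADT",  # session terminated
}


def classify_frame(frame: bytes) -> str:
    """Return a short label for one raw Ethernet frame (no VLAN tag assumed)."""
    if len(frame) < 20:  # 14-byte Ethernet header + 6-byte PPPoE header
        return "too short"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ETH_PPPOE_SESSION:
        return "PPPoE session data"
    if ethertype != ETH_PPPOE_DISCOVERY:
        return "not PPPoE"
    # PPPoE header: ver/type (1 byte), code (1), session id (2), length (2)
    _ver_type, code, session_id, _length = struct.unpack("!BBHH", frame[14:20])
    name = DISCOVERY_CODES.get(code, f"unknown code 0x{code:02x}")
    return f"{name} (session 0x{session_id:04x})"


# Hand-built PADI: broadcast destination, made-up source MAC, empty payload.
padi = bytes.fromhex("ffffffffffff" "005056000001" "8863" "1109" "0000" "0000")
print(classify_frame(padi))
```

If PADI frames show up on the vSwitch capture but not on the SPAN port, the loss is between the host and the modem; if they show up at both points with no PADO coming back, the modem or the ISP side is not answering.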
#2 Updated by Xander Venterus over 3 years ago
Just a piece of outside advice: do not rule out the Draytek. I'm a network engineer with 24 certifications, and I have seen a lot of strange issues revolving around Draytek hardware, including frequent drops of PPPoE connections (or just of VPN tunnels over PPPoE), sudden unexpected reboots, and once in a blue moon a random total hang of the modem, to the point where you can't even access it without rebooting it. It may be in your best interest to try a completely different brand of modem.
Make sure the modem is set to MTU 1492, and perhaps also set the interface leading to the Draytek to that MTU as well. It may help, and has helped in many cases for me, though the sure-fire fix has usually been to replace the Draytek unit with ANY other brand.
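For anyone wondering where 1492 comes from, here is the arithmetic as a small Python sketch. It assumes a standard 1500-byte Ethernet MTU and IPv4/TCP headers without options; the variable names are just for illustration.

```python
ETHERNET_MTU = 1500   # standard Ethernet payload size (assumed)
PPPOE_HEADER = 6      # ver/type, code, session id, length
PPP_PROTOCOL_ID = 2   # PPP protocol field carried inside the PPPoE payload
IPV4_HEADER = 20      # without options
TCP_HEADER = 20       # without options

# Largest IP packet that fits in one PPPoE frame.
pppoe_mtu = ETHERNET_MTU - PPPOE_HEADER - PPP_PROTOCOL_ID
# Largest TCP payload that fits in that packet.
tcp_mss = pppoe_mtu - IPV4_HEADER - TCP_HEADER

print(pppoe_mtu)  # 1492
print(tcp_mss)    # 1452
```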
#4 Updated by Michael Knowles over 3 years ago
I actually visited site today to deal with pfSense. I have a couple of interesting observations and did get partway towards writing a reply to this at lunchtime, but saved it for posterity as I tried one last thing...
Essentially, when on the LAN I logged in to pfSense and found 2 problems with it:
1) The WAN interface setting had got itself lost somehow - i.e. when going into the logs I found the following excerpt:
May 31 13:38:04 kernel uhub2: <VMware Virtual USB Hub> on usbus0
May 31 13:38:04 kernel Root mount waiting for: usbus0
May 31 13:38:04 kernel uhub2: 7 ports with 7 removable, self powered
May 31 13:38:04 kernel Trying to mount root from ufs:/dev/ufsid/5745933d1e58e601 [rw]...
May 31 13:38:04 kernel WARNING: / was not properly dismounted
May 31 13:38:04 kernel vmx1: link state changed to UP
May 31 13:38:04 kernel pflog0: promiscuous mode enabled
May 31 13:38:04 php-cgi rc.bootup: The command '/sbin/ifconfig 'pppoe0' -staticarp ' returned exit code '1', the output was 'ifconfig: interface pppoe0 does not exist'
May 31 13:38:04 php-cgi rc.bootup: The command '/usr/sbin/arp -d -i 'pppoe0' -a > /dev/null 2>&1 ' returned exit code '1', the output was ''
May 31 13:38:04 php-cgi rc.bootup: Resyncing OpenVPN instances.
May 31 13:38:04 check_reload_status Linkup starting vmx1
May 31 13:38:04 xinetd 7522 xinetd Version 2.3.15 started with libwrap loadavg options compiled in.
May 31 13:38:04 xinetd 7522 Started working: 1 available service
May 31 13:38:04 kernel .done.
May 31 13:38:04 kernel done.
This was when I asked the on-site lady to restart ESXi for me. As you can see, it can't find the pppoe0 interface.
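To pull failures like these out of a longer system log, something along these lines can help. It is only a rough Python sketch: the regular expression is written against the two rc.bootup lines quoted above and assumes that wording, and the function name is hypothetical.

```python
import re

# Matches pfSense log entries of the form seen above:
#   The command '<cmd>' returned exit code '<n>', the output was '<text>'
FAILED_CMD = re.compile(
    r"The command '(?P<cmd>.*)' returned exit code '(?P<code>\d+)', "
    r"the output was '(?P<out>.*)'$"
)


def failed_commands(lines):
    """Return (command, exit_code, output) for every non-zero exit in lines."""
    results = []
    for line in lines:
        match = FAILED_CMD.search(line.rstrip())
        if match and match.group("code") != "0":
            results.append(
                (match.group("cmd").strip(), int(match.group("code")), match.group("out"))
            )
    return results


# Three lines copied from the excerpt above.
boot_log = [
    "May 31 13:38:04 kernel vmx1: link state changed to UP",
    "May 31 13:38:04 php-cgi rc.bootup: The command '/sbin/ifconfig 'pppoe0' -staticarp ' "
    "returned exit code '1', the output was 'ifconfig: interface pppoe0 does not exist'",
    "May 31 13:38:04 php-cgi rc.bootup: The command '/usr/sbin/arp -d -i 'pppoe0' -a > /dev/null 2>&1 ' "
    "returned exit code '1', the output was ''",
]

failures = failed_commands(boot_log)
for command, code, output in failures:
    print(f"exit {code}: {command} -> {output or '(no output)'}")
```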
I then went into Interfaces/WAN/Advanced and MLPPP, and found that whilst it listed both vmx0 and vmx1 for interfaces, no interface was highlighted. I highlighted vmx0, hit save, and BOOM! internet was back.
Now, since I was on the WAN side of this machine when it last lost its internet connectivity, and nothing had happened since then aside from me asking the lady to restart ESXi, this has to be some sort of bug, because it wouldn't have worked had the interface not been highlighted previously.
The second problem was shown in the PPP logs. Here's the excerpt:
May 31 10:59:50 ppp [wan_link0] LCP: Down event
May 31 10:59:50 ppp [wan_link0] LCP: state change Closed --> Initial
May 31 10:59:52 ppp [wan] Bundle: Shutdown
May 31 10:59:52 ppp [wan_link0] Link: Shutdown
May 31 10:59:52 ppp process 36152 terminated
Jun 9 10:29:27 ppp Multi-link PPP daemon for FreeBSD
Jun 9 10:29:27 ppp process 54097 started, version 5.8 (root@pfSense_v2_3_1_amd64-pfSense_v2_3_1-job-13 19:20 16-May-2016)
Jun 9 10:29:27 ppp web: web is not running
Jun 9 10:29:27 ppp [wan] Bundle: Interface ng0 created
Jun 9 10:29:27 ppp [wan_link0] Link: OPEN event
Essentially, no PPP process was running between those two sets of dates. This is obviously an issue if you're trying to re-establish a connection, since it means that within a few hours of a failed PPPoE connection the daemon is going to shut down and need manual intervention to re-authenticate. Note also that the May 31 time is 2 and a half hours before the lady restarted ESXi, so it's likely that it gave up pretty quickly.
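To put a number on that gap, here is a small Python sketch that pairs each "process N terminated" line with the next "process N started" line in the PPP log and reports the time in between. The log lines carry no year, so 2016 is an assumption taken from the mpd build date in the excerpt; the function names are illustrative, and the month parsing assumes an English locale.

```python
import re
from datetime import datetime

ASSUMED_YEAR = 2016  # syslog lines carry no year; taken from the mpd build date

STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+ppp\s+(.*)$")


def parse(line):
    """Split a PPP log line into (timestamp, message), or None if it doesn't match."""
    m = STAMP.match(line)
    if not m:
        return None
    month, day, clock, message = m.groups()
    when = datetime.strptime(
        f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )
    return when, message


def daemon_gaps(lines):
    """Return (stopped, started, duration) for each period with no mpd process."""
    gaps = []
    stopped_at = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        when, message = parsed
        if re.search(r"process \d+ terminated", message):
            stopped_at = when
        elif re.search(r"process \d+ started", message) and stopped_at is not None:
            gaps.append((stopped_at, when, when - stopped_at))
            stopped_at = None
    return gaps


# Lines copied from the excerpt above (version string shortened).
ppp_log = [
    "May 31 10:59:52 ppp [wan_link0] Link: Shutdown",
    "May 31 10:59:52 ppp process 36152 terminated",
    "Jun 9 10:29:27 ppp Multi-link PPP daemon for FreeBSD",
    "Jun 9 10:29:27 ppp process 54097 started, version 5.8",
]

gaps = daemon_gaps(ppp_log)
for stopped, started, duration in gaps:
    print(f"no mpd process from {stopped} to {started} ({duration})")
```

For the excerpt above this reports a single gap of 8 days, 23:29:35.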
Anyway, in order to see what else I could do to deal with the root cause I rebuilt the ESXi box whilst I was there (with the same 6.0U2), as my thought was that if you guys at ESF were adamant that the issue existed outside of pfSense then that was the only item remaining to change. I postulated that it could also have been that the ESXi firewall was somehow coming down over the VM at random periods.
The box has subsequently been up for 10 and a half hours so far. This is not a new record however, as the VM lasted about 26 hours previously before it died and couldn't be contacted again. Hence I was going to leave it a couple of days to post that it might have been a VMware issue too, but needed to reply now you prompted me so that you didn't think I was abandoning the issue.
#5 Updated by Michael Knowles over 3 years ago
Goddammit, no, the rebuild of VMware wasn't the problem, as it's just gone down AGAIN.
Strangely enough, the RADIUS logs from the ISP indicate that pfSense managed to cope with a lost carrier signal at 12:40am today, but the port error at 8:41am was enough to make it stop responding.