Bug #3549

Reported issues with VMware guests on ESX 5.1 patch 201402001

Added by Phil Jaenke over 10 years ago. Updated over 10 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
Operating System
Target version:
-
Start date:
03/27/2014
Due date:
% Done:
0%

Estimated time:
Plus Target Version:
Release Notes:
Affected Version:
Affected Architecture:
i386

Description

I can't find the commit that did it, but I've confirmed it on two hosts with four installs of 2.1.1-PRE using snapshots from various dates in March. Any reconfiguration of em(4) results in the infamous watchdog timeout error. It also intermittently fails DHCP acquisition (RX side), leading to a watchdog timeout. It reproduces under various conditions, including switching polling on or off.

The bug was introduced in a very narrow window - sometime between the March 22 and March 26 snapshots. First definite occurrence is March 25 snapshot, continuing in March 26 snapshot. It's definitely only 2.1.1 too - cannot reproduce on 2.1 or 2.2-ALPHA, nor can I reproduce it with a March 18 snapshot.

Actions #1

Updated by Chris Buechler over 10 years ago

  • Status changed from New to Feedback
  • Target version deleted (2.1.1)
  • Affected Version deleted (2.1)

Not seeing anything along those lines, nor is anyone else, it appears. The supposed introduction dates have no even remotely related changes. I've upgraded at least 8 production VMware firewalls past the date in question, plus a handful of test systems, none of which are having any problems. Doesn't seem to be a legit report, will leave for feedback for now.

Actions #2

Updated by Phil Jaenke over 10 years ago

I have two physical boxes reproducing this, so yes, it is legit. I agree there doesn't seem to be any change that would cause it. Nonetheless, it is there on fresh installs both from restored and new configs. Most likely you're on a different VMware kernel version - needs to be 1612086. May in fact be related to the 201402001 patch which corrects a bug in the e1000 virtual device. First sign I had was when DHCP reply receive was usually 30s+ with occasional total wedge in the driver. Same exact VMX and NVRAM bounced forward or back was doing DHCP reply in <5s typical.

But since it's not reproducing reliably on a March 27 snapshot or from a 26->27 upgrade, looks like the watchdog issue self-corrected anyhow. I'm just as stumped as to possible cause. Only guess I have, based on the symptoms and behavior, is it somehow hit an edge case landing on the wrong side of rxring.

Actions #3

Updated by Chris Buechler over 10 years ago

Is it strictly ESX 5.1 with update 201402001?

Actions #4

Updated by Phil Jaenke over 10 years ago

Yep, on both hosts - the related (and very relevant) VMware KBs are 2072654 and 2072652: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2072654

As I said, cannot reproduce with latest daily though. So I can only guess something was straddling the line for a while and got nudged. Interestingly, I checked my testing notes, and it also failed to hit link down when it should have in em_local_timer() - i.e., em(4) might have actually lost timing rather than the MAC. Looks like it may have also lost its mind on link state - getting an UP state when actually DOWN.

Wondering if this might have actually been part of lingering time issues (calcru backwards, not present in 2.1) exhibiting as an em(4) failure.

Actions #5

Updated by Chris Buechler over 10 years ago

  • Subject changed from em(4) recently broken for VMware guests on 2.1.1-PRE to Reported issues with VMware guests on ESX 5.1 patch 201402001

A timing issue causing a variety of other issues is definitely a more likely cause if you were getting calcru runtime went backwards logs and/or other signs of general timekeeping issues.

Will leave this to Feedback for now, might get more feedback as others install 201402001.

Actions #6

Updated by Phil Jaenke over 10 years ago

Concur on leaving it as monitoring for now, since it's self-corrected. The calcru issues have been around since first 2.1.1 snaps and aren't present on 2.1-REL, for reference. They weren't present when em(4) broke, came back when it unbroke. I should've noticed that first. Cure worse than the disease?

Waiting on upstream to do a lease clear for more testing. I still can't reproduce on vanilla 8.3 boxes or anything else. :/

Actions #7

Updated by Chris Buechler over 10 years ago

  • Status changed from Feedback to Closed

Don't think there are any actual issues here; would have heard from a lot more people by now, I expect.
