I have been helping a customer troubleshoot a problem where some virtual machines were hanging.
The VMs just stopped answering on the network and could not be accessed in the console. A hang lasted up to about a minute, but sometimes, after a few seconds, the VM hung again, and this could repeat over a long period of time. The long hangs mostly happened during the night.
It was a strange problem because of how it behaved:
- The VMs were not all on the same host
- Not all VMs on a host were affected, only some of them
- We tried different versions of VMware Tools, but the VMs still hung
- The VMs ran different operating systems
- Long hangs mostly came after an image backup, when it removed the snapshot, but not exclusively
- VMs using both PVSCSI and LSI Logic controllers were affected
- It happened at random times
What we found:
In the ESXi server logs, we could see “1 Outstanding Command” messages; the 1 could also be any other value greater than 0. See screenshot from vRealize Log Insight:
We also made a dashboard to show this information over time:
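If you do not have Log Insight available, a similar over-time view can be approximated from exported log lines. The snippet below is only a rough sketch: it assumes each line starts with an ISO 8601 timestamp and contains the text "N Outstanding Command", and the function name `outstanding_per_hour` and the sample lines are made up for illustration. They are not real ESXi output.

```python
import re
from collections import Counter

# Assumption: each log line starts with an ISO 8601 timestamp such as
# 2019-08-01T02:13:45.123Z and contains "<N> Outstanding Command".
PATTERN = re.compile(
    r"^(?P<hour>\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}\S*\s.*?"
    r"\b(?P<count>\d+)\s+outstanding\s+commands?\b",
    re.IGNORECASE,
)


def outstanding_per_hour(lines):
    """Count matching log lines per hour, skipping lines that report 0."""
    per_hour = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match and int(match.group("count")) > 0:
            per_hour[match.group("hour")] += 1
    return per_hour


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "2019-08-01T02:13:45.123Z example-host: 1 Outstanding Command",
        "2019-08-01T02:14:02.456Z example-host: 3 Outstanding Commands",
        "2019-08-01T03:01:10.000Z example-host: 0 Outstanding Command",
        "2019-08-01T03:05:00.000Z example-host: unrelated message",
    ]
    for hour, hits in sorted(outstanding_per_hour(sample).items()):
        print(hour, hits)
```

Grouping per hour is an arbitrary choice here; it is just enough to see whether the messages cluster at night or around backup windows.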
In the Windows OS we could see Event ID 129 (failure to write to disk) and, up to about a minute later, Event ID 1 (showing that the clock was set), as shown in the screenshot below:
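To get a feel for how long each hang lasted, the two events can be paired up. This is a minimal sketch, assuming you have already exported the events as (timestamp, event ID) pairs; the function name `pair_hang_events` and the two-minute window are my own choices for illustration, not something taken from the Windows tooling.

```python
from datetime import datetime, timedelta


def pair_hang_events(events, max_gap=timedelta(minutes=2)):
    """Pair each Event ID 129 with the next Event ID 1 within max_gap.

    events: iterable of (datetime, event_id) tuples.
    Returns a list of (time_of_129, time_of_1, gap) tuples.
    """
    pairs = []
    pending = None  # start of the hang we are currently tracking
    for when, event_id in sorted(events):
        if event_id == 129:
            # Keep the earliest 129 of a burst, but drop a stale one.
            if pending is None or when - pending > max_gap:
                pending = when
        elif event_id == 1 and pending is not None:
            if when - pending <= max_gap:
                pairs.append((pending, when, when - pending))
            pending = None
    return pairs


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    events = [
        (datetime(2019, 8, 1, 2, 0, 0), 129),
        (datetime(2019, 8, 1, 2, 0, 50), 1),
        (datetime(2019, 8, 1, 3, 0, 0), 129),
    ]
    for start, end, gap in pair_hang_events(events):
        print(start, end, gap.total_seconds())
```

A 129 without a matching Event ID 1 inside the window is simply ignored, so the result only lists hangs where both events were logged.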
After some troubleshooting, we found a Fibre Channel ISL that was generating a lot of errors. The ISL was disabled, cleaned, and enabled again, and after that there were no more errors on it, but the problem persisted.
The servers are Dell servers running ESXi 6.7 Update 2 EP10.
The customer is using Dell OpenManage Essentials and has therefore installed the OMSA module on the servers; this was version 9.3.0.
VMware and DellEMC each have a KB article stating that OMSA 9.3.0 can cause problems with vMotion and disconnects from vCenter. Neither says that it could also affect the virtual machines, but the customer updated to the newest version, 9.3.1, anyway.
This actually resolved the problem; the VMs have now been running for some days without issues.
So the VMware and DellEMC KB articles are missing the information that this issue can also hang VMs.