ESXi host not reachable on Network – How to Troubleshoot


This post describes my troubleshooting experience with a recent issue that turned out to be caused by a pause flood on HP Virtual Connect. Most VMware administrators are already familiar with basic network troubleshooting: pinging the host and checking for physical NIC failures, cable connectivity, switch port failures, or even a router failure. This post does not cover those basic procedures.


I got an alert from the monitoring team that one of our ESXi hosts was not reachable on the network. My first thought was a PSOD (Purple Screen of Death), and I expected to reboot the host to recover from it. When I connected to the host's iLO, however, the host was up, yet it still did not respond to ping. I then suspected a problem with the network adapter, but that was not it. Next I checked the host's physical cabling, which was fine, and the network team confirmed that there were no switch port failures. Finally, I verified the status of the ESXi host's network adapters from iLO: it showed all NICs as down.



My ESXi host runs on a blade server, and we use HP Virtual Connect as the interconnect for the servers in the blade chassis. I suspected that something could be wrong with Virtual Connect, so I decided to analyze its logs. In the Virtual Connect system log I found the error message “Port was disabled because a pause flood was detected”.


I then checked the port status of the Virtual Connect interconnect bays:
Connect to HP Virtual Connect -> Hardware -> click Interconnect Bays -> click Bay 1 or Bay 2 -> verify the status under the Server Ports tab. It displayed a port status of “Not Linked/Pause Flood Detected”, which confirmed that the issue was caused by a pause flood. In some cases, a Flex-10 port can enter a disabled state because pause flood protection or network loop protection has been triggered.

You can confirm the same port status using the Virtual Connect Manager CLI.
Connect to Virtual Connect using SSH and execute the command below:
show port-protect

What is Pause Flood

We now know that this issue was caused by a pause flood, so let us look at what a pause flood is. Ethernet switch interfaces use pause-frame-based flow control mechanisms to control data flow. When a pause frame is received on a flow-control-enabled interface, transmission stops for the pause duration specified in the pause frame. All other frames destined for this interface are queued. If another pause frame is received before the previous pause timer expires, the pause timer is refreshed to the new pause duration value. If a steady stream of pause frames is received for an extended period of time, the transmit queue for that interface continues to grow until all queuing resources are exhausted. This condition severely impacts switch operation on other interfaces.
In addition, all protocol operations on the switch are impacted because protocol frames cannot be transmitted. Both port pause and priority-based pause frames can cause the same resource exhaustion condition. VC can monitor server downlink ports for pause flood conditions and take protective action by disabling the affected port. The default polling interval is 10 seconds and is not user configurable. VC provides system logs and SNMP traps for events related to pause flood detection. This feature operates at the physical port level: when a pause flood condition is detected on a Flex-10 physical port, all Flex-10 logical ports associated with that physical port are disabled. Detection and port disabling only happen when the pause flood protection feature is enabled.
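To make the queue behaviour described above more concrete, here is a minimal Python sketch of a single port under pause-frame flow control. It is only a toy model: the queue limit, arrival rate, drain rate, and tick-based timer are all made-up values for illustration, and it does not represent how the Virtual Connect firmware is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Port:
    queue_limit: int          # hypothetical transmit queue capacity, in frames
    queued: int = 0           # frames currently waiting to be transmitted
    pause_remaining: int = 0  # ticks left on the pause timer
    dropped: int = 0          # frames lost because the queue was full

    def receive_pause(self, duration: int) -> None:
        # A new pause frame refreshes the timer to the new duration.
        self.pause_remaining = duration

    def tick(self, arriving: int, drain_rate: int) -> None:
        # Frames destined for this port keep arriving whether it is paused or not.
        accepted = min(arriving, self.queue_limit - self.queued)
        self.queued += accepted
        self.dropped += arriving - accepted
        if self.pause_remaining > 0:
            self.pause_remaining -= 1  # paused: nothing is transmitted this tick
        else:
            self.queued -= min(self.queued, drain_rate)


def simulate(pause_every: int, ticks: int = 50) -> Port:
    """Run the toy model; pause_every=0 means no pause frames are received."""
    port = Port(queue_limit=100)
    for t in range(ticks):
        if pause_every and t % pause_every == 0:
            port.receive_pause(duration=5)
        port.tick(arriving=4, drain_rate=8)
    return port


if __name__ == "__main__":
    for label, interval in (("no pause", 0), ("occasional pause", 20), ("pause flood", 3)):
        p = simulate(interval)
        print(f"{label}: queued={p.queued} dropped={p.dropped}")
```

With a pause frame arriving every 3 ticks and a pause duration of 5 ticks, the timer is always refreshed before it expires, so the port never transmits and the queue fills up. An occasional pause only builds a short backlog that drains once the timer runs out.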

How to Fix Pause Flood Issue:

The port remains disabled until an administrative action is taken. The administrative action involves the following steps:
Action Plan 1: The temporary, immediate fix is to re-enable the disabled ports on the VC interconnect modules using the method below
1. Connect to your Virtual Connect using SSH
2. Execute the command below
 reset port-protect

3. Verify the port status again using the command below and ensure that no port's protect type is reported as “Pause Flood”
show port-protect
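If you save the output of the verification command to a file or capture it from an SSH session, the check in step 3 can be scripted. The Python sketch below is only an illustration: I am not reproducing the real output layout here, so the sample text and the port names in it are hypothetical, and the function simply looks for the phrase “Pause Flood” on each line. A header line that names the feature would also match, so tighten the filter to suit the output you actually see.

```python
from typing import List


def pause_flood_lines(show_output: str) -> List[str]:
    """Return the lines of captured CLI output that mention a pause flood."""
    return [
        line.strip()
        for line in show_output.splitlines()
        if "pause flood" in line.lower()  # plain, case-insensitive substring match
    ]


# Hypothetical sample text for illustration only, NOT real Virtual Connect output.
sample = """\
enc0:1:d3   Pause Flood
enc0:1:d4   OK
"""

if __name__ == "__main__":
    flagged = pause_flood_lines(sample)
    if flagged:
        print("Ports still protected:")
        for line in flagged:
            print("  " + line)
    else:
        print("No port reports a pause flood.")
```

An empty result after resetting the ports suggests they have been re-enabled; any remaining line points to a port that is still disabled.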


The above command fixed my issue immediately.
Action Plan 2: Update the drivers and firmware
Resolve the issue with the NIC on the server that is causing the continuous pause generation. This might include updating the NIC firmware and device drivers.
In my case, Action Plan 1 was enough: as soon as I re-enabled the ports, my ESXi host started responding to ping again.
