Software iSCSI failover lockup/freeze

Hi all,

We have a really simple setup with 3 hosts, two switches and storage:

3x HPE ProLiant DL360 Gen10

2x Cisco Nexus 9000 (N9K)

1x HPE/Nimble HF40

vSphere 6.7 U3

Each host has a dedicated iSCSI standard switch with two uplink ports assigned. The switch has two port groups, each with its own vmk. In the first port group one uplink is active and the other is set inactive; the second port group uses the opposite assignment. The uplinks are cross-connected to the two Nexus switches. iSCSI discovery is to the Nimble group IP.

The switches are joined by a stack link (apologies, I'm not much of a network guy, so the actual detail might be lacking) and carry a dedicated iSCSI VLAN that the hosts and storage are connected to. Jumbo frames are configured end to end (host, switch, storage) and everything runs at 10Gb over DAC cabling.
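For anyone wanting to double-check the jumbo frame path end to end, the usual approach is a don't-fragment ping with the largest payload that fits in one frame. Below is a minimal sketch of that arithmetic. It assumes IPv4 and an MTU of 9000; the post only says "jumbo frames", so the exact MTU is an assumption.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an IPv4 ICMP echo")
    return payload


if __name__ == "__main__":
    # 9000 is an assumed jumbo MTU; 1500 is shown for comparison.
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: max ping payload {max_ping_payload(mtu)} bytes")
```

On an ESXi host the result would typically be passed to vmkping with the don't-fragment flag, something like `vmkping -I vmk1 -d -s 8972 <storage data IP>`, where the vmk name and target address are placeholders for each iSCSI vmk and storage interface.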

 

The storage setup is basic, with the iSCSI side dedicated to data flow. vCenter is on a standard VMFS datastore and everything else is vVOL. VASA integration is enabled from the storage side. Nimble Connection Service and the Path Selection Plug-in are installed on the hosts (latest version of these, 7.0), and the datastore and vVOLs use Nimble_PSP_Directed as the path selection policy.

We need this to be highly available, and for the most part it is. The problem comes when we test iSCSI path failure, whether by switch failure, host NIC fault or storage NIC fault. A session connected to vCenter web becomes unresponsive for 30-45 seconds: selecting another item loads nothing and the blue circle spins in the top right of the screen. Something similar happens with Windows guests: when connected through remote console or RDP, we lose access for 30-45 seconds, although ping continues to work. Predictably, vSphere logs a path degradation error.
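Since ping keeps working while the sessions freeze, one way to put a number on the stall is to time small synchronous writes from inside a guest while a path is pulled. The following is a rough sketch only, not something from the original test runs. The probe file name, the 2 second threshold and the 120 second run length are all arbitrary assumptions.

```python
import os
import tempfile
import time


def timed_fsync_write(path: str, payload: bytes = b"x" * 4096) -> float:
    """Write payload to path, force it to disk, and return elapsed seconds."""
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # block until the OS reports the data as written
    return time.monotonic() - start


def find_stalls(samples, threshold_s=2.0):
    """Return the (wall_clock_time, latency) samples slower than threshold_s."""
    return [(when, latency) for when, latency in samples if latency > threshold_s]


def run(path: str, duration_s: float = 120.0, interval_s: float = 1.0):
    """Probe the disk roughly once per interval for duration_s seconds."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append((time.time(), timed_fsync_write(path)))
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    # Hypothetical probe file; place it on the virtual disk under test.
    probe_file = os.path.join(tempfile.gettempdir(), "io_stall_probe.bin")
    for when, latency in find_stalls(run(probe_file)):
        stamp = time.strftime("%H:%M:%S", time.localtime(when))
        print(f"{stamp} write took {latency:.1f}s")
```

Started just before a path is failed, this prints the wall-clock time and duration of any write that stalled, which can then be lined up against the path degradation events in the vSphere logs.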

 

Now, according to "Path Failover and Virtual Machines" this is expected behaviour, but "Array-Based Failover with iSCSI" says reconnection happens quickly. I’m under the impression we’re using array-based failover?

The ideal scenario is that we don’t lose access to machines for longer than a few seconds, not tens of seconds. VMware, Nimble and Cisco support are all scratching their heads over this.
