It occurred to me that many people are not aware that vSAN's intelligence does not come just from being ordinary hyper-converged infrastructure (HCI) software, but from its tight integration with vSphere.
Many of us wonder in what way vSAN is so intelligent. vSAN tracks an Object Component State, and there are four states, namely:
- Active
- Active – Stale
- Absent
- Degraded
So let's go through each of them to have a complete understanding. A good article that summarizes them can be found here.
Active
This is the simplest. vSAN indicates that the object is accessible with no errors and is functioning as it should.
Active – Stale
This happens when a component is not in sync, meaning it holds data that is out of date. For illustration, assume FTT=1 with RAID 1 is in place, so the object has a replica. Both replicas should be in sync, and a sequence number is written to ensure both are up to date. When one of the replicas falls behind, perhaps because of a network disconnection between two hosts, vSAN sees from the sequence number once the connection resumes that the data is out of date, and it starts repairing from the point where the replica fell behind. Here is an article that explains. The intelligent thing is that vSAN decides whether to rebuild the replica or repair the data, whichever is faster.
Some solutions would mostly do a full repair and use up all the bandwidth for that activity, leaving less for critical traffic when you need it. vSAN, in this case, is able to determine whether a full rebuild or a partial repair makes sense.
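To make the rebuild-versus-repair choice concrete, here is a minimal sketch in Python. It is not vSAN's actual logic: the `Write` record, the `choose_repair` function, and the "copy fewer bytes" cost rule are all assumptions made up for illustration. It only shows the idea of using a sequence number to find what a stale replica missed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Write:
    seq: int         # sequence number stamped on this write (hypothetical model)
    size_bytes: int  # how many bytes this write changed


def choose_repair(stale_seq: int, log: List[Write], object_size_bytes: int) -> Tuple[str, int]:
    """Pick a strategy for a replica that was last in sync at stale_seq.

    Assumption: the cheaper option is simply the one that copies fewer bytes.
    """
    # Everything written after the replica's last known sequence number was missed.
    missed = [w for w in log if w.seq > stale_seq]
    delta_bytes = sum(w.size_bytes for w in missed)

    if delta_bytes < object_size_bytes:
        return ("partial_resync", delta_bytes)
    return ("full_rebuild", object_size_bytes)


if __name__ == "__main__":
    log = [Write(seq=i, size_bytes=10) for i in range(1, 6)]
    print(choose_repair(stale_seq=3, log=log, object_size_bytes=1000))
    print(choose_repair(stale_seq=0, log=log, object_size_bytes=40))
```

In this toy model, a replica that missed only two small writes gets a partial resync, while one that missed more data than the object holds gets a full rebuild.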
Absent
This is easy: it just means the object component is missing and not present. This can happen when one of the object components is on a failed host and you are left with only one copy of the data. vSAN will start rebuilding after a delay (60 minutes), which you can adjust by following this kb. A good article to read more on this.
Why 60 minutes?
Testing found that 60 minutes is the ideal mean time to wait before concluding that it's a real failure.
Why not repair immediately?
The reason is that vSAN cannot determine the cause of the loss. If your host is only disconnected because a network cable was knocked loose, and it is fixed within 10 minutes, would you want to start a full rebuild of a 1TB VM, or wait to confirm a real failure first?
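The waiting behaviour can be sketched as a simple timer. This is an illustrative model only: the function name `rebuild_start_minute` and its parameters are hypothetical, and the 60-minute value is just the delay mentioned above.

```python
from typing import Optional


def rebuild_start_minute(
    absent_at: int,
    returned_at: Optional[int],
    repair_delay_min: int = 60,
) -> Optional[int]:
    """Return the minute a rebuild would start, or None if it never starts.

    absent_at:   minute the component went absent
    returned_at: minute the component came back, or None if it never did
    """
    deadline = absent_at + repair_delay_min
    # If the component comes back before the delay expires, no rebuild is needed.
    if returned_at is not None and returned_at < deadline:
        return None
    return deadline


if __name__ == "__main__":
    # Cable knocked loose, fixed after 10 minutes: no rebuild.
    print(rebuild_start_minute(absent_at=0, returned_at=10))
    # Host never comes back: rebuild begins once the delay expires.
    print(rebuild_start_minute(absent_at=0, returned_at=None))
```

With the 10-minute cable example, the timer never expires, so the 1TB rebuild is avoided entirely.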
Degraded
Degraded simply shows the tight integration between vSphere and vSAN. With Degraded, vSAN is aware of the error codes produced when a device becomes inaccessible. In a scenario where a disk controller, cache disk, or capacity disk fails, vSAN understands from the error code that the device is not coming back, and it triggers an immediate repair of the object components. This article illustrates it.
vSAN is intelligent because it understands the error code, so it can determine whether it is a real failure or whether it should wait to confirm. Some solutions would just perform a full rebuild in every scenario since they do not differentiate between them, flooding the network pipe and possibly leaving less for critical data.
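Putting the four states together, the behaviour described in this post can be summarized as a small decision function. Again, this is a sketch under assumptions: the `ComponentState` enum, the `repair_action` function, and the action strings are invented here for illustration and do not correspond to any real vSAN API.

```python
from enum import Enum


class ComponentState(Enum):
    ACTIVE = "active"
    ACTIVE_STALE = "active-stale"
    ABSENT = "absent"
    DEGRADED = "degraded"


def repair_action(
    state: ComponentState,
    minutes_in_state: int = 0,
    repair_delay_min: int = 60,
) -> str:
    """Map a component state to the action described in the text."""
    if state is ComponentState.ACTIVE:
        return "none"            # healthy, nothing to do
    if state is ComponentState.ACTIVE_STALE:
        return "resync"          # catch up from the last good sequence number
    if state is ComponentState.DEGRADED:
        return "rebuild_now"     # error code says the device is not coming back
    # ABSENT: cause unknown, so wait out the delay before rebuilding.
    if minutes_in_state >= repair_delay_min:
        return "rebuild_now"
    return "wait"


if __name__ == "__main__":
    for state in ComponentState:
        print(state.value, "->", repair_action(state, minutes_in_state=10))
```

The point of the sketch is the asymmetry: only Absent waits, because it is the only state where the cause of the failure is unknown.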