VMDK / API Snapshot woes


Where do I begin?

 

I feel like I am always a newbie with VMware, despite working with it for a few years. We are running a vSAN environment on 6.5 and performing backups with Veeam. About two weeks ago, some of the VMs in our backup job started throwing errors. Due to some events outside of my control, I only started looking at this today. Veeam support said the error was caused by a corrupt VMX file, and the recommended solution was to shut down the machine, remove it from inventory, create a new machine using the existing disks, and bring it back up. We performed this procedure on a non-critical machine, and it worked great. Did it to a semi-critical machine, and it worked great again. Did it to our Exchange server and... it wasn’t great.

 

The server came back up, but after a few hours of operation a large number of people reported that about two weeks of email was missing. We had the machine up for about five hours, poking around in logs, before I shut it down to focus on the VMware side of things. After a ton of digging in the guest as well as in the host environment, I figured out the root cause: despite there being no snapshots in the snapshot manager, the system had been running off a snapshot left behind by the failed backup. I made the mistake of attaching the original vmdk files to the new machine rather than the 000001.vmdk files. That was my own mistake of making assumptions; I thought those files were somehow orphaned since the snapshot manager listed no snapshots. The previous machines, where the procedure succeeded, either didn’t have a snapshot file, or historical data didn’t matter on those guests.
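The check that would have caught this can be sketched in a few lines of Python. This is only an illustration, assuming the snapshot descriptors follow the `<base>-NNNNNN.vmdk` naming seen in the listing further down; the function name `find_delta_descriptors` is made up for this example and is not part of any VMware tool.

```python
import re

# Snapshot (delta) descriptors in this post are named "<base>-NNNNNN.vmdk".
# Change-tracking files ("...-ctk.vmdk") do not match this pattern.
DELTA_RE = re.compile(r"^(?P<base>.+)-(?P<seq>\d{6})\.vmdk$")


def find_delta_descriptors(filenames):
    """Map each base descriptor name to the delta descriptors stacked on it."""
    chains = {}
    for name in filenames:
        match = DELTA_RE.match(name)
        if match:
            base = match.group("base") + ".vmdk"
            chains.setdefault(base, []).append(name)
    # Sort so the highest-numbered delta comes last in each list.
    for deltas in chains.values():
        deltas.sort()
    return chains
```

Feeding it the output of `os.listdir()` for the VM directory would return a non-empty mapping whenever delta descriptors are present, regardless of what the snapshot manager shows.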

 

After talking with VMware support, they basically said that since the machine was booted from the original vmdks, the damage is done and I should consider the data lost. They did say I could try to remove the drives from the guest and re-add the snapshot versions, but they had little faith that it would work and warned of a high chance of corrupting both the vmdk and the snapshot vmdk. Since the last shutdown, I’ve kept the server powered off and have been looking for any option to get this machine back to life with its current data, and I have run into a brick wall every time. Because of the corruption warnings I’m being cautious about every step from this point on, so I’ve copied all files except the snapshot files from their original location on the datastore to a different location to mitigate the risk of further corruption. The snapshot files, however, will simply not budge. Web client copy, SSH copy, vmkfstools -i: nothing will get those files somewhere else at their original size (though I can download what looks to be the header with WinSCP).
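One way to inspect the small descriptor files without touching the data is to read their chain fields. The sketch below assumes the descriptors are plain text containing the usual `CID`, `parentCID` and `parentFileNameHint` entries; the post does not show their contents, so that is an assumption, and the helper names are hypothetical. The idea is that a snapshot descriptor records the CID its parent had when the snapshot was taken, so a mismatch suggests the parent was written to afterwards.

```python
import re

# Matches descriptor lines such as:  CID=fffffffe  or  parentFileNameHint="disk.vmdk"
FIELD_RE = re.compile(
    r'^\s*(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"\r\n]*)"?\s*$'
)


def parse_descriptor(text):
    """Extract the chain-related fields from the text of a VMDK descriptor."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def parent_matches(child_fields, parent_fields):
    """Return True if the child's parentCID equals the parent's current CID."""
    child_parent_cid = child_fields.get("parentCID")
    parent_cid = parent_fields.get("CID")
    if child_parent_cid is None or parent_cid is None:
        return False
    return child_parent_cid.lower() == parent_cid.lower()
```

This only reads copies of the descriptor text; it says nothing about whether the data behind a mismatched chain is still usable.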

 

I’m desperately trying to safeguard the snapshot data before doing something that may corrupt the whole guest, and then get this thing back into an up-to-date, running condition. Since this is an Exchange server, the files are quite large. Just copying out the files took three hours. I’m now attempting a clone, as I’ve read that a clone may merge snapshot files automatically, with the hope that it won’t impact the original files. If the clone doesn’t work, my last resort would be to try booting off the snapshots, knowing I may lose everything. Finally I’ve landed here, having seen some users get success with the help of some of you truly amazing experts. The final kick in the rear is that our management is getting ready to accept the data loss just to get the server back on and email flowing, so their patience is thin. I’m casting a bottle into the sea here, hoping it comes back with some much-needed help in time. Attaching the relevant info that I’ve seen requested in other posts:

 

Directory ls -lh of original files:

 

-rw-r--r--    1 root     root          92 Oct 24  2018 CAKEXK01-8d4db6ef.hlog
-rw-------    1 root     root       32.6K Nov 15 08:02 CAKEXK01-Snapshot557.vmsn
-rw-r--r--    1 root     root          13 May  8  2019 CAKEXK01-aux.xml
-rw-------    1 root     root        8.5K Nov 14 08:12 CAKEXK01.nvram
-rw-------    1 root     root          45 Nov 14 08:12 CAKEXK01.vmsd
-rwx------    1 root     root        4.6K Dec  6 21:22 CAKEXK01.vmx
-rw-------    1 root     root        3.3K May 17  2018 CAKEXK01.vmxf
-rw-------    1 root     root        5.0M Dec  6 21:22 CAKEXK01_3-000001-ctk.vmdk
-rw-------    1 root     root         408 Nov 15 08:02 CAKEXK01_3-000001.vmdk
-rw-------    1 root     root         600 Dec  7 04:12 CAKEXK01_3.vmdk
-rw-------    1 root     root        5.9M Dec  6 21:22 CAKEXK01_4-000001-ctk.vmdk
-rw-------    1 root     root         409 Nov 15 08:02 CAKEXK01_4-000001.vmdk
-rw-------    1 root     root         576 Dec  7 04:12 CAKEXK01_4.vmdk
-rw-------    1 root     root        2.0M Dec  6 21:22 CAKEXK01_5-000001-ctk.vmdk
-rw-------    1 root     root         407 Nov 15 08:09 CAKEXK01_5-000001.vmdk
-rw-------    1 root     root         598 Dec  7 04:12 CAKEXK01_5.vmdk
drwxr-xr-x    1 root     root         280 Dec  7 06:38 bak
-rw-------    1 root     root      299.5K May 17  2018 vmware-3.log
-rw-------    1 root     root       15.2M Sep 21  2018 vmware-4.log
-rw-------    1 root     root        3.0M Oct 18  2018 vmware-5.log
-rw-------    1 root     root      393.2K Oct 22  2018 vmware-6.log
-rw-------    1 root     root      467.3K Oct 24  2018 vmware-7.log
-rw-------    1 root     root      244.0K Oct 24  2018 vmware-8.log
-rw-------    1 root     root       45.4M Dec  6 21:22 vmware.log

 

Directory ls -lh of the newly created machine that is pointing to the above vmdks:

 

-rw-r--r--    1 root     root         295 Dec  6 21:35 CAKEXK01-35be335f.hlog
-rw-------    1 root     root        8.5K Dec  7 05:25 CAKEXK01.nvram
-rw-r--r--    1 root     root           0 Dec  6 21:35 CAKEXK01.vmsd
-rwxr-xr-x    1 root     root        3.8K Dec  7 05:25 CAKEXK01.vmx
-rw-------    1 root     root        3.1K Dec  6 21:45 CAKEXK01.vmxf
-rw-r--r--    1 root     root        1.0M Dec  7 03:08 vmware-1.log
-rw-r--r--    1 root     root      322.3K Dec  7 05:25 vmware.log
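
To confirm which descriptors a machine is actually pointed at, the disk entries can be pulled out of its .vmx file. This is a rough sketch, assuming the .vmx uses the usual `<controller><n>:<m>.fileName = "..."` lines for disks; the contents of the .vmx files above are not shown in this post, and `disks_in_vmx` is a hypothetical helper name.

```python
import posixpath
import re

# Matches lines such as:  scsi0:1.fileName = "CAKEXK01_3.vmdk"
DISK_LINE_RE = re.compile(r'^\s*(\w+:\d+)\.fileName\s*=\s*"([^"]*)"\s*$')


def disks_in_vmx(vmx_text):
    """Return {device: vmdk file name} for every .vmdk referenced in a .vmx."""
    disks = {}
    for line in vmx_text.splitlines():
        match = DISK_LINE_RE.match(line)
        if not match:
            continue
        device, path = match.groups()
        # Skip non-disk backings (for example CD-ROM entries).
        if path.lower().endswith(".vmdk"):
            # Keep only the file name; the path may be absolute.
            disks[device] = posixpath.basename(path)
    return disks
```

Comparing the result for the old and new .vmx files would show whether each device points at a base descriptor or at a 000001 descriptor.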
