This post might sound a bit like a game of Cluedo, but it happened

This post might sound a bit like a game of Cluedo, but it happened on a production SQL
Server…
But now you want to prove it
This post might sound a bit like a game of Cluedo, but it happened on a production SQL
Server…
The issue showed itself on a hosted production VM with 64GB of Ram, where SQL could
allocate memory from 32GB to 55GB. The agreements with the hosting partner clearly
mentioned that while the CPU could be overcommitted, memory couldn’t.
All my alerts suddenly started to show that the machine was using all the available and
became completely unresponsive (I couldn’t even open the Event Viewer to unload the logs).
Remote connection with management studio gave me this worrisome error:
But ping responses proved to be ok (less than 3 ms) so I didn’t suspect memory, page life
expectancy gave a hint that something was certainly not OK
The Microsoft Performance dashboard showed worrying sings, so I RDP’d onto the server,
and saw the horror of any SQL DBA this SQL Server was enduring Swapping Hell… whilst it
should never even have to use its swap file… Ok something or someone murdered my SQL
Server… let’s start the CSI investigation…
In the Recource Monitor we see that SQL is “only” getting ±20GB and is struggling to run.
Swap file is overloaded and the server is running like a snail with arthritis.
So I suspected VMWare ballooning was the cruel pit here, but as this was especially agreed
upon that this machine wouldn’t use memory ballooning I had to prove that it was
happening Unfortunately as this is a hosted environment I could not check the ESX console
to see the ballooning parameters, so I had to look for another way to find the smoking gun.
Luckily I found good documentation on the VMWare site explaining that there is a command
file in the VMWare tools folder that just could to the trick, so I tried this on our murder(ed)
victim and Hurray! It gave me the proof I needed
So the murder weapon was known… VM Ware’s ballooning was to blame…
Cluedo summary
– Murder victim: Production Server
– Killed in VMWare ESX environment
– By the VM Ware ballooning service
Now I can go to the hosting partner and ask them to fix this, could become interesting
because in the contracts, we especially agreed on not turning on ballooning on production
SQL Servers
On a side note, normally VMWare should have released memory when the (SQL) Server
started to ask more resources, but somehow it didn’t, so I also send them the following link
to a known issue where the Balloon driver does not releases its memory when using default
parameters:
Balloon driver retains hold on memory causing virtual machine guest operating system
performance issues (1003470)
I’m interested how this is going to go on
On the other hand, the VMWareToolBoxCMD stat seems to be extremely useful, in many
situations, so I pasted the most interesting ones from the complete list here: