Chef on Windows – detecting and fixing WMI problems which prevent chef-client runs


At thetrainline.com we use Opscode Chef for managing our build infrastructure. Like many other tools running on Windows, the chef-client ohai framework relies on WMI for extracting information about the server machine on which scripts are being run. We found that Windows WMI repository corruption can cause chef-client runs to fail due to missing WMI classes, which causes the node to remain out of policy. The WMI repo can be repaired using winmgmt /salvagerepository, and the WMI errors can be monitored using the WMIDiag script to alert on WMI repository corruption before future chef-client runs. This post details how we detected and fixed the problem, and how to monitor for WMI repository corruption.

Background

We’ve been using Chef since early 2012 to manage (define, test, control) our build infrastructure, which consists of around 200 machines: about 150 are near-identical build agents for ThoughtWorks GO, and the other 50 are specialised servers for things like Artifactory, NuGet, Git + Gitolite, Subversion, Graphite, etc. Of these 50 ‘core’ machines, perhaps half run Windows Server 2008 R2 (the others run a Red Hat flavour of Linux). Most of these core machines (both Windows and Linux) are now under policy using Chef (we’ve also automated the provisioning of the GO agents, but that’s for a future blog post!).

Ohai, WMI!

When chef-client runs on a node, it executes the ohai reporting tool to gather information about that node and report it back to the server (partly to allow search-driven recipes). If the node is running Windows, then ohai relies on WMI to gather the node data, which is the right pattern for the Windows platform. When the ohai Ruby code retrieves information from the node, it actually calls out to the underlying WMI subsystem:

[2013-07-08T15:20:08+01:00] DEBUG: Loading plugin windows::cpu
[2013-07-08T15:20:10+01:00] DEBUG: Loading plugin windows::filesystem
[2013-07-08T15:20:13+01:00] DEBUG: Loading plugin windows::uptime

In the Ruby code for the windows::uptime plugin, the ruby-wmi gem bridges directly into the Windows WMI subsystem to retrieve uptime data. The same pattern is followed for most of the other data gathered by ohai for Windows nodes.

Chef::Exceptions::CannotDetermineNodeName: Unable to determine node name

Recently a chef-client run on one of the Windows nodes failed with ‘Chef::Exceptions::CannotDetermineNodeName: Unable to determine node name’, which seemed weird to say the least. Following the backtrace, we could see that something relating to WMI was not right.

The key to solving the problem here is the error code 0x80041002, which in WMI terms means ‘Object cannot be found’ (WBEM_E_NOT_FOUND). Specifically, this is the COM/OLE subsystem for WMI (aka WBEM) complaining that it cannot find the requested WMI class or value, in this case likely SystemUpTime, using the ::WMI::Win32_PerfFormattedData_PerfOS_System helper methods of ruby-wmi. It seems that malformed WMI operations (from custom WMI providers?) may be to blame, leading to a situation where

the WMI service incorrectly processes a deletion operation inside the WMI repository. [KB 2464876]

Whatever the cause, the result is that WMI operations (including read) fail; this means that chef-client cannot extract node data via ohai, and therefore the node cannot be brought under policy via Chef, and so the node is effectively broken.
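To make codes like 80041002 easier to recognise in a backtrace, here is a minimal Ruby sketch of a lookup table. The constant WBEM_ERRORS and the helper describe_wbem_error are hypothetical names for illustration, not part of ohai or ruby-wmi. The table covers only a handful of well-known WBEM codes, and the descriptions are paraphrased.

```ruby
# Small lookup table of WMI (WBEM) HRESULT codes. Only a few well-known
# codes are listed; descriptions are paraphrased, not official wording.
WBEM_ERRORS = {
  0x80041001 => ['WBEM_E_FAILED', 'Call failed'],
  0x80041002 => ['WBEM_E_NOT_FOUND', 'Object cannot be found'],
  0x80041003 => ['WBEM_E_ACCESS_DENIED', 'Access denied'],
  0x8004100E => ['WBEM_E_INVALID_NAMESPACE', 'Namespace cannot be found'],
  0x80041010 => ['WBEM_E_INVALID_CLASS', 'Class is not valid']
}.freeze

# Hypothetical helper: accepts an Integer or a hex string such as
# "80041002" or "0x80041002" and returns a one-line description.
def describe_wbem_error(code)
  value = code.is_a?(Integer) ? code : Integer(code.to_s.sub(/\A0x/i, ''), 16)
  name, text = WBEM_ERRORS.fetch(value) { ['UNKNOWN', 'Not in this table'] }
  format('0x%08X %s: %s', value, name, text)
end

puts describe_wbem_error('80041002')
```

Any code that is not in the table is reported as UNKNOWN rather than guessed at.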

How to fix a missing or corrupted WMI repository

In order to fix a corrupted WMI repository, first run a disk check (chkdsk) to make sure that the corruption is not at the storage level. D’uh. Next, check that the WMI Control snap-in does indeed have difficulty in connecting to WMI; it should complain about ‘Failed to initialize all required WMI classes’:

WMI Control General – Failed to initialize all required WMI classes

Then pick one of the three kinds of fixes for broken WMI repositories: the simple, the sordid, or the scary:

  • Simple: run winmgmt /salvagerepository against the local WMI repository
  • Sordid: restore the WMI repository from a WMI backup (if you have one)
  • Scary: re-install Windows from the rescue disk

Fortunately, in our case, the simple scheme worked. There are full instructions for restoring a WMI repository on TechNet, but the essence is this:

First, we prevent the WMI service (winmgmt) from auto-starting, and then stop the service. Next, we run the winmgmt /salvagerepository command against the default WMI repo at %windir%\System32\wbem, which (here) successfully salvages the repository with the message ‘WMI repository has been salvaged’. Finally, we change the winmgmt service back to auto-start.
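The steps above can be sketched in Ruby as follows. This is an illustration under assumptions, not the exact commands we ran: salvage_commands and salvage_wmi_repository are hypothetical helper names, the sc and net invocations are the usual way to change a service's start type and stop it, and passing the repository path to winmgmt /salvagerepository follows the description above (check winmgmt /? on your own system). The sketch only prints the commands unless dry_run is switched off.

```ruby
# Default repository location named in the post. The %windir% variable is
# expanded by the Windows command shell, not by Ruby.
DEFAULT_WMI_REPO = '%windir%\System32\wbem'

# Hypothetical helper: returns the repair steps, in order, as command lines.
def salvage_commands(repo_path = DEFAULT_WMI_REPO)
  [
    'sc config winmgmt start= disabled',        # stop the service auto-starting
    'net stop winmgmt /y',                      # stop WMI and dependent services
    "winmgmt /salvagerepository #{repo_path}",  # attempt the salvage (path per the post)
    'sc config winmgmt start= auto'             # restore auto-start
  ]
end

# Prints each step; executes it only when dry_run is false.
# The final step (re-enabling auto-start) is attempted even if an
# earlier step fails. Returns true only if every step succeeded.
def salvage_wmi_repository(dry_run: true, repo_path: DEFAULT_WMI_REPO)
  *repair, restore = salvage_commands(repo_path)
  run = lambda do |cmd|
    puts cmd
    dry_run || system(cmd)
  end
  ok = repair.all? { |cmd| run.call(cmd) }
  restored = run.call(restore)
  ok && restored ? true : false
end

salvage_wmi_repository
```

Calling salvage_wmi_repository(dry_run: false) from an elevated prompt on the affected Windows machine would actually run the steps; the dry-run default is there so that the sequence can be reviewed first.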

How to monitor for a corrupted WMI repository

A corrupted WMI repository on a server node is a blocker for managing that node with Chef, because chef-client cannot run its ohai plugins (disabling ohai isn’t really the answer here). We therefore want to monitor servers for corrupted WMI repositories and be alerted ahead of a Chef run.

There is a useful but unsupported tool from Microsoft called WMIDiag 2.1, which can be run non-interactively:

cscript WMIdiag.vbs SILENT

This generates output something like this:

.5491 12:34:22 (1) !! ERROR: WMI CONNECTION errors occured for the following namespaces: ...... 53 ERROR(S)! 
.5492 12:34:22 (0) ** - ROOT/DEFAULT, 0x80041002 - (WBEM_E_NOT_FOUND) Object cannot be found. 
.5493 12:34:22 (0) ** - ROOT/CIMV2, 0x80041002 - (WBEM_E_NOT_FOUND) Object cannot be found. 
.5494 12:34:22 (0) ** - ROOT/SECURITY, 0x80041002 - (WBEM_E_NOT_FOUND) Object cannot be found.
...
.5611 12:34:22 (0) ** ERROR: WMIDiag detected issues that could prevent WMI to work properly!.  Check 'C:\SOMEPATH\WMIDIAG-V2.1_2K8R2.______2013.07.04_12.33.18.LOG' for details.

We can see the magic COM/OLE error 80041002 again, as reported by chef-client via the Ruby WMI library. When WMI errors are found, the script terminates with a non-zero exit code, which allows us to wire up this script to our Zabbix monitoring and report nightly on any machines which (for whatever reason) have a corrupted WMI repository. This gives us early feedback on Windows machines which are likely to fail a future chef-client run.
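The exit code is enough for a simple alert, but a report is more useful if it names the affected namespaces. Here is a minimal Ruby sketch that pulls them out of WMIDiag output. It assumes the log lines look like the sample above; NAMESPACE_ERROR and wmidiag_namespace_errors are hypothetical names, and this is not how our Zabbix check is actually implemented.

```ruby
# Matches namespace-level error lines such as:
#   .5492 12:34:22 (0) ** - ROOT/DEFAULT, 0x80041002 - (WBEM_E_NOT_FOUND) Object cannot be found.
# The pattern is an assumption based on the sample output in this post.
NAMESPACE_ERROR = /\*\* - (?<namespace>[^,]+), (?<code>0x\h{8}) - \((?<name>\w+)\)/

# Hypothetical helper: returns one hash per namespace error found in the text.
def wmidiag_namespace_errors(text)
  text.each_line.map do |line|
    m = NAMESPACE_ERROR.match(line)
    { namespace: m[:namespace], code: m[:code], name: m[:name] } if m
  end.compact
end

# Sample lines copied from the WMIDiag output shown in this post.
sample = <<~LOG
  .5491 12:34:22 (1) !! ERROR: WMI CONNECTION errors occured for the following namespaces: ...... 53 ERROR(S)!
  .5492 12:34:22 (0) ** - ROOT/DEFAULT, 0x80041002 - (WBEM_E_NOT_FOUND) Object cannot be found.
  .5493 12:34:22 (0) ** - ROOT/CIMV2, 0x80041002 - (WBEM_E_NOT_FOUND) Object cannot be found.
  .5494 12:34:22 (0) ** - ROOT/SECURITY, 0x80041002 - (WBEM_E_NOT_FOUND) Object cannot be found.
LOG

errors = wmidiag_namespace_errors(sample)
puts "#{errors.length} namespace error(s): #{errors.map { |e| e[:namespace] }.join(', ')}"
```

Summary lines that do not name a namespace are ignored, so the result lists only the namespaces that WMIDiag could not connect to.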

Summary

Windows WMI repository corruption can cause chef-client runs to fail due to missing WMI classes. The WMI repo can be repaired using winmgmt /salvagerepository, and the WMI errors can be monitored using the WMIDiag script to alert on WMI repository corruption before future chef-client runs.
