Fixing Hiccups in The SCVMM2008R2 GUI & Database


As you might very well know by experience sometimes the System Center Virtual Machine Manager GUI and database get out of sync with reality about what’s going on for real on the cluster. I’ve blogged about this before in SCVMM 2008 R2 Phantom VM guests after Blue Screen and in System Center Virtual Machine Manager 2008 R2 Error 12711 & The cluster group could not be found (0×1395)

The Issue

Recently I had to trouble shoot the “Missing” status of some virtual machines on a Hyper-V cluster in SCVMM2008R2. Rebooting the hosts, guests, restarting agents, … none of the usual tricks for this behavior seemed to do the trick. The SCVMM2008R2 installation was also fully up to date with service packs & patches so there the issue dot originate.

Repair was greyed out and was no use. We could have removed the host from SCVMM en add it again. That resets the database entries for that host en can help fix the issues but still is not guaranteed to work and you don’t learn what the root cause or solution is. But none of our usual tricks worked.We could have deleted the VMs from the database as in  but we didn’t have duplicates. Sure, this doesn’t delete any files or VM so it should show up again afterwards but why risk it not showing up again and having to go through fixing that.

The Cause

The VMs were in a “Missing” state after an attempted live migration during a manual patching cycle where the host was restarted the before the “start maintenance mode” had completed. A couple of those VMs where also Live Migrated at the same time with the Failover Cluster GUI. A bit of confusion al around so to speak nut luckily all VMs are fully operational an servicing applications & users so no crisis there.

The Fix

DISCLAIMER

I’m not telling you to use this method to fix this issue but you can at your own risk. As always please make sure you have good and verified backups of anything that’s of value to you Smile

We hade to investigate. The good news was that all VMs are up an running, there is no downtime at the moment and the cluster seems perfectly happy Smile.

But there we see the first clue. The Virtual machines on the cluster are not running on the node SCVMM thinks they are running, hence the “Missing” status.

First of all let’s find out what host the VM is really running on in the cluster and see what SCVMM thinks on what host the VM  is running. We run this little query against the VMM database. That gives us all hosts known to SCVMM.

SELECT [HostID],[ComputerName] FROM [VMM].[dbo].[tbl_ADHC_Host]

HostID                                                                        ComputerName

559D0C84-59C3-4A0A-8446-3A6C43ABF618          node1.test.lab

540C2477-00C3-4388-9F1B-31DBADAD1D8C        node2.test.lab

40B109A2-9E6B-47BC-8FB5-748688BFC0DF         node3.test.lab

C2DA03CE-011D-45E3-A389-200A3E3ED62E        node4.test.lab

6FA4ABBA-6599-4C7A-B632-80449DB3C54C         node5.test.lab

C0CF479F-F742-4851-B340-ED33C25E2013          node6.test.lab

D2639875-603F-4F49-B498-F7183444120A             node7.test.lab

CE119AAC-CF7E-4207-BE0B-03AAE0371165         node8.test.lab

AB07E1C2-B123-4AF5-922B-82F77C5885A2           node9.test.lab

(9 row(s) affected)

Voila en now the fun starts. SCVMM GUI tells us “MissingVM” is missing on node4.

We check this in the database to confirm:

SELECT Name, ObjectState, HostId
FROM VMM.dbo.tbl_WLC_VObject
WHERE Name = 'MissingVM'
GO

Which is indeed node4

Name                                                                                                                                                                                                                                                             ObjectState HostId

———  —  ————————————

node4  220  C2DA03CE-011D-45E3-A389-200A3E3ED62E

(1 row(s) affected)


In SCVMM we see that the moving of the VM failed. Between node 4 and node 6.

image

Now let’s take a look at what the cluster thinks … yes there it is running happily on node 6 and not on node 4. There’s the mismatch causing the issue.

So we need to fix this. We can Live Migrate the VM with the Failover Cluster GUI to the node SCVMM thinks the VM still resides on and see if that fixes it. If it does, great! You have to give SCVMM some time to detect all things and update its records.

But what to do if it doesn’t work out?  We can get the HostId from the node where the VM is really running in the cluster, which we can see in the Failover Cluster GUI, from the query we ran above and than update the record:

UPDATE VMM.dbo.tbl_WLC_VObject
SET HostId  = 'C0CF479F-F742-4851-B340-ED33C25E2013'
WHERE Name = 'MissingVM'
GO

We then reset the ObjectState to 0 to get rid of the Missing status. It would do this automatically but it takes a while.

UPDATE VMM.dbo.tbl_WLC_VObject
SET ObjectState = '0'
WHERE Name = 'MissingVM'
GO

After some patience & Refreshing all is well again and test with live migrations proves that all works again.

As I said before people get creative in how to achieve things due to inconsistencies, differences in functionality between Hyper-V Manager, Failover Cluster Manager and SCVMM 2008R2 can lead to some confusing situations. I’m happy to see that in Windows 8 the action you should perform using the Failover Cluster GUI or PowerShell are blocked in Hyper-V Manager. But SCVMM really needs a “reset” button that makes it check & validate that what it thinks is reality.

Upgrading Hyper-V Cluster Nodes to Windows Server 2012 (Beta) – Part 3


This is a multipart series based on some lab test & work I did.

  1. Part 1 Upgrading Hyper-V Cluster Nodes to Windows Server 2012 (Beta) – Part 1
  2. Part 2 Upgrading Hyper-V Cluster Nodes to Windows Server 2012 (Beta) – Part 2
  3. Part 3 Upgrading Hyper-V Cluster Nodes to Windows Server 2012 (Beta) – Part 3

And we have arrived at part three of my adventures while “transitioning” my Hyper-V cluster nodes to Windows Server 2012. I prefer the term transition as is more correct. We can still not do a rolling upgrade a cluster cluster. We still need to create a new cluster and recuperate the evicted nodes.

I’ll repeat myself here (again) by stating I did not reinstall the evicted nodes but upgraded them. Why, because I can and I wanted to try it out and see what happens. For production purposes I do advise you to rebuild nodes from scratch using a well defined and automated plan if possible. I already mentioned this in Upgrading Hyper-V Cluster Nodes to Windows Server 2012 (Beta) – Part 1

Moving the Storage & Hyper-V Guests

So we stopped Part 2 at a newly created cluster without any storage. That’s what we’ll be taking care of in this part.  Let’s recap what we already mentioned at the end of Part 2.

We have several options for storage here. We could assign new storage but we cannot do a Quick Storage Migration between cluster using SCVMM2008R2 but that doesn’t fly as SCVMM2008R2 can’t manage Windows 2012 clusters and I don’t know if it ever will. We can do a good old manual or scripted export to and import from the new storage of the VMs what takes a considerable amount of time. You also need to have the extra storage available.

We can also recuperate the old storage with the VMs still on there. This could get tricky as no two cluster should be able to see & use the storage at the same time. The benefit could be that we can just use the import type in Windows Server 2012 “Register the virtual machine in-place” (use the existing unique ID) and be done with it. We’ll try that one. We’ll still have some down time but it should be pretty fast. It’s only from Windows Server 2012 on that we’ll be able to do Shared Nothing Live Migrations between clusters Smile and live will be good. If you have a SAN you could also use clones to get this job done without less risk. You work on cloned data and keep the original around instead of using that for the process described below.

So how do we approach this?

Since Windows Server 2008 storage & clustering isn’t the pain it could be in earlier version. It’s the disk manager handling all that and it makes live a lot easier. All disks presented to a cluster node are off line to the operating system until you bring it online. Even if it contains data or is presented to another host, whether that is a member of another cluster or a stand alone host. Pretty cool. It also means you can have all your nodes on line during the process. The process of bringing the disk online and, if needed formatting it with NTFS and then adding it to the cluster as storage can be done on just one of the nodes.

As you recall I unplugged the evicted node from the iSCSI storage (you could also disable the ports) before I upgraded it. The entire iSCSI configuration got upgraded perfectly so all I needed to do was plug the iSCSI cables in and the storage appeared offline. My old cluster node was up and running still accessing it. Pretty slick! And great as a demo but you can play it safer. That was fun Smile but perhaps we won’t be that brave in a production environment.

Options

You could decide to bring all LUNS over at once or one at the time. The process is the same. If you do it one by one you’ll have to rely on the above behavior to protect the LUNs against corruption or you can un-present the LUNS remaining on the old cluster from the new cluster so you’ll never have an issue. We’ve done both and it works out rather fine in testing. Windows clustering is really doing it’s best to prevent you from shooting yourself in the foot Smile

Let’s say I go LUN by LUN. Now I can just remove the VMs from the old cluster using the Failover cluster GUI so they are no longer highly available on that node. When I have no more clustered VMs on a CSV LUN I can shut down all the guests in Hyper-V Manager and stop right there.

On the old cluster I remove that LUN from the CSV storage and from the cluster storage. At that moment that LUN is already taken offline for you!

image

Pardon the silly size but I didn’t have space left to make a realistic screenshot Smile

Great, Windows is protecting us against any possible data corruption! So now I can than un-present the LUN form the old cluster nodes. The next step is to enable the ISCI ports, present that LUN to the new cluster node or nodes (depends on where in the x number of node process you are) or just plug in the cable .

You’ll see the new LUN off line than on the new cluster. We can than make the LUN on line so it will be available to add to the cluster. Just right click that disk and select “Online”.

image

image

 

Right click on storage

image

 

Select an disk that’s available to add to the cluster.

02

 

Things has gotten a lot simpler with CSV in Windows Server 2012. No more enabling it with a funky warning message that’s well meant but is rather confusing an annoying. You just right click the disk and choose “Add to Cluster Shared Volumes” and that’s it.

image

 

And there it is. That disk in our new cluster is ready to use as a CSV.

image

 

So we can now us a nifty new capability in Windows Server 2012 Hyper-V: “Register the virtual machine in-place” (use the existing unique ID)

05

 

The wizard starts.

06

 

Select the folder where your VM or VMs live. yes you can do multiple given that your folder structure allows for this.

07

 

It’s found one VM in our folder

08

 

We click Next

10

 

We select “Register the virtual machine in-place” (use the existing unique ID) and click next.

11

If something is not right like some forgotten “saved” states you’ll get a change to dump those or cancel the process to deal with it properly before trying it again.

12

 

If virtual network names do not match you’ll get the opportunity to set correct that by specifying what virtual switch to use.

13

 

If all was well in the first place or after you’ve fixed any issues like the ones demonstrated above you’re good to go. Click finish and enjoy your Windows Server 2012 Hyper-V Guest.

18

 

At this point you can already start your VMs. I know that the next step is to make all these VMs highly available but here we have some good news as well. You can now make running VMs highly available. Yeah! They no longer need to be shut down. All this is done via the well know process so I’m not going to walk trough the entire process here. But the screen shot of a making a running VM highly available is worth posting Smile

addrunningvm

Upgrading Hyper-V Cluster Nodes to Windows 8 (Beta) – Part 2


This is a multipart series based on some lab test & work I did.

  1. Part 1 Upgrading Hyper-V Cluster Nodes to Windows Server 8 (Beta) – Part 1
  2. Part 2 Upgrading Hyper-V Cluster Nodes to Windows 8 (Beta) – Part 2
  3. Part 3 Upgrading Hyper-V Cluster Nodes to Windows 8 (Beta) – Part 3

Here’s part two of my adventures while upgrading or rather “transitioning” my Hyper-V cluster nodes to Windows 8. Transition is more correctly as you can not upgrade a cluster, you create a new cluster en recuperate the node. I did however not reinstall them but upgrade them. Why, because I can and I wanted to try it out to see what happens. For production purposes I do advise you to rebuild nodes from scratch using a well defined and automated plan if possible. I already mentioned this in Upgrading Hyper-V Cluster Nodes to Windows Server 8 (Beta) – Part 1

So we stopped Part 1 with a evicted and upgraded node. We’ll want to create a new cluster with that node and then transition the other nodes over to the new Windows 8 cluster one by one, or in batches, depending on how many you can afford to take down at one time. In this part we’ll just build our new Window 8 cluster with a single node. It’s a good thing this is possible as we can start a transition with just one node. This an easy part.

First of all we create a new cluster. I will all look very familiar if you’ve ever created a Windows 2008 (R2) cluster.

image

 

The Create Cluster Wizard appears, read all the advice you want and click “Next”

02

 

We select the node that we evicted from the old cluster and upgraded to Windows 8

03

04

 

You now run the validation test for your cluster

06

Let’s run ‘m all and see what it has to say.

07

 

We get a summary of what notes will be tested and what tests will be run. Click “Next”

08

 

The tests are running.

09

 

We get a pass with some warnings. So we click “View Report” to take a look. It’s OK we only have one node, we don’t have storage yet and networking wise we still need to configure some things but we can create a one node cluster, So click “Finish”

10

image

 

I named my new cluster “warriors”, the old one was called “warrior”.

12

 

I define the IP Address for the Access Point for administering the cluster

13

 

We’re ready to create the cluster so we click “Next” and the creation process starts

15

16

 

And we’re informed we’ve have successfully created a cluster. Click Finish. Any experienced cluster builder should find this process very familiar without surprises.

17

 

So now we have a cluster existing out of one node and we haven’t got any storage assigned yet.

We have several options for storage here. We could assign new storage but we cannot do a Quick Storage Migration between cluster using SCVMM2008R2 but that doesn’t fly as SCVMM2008R2 can’t manage Windows 8 clusters and I don’t know if it ever will.  We can do a good old manual or scripted export and import of the VMs what takes a considerable amount of time.

We can recuperate the old storage with the VMs still on there. This could get tricky as no two cluster should be able to see & use the storage at the same time. The benefit could be that we can just use the import type in Windows 8 ("Register the virtual machine in-place" (use the existing unique ID) and be done with it. We’ll try that one. We’ll still have some down time but it should be pretty fast. It’s only from Windows 8 on that we’ll be able to do Shared Nothing Live Migrations between clusters Smile We’ll address that in Part 3.

Integration Services Version Check Via Hyper-V Integration/Admin Event Log


I’ve written before (see "Key Value Pair Exchange WMI Component Property GuestIntrinsicExchangeItems & Assumptions") on the need to & ways with PowerShell to determine the version of the integration services or integration components running in your guests. These need to be in sync with the one running on the hosts. Meaning that all the hosts in a cluster should be running the same version as well as the guests.

During an upgrade with a service pack this get the necessary attention and scripts (PowerShell) are written to check versions and create reports and normally you end up with a pretty consistent cluster. Over time virtual machines are imported, inherited from another cluster of created on a test/developer host and shipped to production. I know, I know, this isn’t something that should happen, but I don’t always have the luxury of working in a perfect world.

Enough said. This means you might end up with guests that are not running the most recent version of the integration tools. Apart from checking manually in the guest (which is tedious, see my blog "Upgrading a Hyper-V R2 Cluster to Windows 2008 R2 SP1" on how to do this) or running previously mentioned script you can also check the Hyper-V event log.

Another way to spot virtual machines that might not have the most recent version of the integration tools is via the Hyper-V logs. In Server Manager you drill down in the “Diagnostics” to, “Event Viewer” and than navigate your way through  "Applications and Services Logs", "Microsoft", "Windows" until you hit “Hyper-V-Integration

image

Take a closer look and you’ll see the warning about 2 guests having an older version of the integration tools installed.

image

As you can see it records a warning for every virtual machine whose integration services are older than the host running Hyper-V. This makes it easy to grab a list of guest needing some attention. The down side is that you need to check all hosts, not to bad for a small cluster but not very efficient on the larger ones.

So just remember this as another way to spot virtual machines that might not have the most recent version of the integration tools. It’s not a replacement for some cool PowerShell scripting or the BPA tools, but it is a handy quick way to check the version for all the guests on a host when you’re in a hurry.

It might be nice if integration services version management becomes easier in the future. Meaning a built-in way to report on the versions in the guests and an easier way to deploy these automatically if there not part of a service pack (this is the case when the guest OS and the host OS differ or when you can’t install the SP in the guest for some application compatibility reason). You can do this in bulk using SCVMM and of cause Scripting this with PowerShell comes to the rescue here again, especially when dealing with hundreds of virtual machines in multiple large clusters. Orchestration via System Center Orchestrator can also be used. Integration with WSUS would be another nice option, for those that don’t have Configuration Manager or Orchestrator but that’s not supported as far as I know for now.

Know What Receive Side Scaling (RSS) Is For Better Decisions With Windows 8


Introduction

As I mentioned in an introduction post Thinking About Windows 8 Server & Hyper-V 3.0 Network Performance there will be a lot of options and design decisions to be made in the networking area, especially with Hyper-V 3.0. When we’ll be discussing DVMQ (see DMVQ In Windows 8 Hyper-V), SR-IOV in Windows 8 (or VMQ/VMDq in Windows 2008 R2) and other network features with their benefits, drawbacks and requirements it helps to know what Receive Side Scaling (RSS) is. Chances are you know it better than the other mentioned optimizations. After all it’s been around longer than VMQ or SR-IOV and it’s beneficial to other workloads than virtualization. So even if you’re a “hardware only for my servers” die hard kind of person you can already be familiar with it. Perhaps you even "dislike” it because when the Scalable Networking Pack came out for Windows  2003 it wasn’t such a trouble free & happy experience. This was due to incompatibilities with a lot of the NIC drivers and it wasn’t fixed very fast. This means the internet is loaded with posts on how to disable RSS & the offload settings on which it depends. This was done to get stability or performance back for application servers like Exchange and others applications or services.

The Case for RSS

But since Windows 2008 these days are over. RSS is a great technology that gets you a lot better usage of out of your network bandwidth and your server. Not using RSS means that you’ll buy extra servers to handle the same workload. That wastes both energy and money. So how does RSS achieve this? Well without RSS all the interrupt from a NIC go to the same CPU/Core in multicore processors (Core 0).  In Task Manager that looks not unlike the picture below:

image

Now for a while the increase in CPU power kept the negative effects at bay for a lot of us in the 1Gbps era. But now, with 10Gbps becoming more common every day, that’s no longer the case. That core will become the bottle neck as that poor logical CPU will be running at 100%, processing as much network interrupts in can handle, while the other logical CPU only have to deal with the other workloads. You might never see more than 3.5Gbps of bandwidth being used if you don’t use RSS. The CPU core just can’t keep up. When you use RSS the processing of those interrupts is distributed across al cores.

With Windows 2008 and Windows 2008 R2 and Windows 8 RSS is enabled by default in the operating system. Your NIC needs to support it and in that case you’ll be able to disable or enable it. Often you’ll get some advanced features (illustrated below) with the better NICs on the market. You’ll be able to set the base processor, the number of processors to use, the number of queues etc. That way you can divide the cores up amongst multiple NICs and/or tie NICs to specific cores.

image

image

So you can get fancy if needed and tweak the settings if needed for multi NIC systems. You can experiment with the best setting for your needs, follow the vendors defaults (Intel for example has different workload profiles for their NICs) or read up on what particular applications require for best performance.

Information On How To Make It Work

For more information on tweaking RSS you take a look at the following document http://msdn.microsoft.com/en-us/windows/hardware/gg463392. It holds a lot more information than just RSS in various scenarios so it’s a useful document for more than just this.

Another good guide is the "Networking Deployment Guide: Deploying High-Speed Networking Features". Those docs are about Windows 2008 R2 but they do have good information on RSS.

If you notice that RSS is correctly configured but it doesn’t seem to work for you it’ might be time to check up on the other adaptor offloads like TCP Checksum Offload, Large Send Offload etc. These also get turned of a lot when trouble shooting performance or reliability issues but RSS depends on them to work. If turned off, this could be the reason RSS is not working for you..

Benign GUI Cosmetic Bug in Failover Cluster Manager (UseMnemonic Property)


Here’s a little issue I‘ve run into with using an ampersand (&) in the naming of the networks in the Failover Cluster Manager.

As you can see the name “Heart Beat & CSV” shows up correctly in the left side navigation pane. In the management pane is show up as “Heart Beat _CSV”.

image

So me being a bit an old scripter / VBA / VB developer I have seen this before and I try what I know to do from that far away, long ago and dusky part of my IT history: type in double ampersands (&&). The good old UseMnemonic Property for you in the know Winking smile VB & VBA devies wanting to display an & on a button, label etc. will know this trick of using && to really display a & as a single & indicates an action. But I digress.

image

So as you can see it’s fixed in the management pane but now you end up with double ampersands in the left navigation pane.

image

And then it also shows up with double ampersands in PowerShell

image

This is one for the GUI team to fix I guess. Perhaps the UseMnemonic Property is set to false in the navigation pane label and to true in the management pane label. So far my frivolous reporting on benign GUI cosmetics bugs in Windows 2008 R2 SP1 Open-mouthed smile We’ll be resuming our more serious blog posts in the very near future.

Direct Connect iSCSI Storage To Hyper-V Guest Benefits From VMQ & Jumbo Frames


As I was preparing a presentation on Hyper-V cluster high available & high performance networking by, you guessed it, presenting it. During that presentation I mentioned Jumbo Frames & VMQ (VMDq in Intel speak)  for the virtual machine, Live Migration and CSV network. Jumbo frames are rather well know nowadays but VMQ is still something people have read about, at best have tinkered with, but no many are using it in production.

One of the reason for this that it isn’t explained and documented very well. You can find some decent explanation on what it is and does for you but that’s about it. The implementation information is woefully inadequate and, as with many advanced network features, there are many hiccups and intricacies. But that’s a subject for another blog post. I need some more input from Intel and or MSFT before I can finish that one.

Someone stated/asked that they knew that Jumbo frames are good for throughput on iSCSI networks and as such would also be beneficial to iSCSI networks provided to the virtual machines. But how about VMQ? Does that do anything at all for IP based storage. Yes it does. As a matter of fact It’s highly recommend by MSFT IT in one of their TechEd 2010 USA presentations on Hyper-V and storage.

So yes enable VMQ on both NIC ports used for iSCSI to the guest. Ideally these are two dedicated NICs connected to two separate switches to avoid a single point of failure. You do not need to team these on the host or have Multiple Path I/O (MPIO) running for this mat the parent level. The MPIO part is done in the virtual machines guests themselves as that’s where the iSCSI initiator lives with direct connect. And to address the question that followed, you can also use Multiple Connections per Session (MCS) in the guest if your storage device supports this but I must admit I have not seen this used in the wild. And then, finally coming to the point, both MPIO and MCS work transparently with Jumbo Frames and VMQ. So you’re good to go Smile