NIC Teaming in Windows 8 & Hyper-V


One of the many new features in Windows 8 is native NIC Teaming or Load Balancing and Fail Over (LBFO). This is, amongst many others, a most welcome and long awaited improvement. Now that Microsoft has published a great whitepaper (see the link at the end) on this it’s time to publish this post that has been simmering in my drafts for too long. Most of us dealing with NIC teaming in Windows have a lot of stories to tell about incompatible modes depending on the type of teaming, vendors and what other advanced networking features you use.  Combined with the fact that this is a moving target due to a constant trickle of driver & firmware updates to rid of us bugs or add support for features. This means that what works and what doesn’t changes over time. So you have to keep an eye on this. And then we haven’t even mentioned whether it is supported or not and the hassle & risk involved with updating a driver Smile

When it works it rocks and provides great benefits (if not it would have been dead). But it has not always been a very nice story. Not for Microsoft, not for the NIC vendors and not for us IT Pros. Everyone wants things to be better and finally it has happened!

Windows 8 NIC Teaming

Windows 8 brings in box NIC Teaming, also know as Load Balancing and Fail Over (LBFO), with full Microsoft support. This makes me happy as a user. It makes the NIC vendors happy to get out of needing to supply & support LBOF. And it makes Microsoft happy because it was a long missing feature in Windows that made things more complex and error prone than they needed to be.

So what do we get form Windows NIC Teaming

  • It works both in the parent & in the guest. This comes in handy, read on!

image

  • No need for anything else but NICs and Windows 8, that’s it. No 3rd party drivers software needed.
  • A nice and simple GUI to configure & mange it.
  • Full PowerShell support for the above as well so you can automate it for rapid & consistent deployment.
  • Different NIC vendors are supported in the same team.  You can create teams with different NIC vendors in the same host. You can also use different NIC across hosts. This is important for Hyper-V clustering & you don’t want to be forced to use the same NICs everywhere. On top of that you can live migrate transparently between servers that have different NIC vendor setups. The fact that Windows 8 abstracts this all for you is just great and give us a lot more options & flexibility.
  • Depending on the switches you have it supports a number of teaming modes:
    • Switch Independent:  This uses algorithms that do not require the switch to participate in the teaming. This means the switch doesn’t care about what NICs are involved in the teaming and that those teamed NICS can be connected to different switches. The benefit of this is that you can use multiple switches for fault tolerance without any special requirements like stacking.
    • Switch Dependent: Here the switch is involved in the teaming. As a result this requires all the NICs in the team to be connected to the same switch unless you have stackable switches. In this mode network traffic travels at the combined bandwidth of the team members which acts as a as a single pipeline.There are two variations supported.
      1. Static (IEEE 802.3ad) or Generic: The configuration on the switch and on the server identify which links make up the team. This is a static configuration with no extra intelligence in the form of protocols assisting in the detection of problems (port down, bad cable or misconfigurations).
      2. LACP (IEEE 802.1ax, also known as dynamic teaming). This leverages the Link Aggregation Control Protocol on the switch to dynamically identify links between the computer and a specific switch. This can be useful to automatically reconfigure a team when issues arise with a port, cable or a team member.
  • There are 2 load balancing options:
    1. Hyper-V Port: Virtual machines have independent MAC addresses which can be used to load balance traffic. The switch sees a specific source MAC addresses connected to only one connected network adapter, so it can and will balance the egress traffic (from the switch) to the computer over multiple links, based on the destination MAC address for the virtual machine. This is very useful when using Dynamic Virtual Machine Queues. However, this mode might not be specific enough to get a well-balanced distribution if you don’t have many virtual machines. It also limits a single virtual machine to the bandwidth that is available on a single network adapter. Windows Server 8 Beta uses the Hyper-V switch port as the identifier rather than the source MAC address. This is because a virtual machine might be using more than one MAC address on a switch port.
    2. Address Hash: A hash (there a different types, see the white paper mention at the end for details on this) is created based on components of the packet. All packets with that hash value are assigned to one of the available network adapters. The result is that all traffic from the same TCP stream stays on the same network adapter. Another stream will go to another NIC team member, and so on. So this is how you get load balancing. As of yet there is no smart or adaptive load balancing available that make sure the load balancing is optimized by monitoring distribution of traffic and reassigning streams when beneficial.

Here a nice overview table from the whitepaper:

image

Microsoft stated that this covers the most requested types of NIC teaming but that vendors are still capable & allowed to offer their own versions, like they have offered for many years, when they find that might have added value.

Side Note

I wonder how all this is relates/works with to Windows NLB, not just on a host but also in a virtual machine in combination with windows NIC teaming in the host (let alone the guest). I already noticed that Windows NLB doesn’t seem to work if you use Network Virtualization in Windows 8. That combined with the fact there is not much news on any improvements in WNLB (it sure could use some extra features and service monitoring intelligence) I can’t really advise customers to use it any more if they want to future proof their solutions. The Exchange team already went that path 2 years ago. Luckily there are some very affordable & quality solution out there. Kemp Technologies come to mind.

  • Scalability.You can have up to 32 NIC in a single team. Yes those monster setups do exist and it provides for a nice margin to deal with future needs Smile
  • There is no THEORETICAL limit on how many virtual interfaces you can create on a team. This sounds reasonable as otherwise having an 8 or 16 member NIC team makes no sense. But let’s keep it real, there are other limits across the stack in Windows, but you should be able to get up to at least 64 interfaces generally. Use your common sense. If you couldn’t put 100 virtual machines in your environment on just two 1Gbps NICs due to bandwidth concerns & performance reasons you also shouldn’t do that on two teamed 1Gbps NICs either.
  • You can mix NIC of different speeds in the same team. Mind you, this is not necessarily a good idea. The best option is to use NICs of the same speed. Due to failover and load balancing needs and the fact you’d like some predictability in a production environment. In the lab this can be handy when you need to test things out or when you’d rather have this than no redundancy.

Things to keep in mind

SR-IOV & NIC teaming

Once you team NICs they do not expose SR-IOV on top of that. Meaning that if you want to use SR-IOV and need resilience for your network you’ll need to do the teaming in the guest. See the drawing higher up. This is fully supported and works fine. It’s not the easiest option to manage as it’s on a per guest basis instead of just on the host but the tip here is using the NIC Teaming UI on a host to manage the VM teams at the same time.  Just add the virtual machines to the list of managed servers.

image

Do note that teams created in a virtual machine can only run in Switch Independent configuration, Address Hash distribution mode. Only teams where each of the team members is connected to a different Hyper-V switch are supported. Which is very logical, as the picture below demonstrates, because you won’t have a redundant solution.

image

Security Features & Policies Break SR-IOV

Also note that any advanced feature like security policies on the (virtual) switch will disable SR-IOV, it has to or SR-IOV could be used as an effective security bypass mechanism. So beware of this when you notice that SR-IOV doesn’t seem to be working.

RDMA & NIC Teaming Do Not Mix

Now you need also to be aware of the fact that RDMA requires that each NIC has a unique IP addresses. This excludes NIC teaming being used with RDMA. So in order to get more bandwidth than one RDMA NIC can provide you’ll need to rely on Multichannel. But that’s not bad news.

TCP Chimney

TCP Chimney is not supported with network adapter teaming in Windows Server “8” Beta. This might change but I don’t know of any plans.

Don’t Go Overboard

Note that you can’t team teamed NIC whether it is in the host or parent or in virtual machines itself. There is also no support for using Windows NIC teaming to team two teams created with 3rd party (Intel or Broadcom) solutions. So don’t stack teams on top of each

Overview of Supported / Not Supported Features With Windows NIC Teaming

image

Conclusion

There is a lot more to talk about and a lot more to be tested and learned. I hope to get some more labs going and run some tests to see how things all fit together. The aim of my tests is to be ready for prime time when Windows 8 goes RTM. But buyer beware, this is  still “just” Beta material.

For more information please download the excellent whitepaper NIC Teaming (LBFO) in Windows Server "8" Beta

Integration Services Version Check Via Hyper-V Integration/Admin Event Log


I’ve written before (see "Key Value Pair Exchange WMI Component Property GuestIntrinsicExchangeItems & Assumptions") on the need to & ways with PowerShell to determine the version of the integration services or integration components running in your guests. These need to be in sync with the one running on the hosts. Meaning that all the hosts in a cluster should be running the same version as well as the guests.

During an upgrade with a service pack this get the necessary attention and scripts (PowerShell) are written to check versions and create reports and normally you end up with a pretty consistent cluster. Over time virtual machines are imported, inherited from another cluster of created on a test/developer host and shipped to production. I know, I know, this isn’t something that should happen, but I don’t always have the luxury of working in a perfect world.

Enough said. This means you might end up with guests that are not running the most recent version of the integration tools. Apart from checking manually in the guest (which is tedious, see my blog "Upgrading a Hyper-V R2 Cluster to Windows 2008 R2 SP1" on how to do this) or running previously mentioned script you can also check the Hyper-V event log.

Another way to spot virtual machines that might not have the most recent version of the integration tools is via the Hyper-V logs. In Server Manager you drill down in the “Diagnostics” to, “Event Viewer” and than navigate your way through  "Applications and Services Logs", "Microsoft", "Windows" until you hit “Hyper-V-Integration

image

Take a closer look and you’ll see the warning about 2 guests having an older version of the integration tools installed.

image

As you can see it records a warning for every virtual machine whose integration services are older than the host running Hyper-V. This makes it easy to grab a list of guest needing some attention. The down side is that you need to check all hosts, not to bad for a small cluster but not very efficient on the larger ones.

So just remember this as another way to spot virtual machines that might not have the most recent version of the integration tools. It’s not a replacement for some cool PowerShell scripting or the BPA tools, but it is a handy quick way to check the version for all the guests on a host when you’re in a hurry.

It might be nice if integration services version management becomes easier in the future. Meaning a built-in way to report on the versions in the guests and an easier way to deploy these automatically if there not part of a service pack (this is the case when the guest OS and the host OS differ or when you can’t install the SP in the guest for some application compatibility reason). You can do this in bulk using SCVMM and of cause Scripting this with PowerShell comes to the rescue here again, especially when dealing with hundreds of virtual machines in multiple large clusters. Orchestration via System Center Orchestrator can also be used. Integration with WSUS would be another nice option, for those that don’t have Configuration Manager or Orchestrator but that’s not supported as far as I know for now.

Anti Virus & Hyper-V Reloaded


The anti virus industry is both a blessing and a curse.  They protect us from a whole lot of security threats and at the same time they make us pay dearly for their mistakes or failures. Apart from those issues themselves this is aggravated that management does not see the protection it provides on a daily basis. Management only notices anti virus when things go wrong, when they lose productivity and money. And frankly when you consider scenarios like this one …

Hi boss, yes, I know we spent a 1.5 million Euros on our virtualization projects and it’s fully redundant to protect our livelihood. Unfortunately the anti virus product crashed the clusters so we’re out of business for the next 24 hours, at least.

… I can’t blame them for being a bit grumpy about it.

Recently some colleagues & partners in IT got bitten once again by McAfee with one of there patches (8.8 Patch 1 and 8.7 Patch 5). These have caused a lot of BSOD reports and they put the CSVs on Hyper-V clusters into redirected mode (https://kc.mcafee.com/corporate/index?page=content&id=KB73596). Sigh. As you can read here for the redirected mode issue they are telling us Microsoft will have to provide a hotfix. Now all anti virus vendors have their issue but McAfee has had too many issues for to long now.  I had hoped that Intel buying them would have helped with quality assurance but it clearly did not. This only makes me hope that whatever protection against malware is going to built into the hardware will be of a lot better quality as we don’t need our hardware destroying our servers and client devices. We’re also no very happy with the prospect or rolling out firmware & BIOS updates at the rate and with the risk of current anti virus products.

Aidan Finn has written before about the balance between risk & high availability when it comes to putting anti virus on Hyper-V cluster hosts and I concur with his approach:

  • When you do it pay attention to the exclusion & configuration requirements
  • Manage those host very carefully, don’t slap on just any update/patches and this includes anti virus products of cause

I’m have a Masters in biology from they days before I went head over heals into the IT business. From that background I’ve taken my approach to defending against malware. You have to make a judgment call, weighing all the options with their pros and cons. Compare this to vaccines/inoculations to protect the majority of your population. You don’t have to get a 100% complete coverage to be successful in containing an outbreak. Just a sufficiently large enough part including your most vulnerable and most at risk population. Excluding the Hyper-V hosts from mandatory anti virus fits this bill. Will you have 100% success, always? Forget it. There is no such thing.

Direct Connect iSCSI Storage To Hyper-V Guest Benefits From VMQ & Jumbo Frames


As I was preparing a presentation on Hyper-V cluster high available & high performance networking by, you guessed it, presenting it. During that presentation I mentioned Jumbo Frames & VMQ (VMDq in Intel speak)  for the virtual machine, Live Migration and CSV network. Jumbo frames are rather well know nowadays but VMQ is still something people have read about, at best have tinkered with, but no many are using it in production.

One of the reason for this that it isn’t explained and documented very well. You can find some decent explanation on what it is and does for you but that’s about it. The implementation information is woefully inadequate and, as with many advanced network features, there are many hiccups and intricacies. But that’s a subject for another blog post. I need some more input from Intel and or MSFT before I can finish that one.

Someone stated/asked that they knew that Jumbo frames are good for throughput on iSCSI networks and as such would also be beneficial to iSCSI networks provided to the virtual machines. But how about VMQ? Does that do anything at all for IP based storage. Yes it does. As a matter of fact It’s highly recommend by MSFT IT in one of their TechEd 2010 USA presentations on Hyper-V and storage.

So yes enable VMQ on both NIC ports used for iSCSI to the guest. Ideally these are two dedicated NICs connected to two separate switches to avoid a single point of failure. You do not need to team these on the host or have Multiple Path I/O (MPIO) running for this mat the parent level. The MPIO part is done in the virtual machines guests themselves as that’s where the iSCSI initiator lives with direct connect. And to address the question that followed, you can also use Multiple Connections per Session (MCS) in the guest if your storage device supports this but I must admit I have not seen this used in the wild. And then, finally coming to the point, both MPIO and MCS work transparently with Jumbo Frames and VMQ. So you’re good to go Smile

Using Host Names in IIS in Combination with a KEMP LoadMaster


At a client the change over of a web site from old servers to new ones lead to the investigation of an issue with the hardware load balancer. Since that web site is related to an existing surveyors solutions suite that already had a KEMP LoadMaster 2200 in use the figured we’d also use it for the web site and no longer use WNLB.

Now the original web site had multiple DNS entries and host header names defined in IIS (see Configure a Host Header for a Web Site (IIS 7)) . Host header names in IIS allow you to host multiple web sites on an IIS server using the same IP address and port. A small added security benefit is that surfing on IP address fails which means we marginally disrupt some script kiddies & get an extra security checkbox marked during an audit Winking smile.

In our example we needed:

Note: The real names have been changed as well as the reasons why as this has some business & historical justifications that don’t matter here.

ntrip.surveyor.lab needs to be handled by the load balanced web servers in the solution. The http://www.surveyor.lab needs to be redirected to another web server to keep the business happy. However for political reasons we have to keep the DNS record for http://www.surveyor.lab pointing to the load balanced servers, i.e. the load master VIP.

Now without host names IIS al worked fine until we wanted to use HTTP redirect. As the web site is the same IP address for both names we either redirected them both or none. To fix this we needed two sites in IIS. The real one hosting ntrip.surveyor.lab and a “fake” one hosting the http://www.surveyor.lab that we want to redirect. Well as both are hosted on the same IP address and port on the IIS server we need to use host names. But then the sites became unavailable.

When checking the LoadMaster configuration, the virtual service for the web servers seemed well.

image

Is this a limitation of hardware load balancing or this specific Loadmaster? Some searching on the internet made it look like I was about the only on on the planet dealing with this issue so no help there.

Kemp Support Rocks

I already knew this but this experience reaffirms it. KEMP Technologies really does care about their customers and are very fast & responsive. I threw a quick question on twitter to @KempTech on Twitter and they responded very fast with some pointers. After that I replied with some more details, they offered to take it on via other means as twitter has it limits. OK, no problems. The next morning I got an e-mail from one of their engineers (Ekkehard) with more information and a request for more input from our side. I quickly made a VISIO diagram of the current and the desired situation. Based on this he let me know this should work.

image

He asked for a copy of the configuration and already pointed to the solution:

And what exactly happens – does the RS turn “red” in the “View/Modify Services” view? That might be caused by the health check settings…
(Remember that a 302 is considered NOT ok, so you had to enter the proper check URL and or / HTTP1.1 hostname)

But at that moment I did not realize this yet. I saw no error or the real server turning red indicating it was down. So we went through the configuration and decided to test without forcing layer 7 to see what happened. This didn’t make a difference and it wasn’t really a solution if it had as we needed layer 7 and layer 7 transparency.

Ekkehard also noticed my firmware was getting rather old (don’t fix what isn’t broken Smile) and suggest an upgrade (5.1-24 to 5.1.-74). So I did, reboot and tested some more settings. To make sure I didn’t miss anything I threw a network sniffer (WireShark) against the issue. And guess what?  As soon as I added a host name to the IIS web site bindings I didn’t even get any request from my client on that server anymore. So it was definitely being stopped at the Loadmaster. Without it request from a client came through perfectly.  That was not IIS doing as with a host name nothing came into the server. So why would the LoadMaster stop traffic to a real server? Because it’s down, that’s why, just like Ekkehard has indicated in one of his mails but we didn’t see it then.

Better check again and sure enough, the health service told me the real servers are down. Hey … that’s new. Did the previous firmware not show this, or just slower? I can’t say for sure. It’s either me being to impatient, a hiccup, the firmware or premature dementia Confused smile

Root Cause

So what happens? The default health check uses HTTP 1.0. You can customize it with a path like  /owa or such but in essence it uses the IP address of the real server and guess what. With a Host header name in IIS that isn’t allowed other wise it can’t figure out what website you want to go to if you’re using this feature to run multiple sites on the same IP address and port. So we need to check the health based on host name. Can the LoadMaster do that for us? Yes it can!

The fix

You need to enable HTTP 1.1 and fill out the host name you want to use for health checking.  In our case that’s ntrip.surveyor.lab. That’s all there’s to it. Easy as can be if you know. And Ekkehard knew he indicated to this in his quoted mail above.

HTTP1 1host

 

Lessons Learned

So how did I not know this? Isn’t this documented? Sure enough on page  56 of the LoadMaster manual it says the following:

7  HTTP  The LoadMaster opens a TCP connection to the Real Server on the Service port (port80). The LoadMaster sends a HTTP/1.0 HEAD request the server, requesting the page ―/‖.  If the server sends a HTTP response with a status code of 2 (200-299, 301, 302, 401) the LoadMaster closes the connection and marks the server as active.  If the server fails to respond within the configured response time for the configured number of times or if it responds with a different status code, it is assumed dead.  HTTP 1.0 and 1.1 support available, using HTTP 1.1 allows you to check host header enabled web servers.

Typical, you read the exact line of information you need AND understand it after having figured it out. Now linking that information (yes we always read all manuals completely Embarrassed smile) to the situation at hand isn’t always that fast a process but I got there in the end with some help from KEMP Technologies.

One hint is perhaps to mention this is in the handy tips that pop up when you hover over a setting in the LoadMaster console. I rely on this a lot and a mention of “HTTP 1.1 allows you to check host header enabled web servers” might have helped me out. But it’s not there. A very poor excuse I know … Embarrassed smile

image

Host Header Names & HTTP redirection

After having fix this issue I proceeded to configure HTTP redirect in IIS 7.5. For this is used two sites. One was just a fake site tied to the www.surveyors.lab hostname in IIS on port 80.

image

For this site I created a HTTP redirect to www.bussines.lab/surveyors/services. This works just fine as long as you don’t forget the http:// in the redirect URL.

image

So it has to be http://www.bussines.lab/surveyors/services or you’ll get a funky loop effect looking like this:

http://www.surveyors.lab/www.bussines.lab/surveyors/services/www.bussines.lab/surveyors/services/www.bussines.lab/surveyors/services

Firefox will tell you you have a loop that will never end but Internet Explorer doesn’t, it just fails. You do get that URL as a pointer to the cause of the issue. That is if you can relate it to that.

The other was the real site  and was configured with following bindings and without redirection.

image

Don’t forget to do this on all real servers in the farm! The next thing I need to find out is how to health check two host names in the LoadMaster as I have two websites with the same IP address, port but different host names.

System Center Virtual Machine Manager 2008 R2 Error 12711 & The cluster group could not be found (0×1395)


The Issues

I recently had to go and fix some issues with a couple of virtual machines in SCVMM 2008 R2. There was one that failed to live migrate with following error:

Error (12711)
VMM cannot complete the WMI operation on server HopelessVm.test.lab because of
error: [MSCluster_ResourceGroup.Name=" df43bf60-7216-47ed-9560-7561d24c7dc8"] The cluster group could not be found.

(The cluster group could not be found (0×1395))
 
Recommended Action
Resolve the issue and then try the operation again

Other than that it looked fine and could be managed with SCVMM 2008 R2. Another one was totally wrecked it seemed. It was in a failed state after an attempted live migration. You couldn’t do anything with it anymore. Repair was “available” but every option there failed so basically that was the end of the game with that VM. Both issues can be resolved with the approach I’ll describe below.

The Cause

After some investigation the cause of this was the fact that this virtual machine had been removed from the failover cluster as a resource was exported & imported using Hyper-V manager on one of the cluster nodes. It was then added back to the failover cluster again to make them high available. All this was done without removing it from SCVMM 2008 R2. By the way, as mentioned above in “The Issues” this can get even worse than just failing live migrations. The same scenario can lead to virtual machines going into a failed state that you can’t repair (retry or undo fail) or ignore and basically you’re stuck at that point. You can’t even stop, start, shutdown the virtual machine anymore, not one single operation works in SCVMM while in the failover cluster GUI and in hyper-v manager everything is fully operational. This is important to note, as the services are fully on line and functional. It’s just in SCVMM that you’re in trouble.

Why did they do it this way? They did it to move the VM to a new CSV. The fact that you delete the VM files when deleting a VM with SCVmm2008R2 made them use Hyper-V manager instead. Now this approach (whatever you think of it) can work but then you need to delete the VM in SCVMM2008R2 after exporting the virtual machine AND before proceeding with the import and making the virtual machine highly available.

People get creative in how to achieve things due to inconsistencies, differences in functionality between Hyper-V Manger and SCVMM 2008R2 (in the latter especially the lack of complete control over naming, files & folders, export/migration behavior) as well as the needs of the failover cluster can lead to some confusing scenarios.

The Supported Fix

Now the easy way to fix this is to export the virtual machine again and delete it in SCVMM 2008 R2. That will remove the virtual machine object from SCVMM, the failover cluster en Virtual Machine Manager. However this virtual machine was so large (50Gb + 750 GB data disk) that there was no room for an export to be made. Secondly an export of such a large VM takes a considerable time and it has to be off line for this operation. This is annoying as SCVMM might be uncooperative at this point, the virtual machine is online en performing it’s duties for the business. So this presented us with a bit of a problem. Stopping the virtual machine, Exporting it using Hyper-V Manager will cause it to go missing in SCVMM 2012 and then you can delete it, importing the virtual machine again and adding it to the failover cluster causes down time.

The Root Cause

Why does this happen? Well when you import a virtual machine into a failover cluster is creates a new unique ID for the virtual machine Resource Group . This happens always. Choosing to reuse an existing ID during import in Hyper-V Manager has nothing to do with this. But VMM uses ID/names to identify a VM, independent of the cluster. So when you did not remove the VM from SCVMM before adding the VM back to the cluster you get a different cluster group ID in the cluster than you have in SCVMM. They both have the same name but there is a disconnect leading to the issues described above.

By the way exporting & importing a VM without first removing the virtual machine from the failover cluster leads to some issues in the Failover cluster so don’t do that either Smile

The “No Down Time” Fix

This is not the first time we need to dive in to the SCVMM database to fix issues. One of my main beef about SCVMM other than inconsistency with the other tools and its lack of control & options in some scenarios is the fact that it doesn’t have enough self-maintenance intelligence & functionality. This leads to the workaround above which are slow and rather annoying or consist of messing around in the SCVMM database, which isn’t exactly supported. Mind you Microsoft has published some T-SQL to clean up such issues themselves. See You cannot delete a missing VM in SCVMM 2008 or in SCVMM 2008 R2 and RemoveMissingVMs. See also my blog SCVMM 2008 R2 Phantom VM guests after Blue Screen post on this subject.

The usual tricks of the trade like refreshing the virtual machine configuration in the failover cluster GUI don’t work here. Neither does the solution to this error described Migrating a System Center Virtual Machine Manager 2008 VM from one cluster to another fails with error 12711. The error is the same but not the cause.

# Add the VMM cmdlets
Add-PSSnapin microsoft.systemcenter.virtualmachinemanager

# Connect to the VMM server
Get-VMMServer –ComputerName MySCVMMServer.test.lab

# Grab the problematic VM and put it into the object $vm
$vm = Get-VM –name “HopelessVM”

#Force a refresh
refresh-vm -force  $vm

In the end we have to fix the mismatch between the VMResourceGroupID in failover cluster and SCVMM by editing the database.

First you navigate to the registry key HKEY_LOCAL_MACHINE\Cluster\Groups\ on one the cluster nodes, do a find for the problematic VM’s name and grab the name of its key, this is the VMResourceGroupID the cluster knows and works with? So now we have the correct VMResourceGroupID: 0f8cabe4-f773-4ae4-b431-ada5a3c9926c

clip_image002

Now you connect to the SCVMM database and run following query to find the VMResourceGroupID that SCVMM thinks that VM has and that it uses causing the issues

SELECT  VMResourceGroupID  FROM tbl_WLC_VMInstance WHERE ComputerName = 'hopelessVM.test.lab'
GO 

The results:

VMResourceGroupID

————————————————–

df43bf60-7216-47ed-9560-7561d24c7dc8

(1 row(s) affected)

The trick than is to simply update that value to the one you just got from the registry by running:

UPDATE tbl_WLC_VMInstance SET VMResourceGroupID = '0f8cabe4-f773-4ae4-b431-ada5a3c9926c' WHERE VMResourceGroupID = 'df43bf60-7216-47ed-9560-7561d24c7dc8'
GO 

Than you need some patience & refresh the GUI a few times. Things will turn back to normal, but in between you might seem some “missing” statuses appear for your problematic VM. These go away fast however. If not you can always use the Microsoft provided script to remove missing VM’s as mentioned above in RemoveMissingVMs.

Warning

What I described above is something you can do to fix these issues fast and effectively when needed. But I’m not telling you this is the way to go, let alone that this is supported. Make sure you have backups of your VMs, Hosts, SCVMM database etc. It only takes one mistake or misinterpretation to royally shoot yourself in your foot Winking smile. It hurts like hell; recovery is long and seldom complete. On top of that it might generate a vacancy in your company whilst you’re escorted out of the building. Be careful out there.

Active-Active File sharing with SMB 2.2 Scale Out in Windows 8 Rocks


Introduction

Wow. That’s what I have to say. WOW! I configured a two node virtual machines 

cluster running Windows 8 Server Developer Preview to test the SMB2 Scale Out functionality and I smiling. In my previous blog Transparent Failover & Node Fault Tolerance With SMB 2.2 Tested I already tested the transparent failover with a more traditional active-passive file cluster and that was pretty neat. But there are two things to note:

  1. The most important one to me is that the experience with transparent failover isn’t as fluid for the end user as it should be in my opinion. That freeze is a bit to long to be comfortable. Whether that will change remains to be seen. It’s early days yet.
  2. The entire active-passive concept doesn’t scale very well to put it mildly. Whether this is important to you depends on your needs. Today one beefy well, configured server can server up a massive amount of data to a large number of users. So in  a lot of environments this might not be an issue at all (it’s OK not to be running a 300.000 user global file server infrastructure, really Winking smile).

So bring in “File Server For Scale-Out Application Data” which is an active/active cluster. This is intended for use by  applications like SQL server & Hyper.-V for example. It’s high speed and low drag high available file sharing based on SMB 2.2, Clusters Shared Volumes and failover clustering. The thing is, at this moment, it is not aimed at end user file sharing (hence it’s name ““File Server For Scale-Out Application Data”. When I heard that,  I was a going “come on Microsoft, get this thing going for end user data as well”. Now that I have tested this in the lab, I want this only more. Because the experience is much more fluid. So I have to ask Microsoft to please get this setup supported in a production environment for all file sharing purposes! This is so awesome as an experience for both applications AND end users. The other approach that would          work (except perhaps for scaling) is making the transparent failover for an active-passive file cluster more fluid. But again, early days yet.

Setting  Up The Lab

Build a “File Server for scale-out application data” cluster

You need three virtual machines running Windows 8, two to build the cluster and one to use as a client.Once you have the cluster you configure storage to be used as a Clustered Shared Volume (CSV)

image

You’ll see the progress bar adding the storage to CSV

image

And voila you have CSV storage configured. Note that you don’t have to enable it any more and that there are no more warnings that this is only supported for Hyper-V data.

image

Now navigate to Role, right click and select “Configure Roles”

image

This brings up the High Availability Wizard

image

Click Next and select “File Server for scale-out application data”

image

Give the Client Access Point a name

image

Click Next and on the following wizard page click confirm

image

And voila you’re done. Do notice the wizards skips the “Configure High Availability” step here.

image

Get a share up and running for use

Don’t make the mistake of trying to double click on the you see in the Role. Go to the node who’s the owner of the role and navigate to the role “ScaleOut”, right click and select add shared folder.

image

Select the cluster shared volume on the server “ScalingOut” which is actually the client access point.

image

I gave the share the name SOFS (Scale Out File Share)

image

I like Access Based Enumerations so I enable this next to Enable continuous availability that is enabled by default.

image

Than you get to the permissions settings. Here you have to make sue you set the share permissions to  more than read if you want to do some writing to the share. Nothing new here Winking smile

image

After that you’re almost done. Confirm your settings & click Commit

image

Watch the wizard do it’s magic

image

And it’s all setup

image

Play Time

We have a third node “Independence” running Windows 8 Server to use as a client. As you can see we can easily navigate  to the “server” via the access point.

image

And yes that’s about all you have to do. You can see the ease of name space management at work here.

Now let’s copy some data and turn of a one of the cluster nodes, the one that owns the role for example …

image

I was copying the content of the Windows 8 Server folder from Independence and failed over the node, the client did not notice anything. I turned off the node holding the role and still the client did only notice as short delay (a couple of seconds max). This was a complete transparent experience. I cannot stress enough how much I want this technology for my business customers. You can patch, repair, replace, file server nodes at will at any given moment en no application or user has to notice a thing. People, this is Walhalla. This is is the place where brave file server administrators that have served their customers well over the years against all odds have the right to go. They’ve earned this! Get this technology in their hands and yes even for end user file data. Or at least make the transparent failover for user file sharing as fluid. Make it happen Microsoft! And while I’m asking, will there ever be a SMB 2.2 installable client for Windows 7? In SP2, please?!

Learn more here by watching the sessions from the Build conference at http://www.buildwindows.com/Sessions

Noticed bugs

The shares don’t always show up in the share pane, after failover.

Conclusion

This is awesome, this is big, this is a game changer in the file serving business. Listen, file services are not dead, far from it. It wasn’t very sexy and we didn’t get the holey grail of high availability for that role as of yet until now. I have seen the future and it looks great. Set up a lab people and play at will. Take down servers in any way imaginable and see your file activities survive without at hint of disruption. As long a you make sure that you have multiple nodes in the cluster and that if these are virtual machines they always reside on different nodes in a failover cluster it will take a total failure of the entire cluster to bring you file services down. So how do you like them apples?

Data Protection & Disaster Recovery in Windows 8 Server Hyper-V 3.0


The news coming in from the Build Windows conference is awesome. The speculation of the last months is being validated by what is being told and on top of that more goodness is thrown at us Hyper-V techies.

On the data protection and disaster recovery front some great new weapons are at our disposal. Let’s take a look at some of them.

Live Migration & Storage Live Migration.

Among the goodies are the improvements in Live Migration and the introduction of Storage Live Migration.  Hyper-V 3.0 supports multiple concurrent Live Migrations now, which combined with adequate bandwidth will provide for fast evacuation of problematic hosts. Storage Live Migration means you can move a VM (configuration, VHD & snapshots) to different storage while the guest remains on line so the users are not hindered by this. I’m trying to find out if they will support multiple networks / NICs  with this.

Now to make this shine even more MSFT has another ace up it’s sleeve. You can do Live Migration and Storage Live Migration without the requirement of shared storage on the backend. This combination is a big one. This is means “shared nothing” high availability. Even now when prices for entry level shard storage has plummeted we see SMB being weary of SAN technology. It’s foreign to them and the fact they haven’t yet gained any confidence with the technology makes them hesitant. Also the real or perceived complexity might hold ‘m back. For that segment of the market it is now possible to have high availability anyway with the combo Live Migration / Storage Migration.  Add to this that Hyper-V now supports running virtual machines on a file share and you can see the possibilities of NAS appliances in this space of the market for achieving some very nice solutions.

Replication to complete the picture

To top this of you have replication built in, meaning we have the possibility to provide reasonably fast disaster recovery. It might not be real time data center fail over but a lot of clients don’t need that. However, they do need easy recoverability and here it is. To give you even more options, especially  if you only have one location, you can replicate to the cloud.

So now I start dreaming Smile We have shared nothing Live & Storage Live  Migration, we have replication. What could achieve with this? Do synchronous replication locally over a 10Gbps for example and use that to build something like continuous availability. There we go, we already have requirements for “Windows 8 Server R2”!

NIC Teaming in the OS

No more worries about third party NIC teaming woes. It has arrived in the OS (finally!) and it will support load balancing & failover. I welcome this, again it makes this a lot more feasible for the SMB shops.

IP Virtualization / Address Mobility

Another thing that will aid with any kind of of site  disaster recovery / high availability is IP address Mobility. You have an IP for the hosting of the VM and one for internal use by VM. That means you can migrate to other environments (cloud, remote site) with other addresses as the VM can change the hosted IP address, while the internal IP address remains the same.  Just imagine the flexibility this gives us during maintenance, recovery, trouble shooting network infrastructure issues and all this without impacting the users who depend on the VM to get their job done.

Conclusion

Everything we described is out of the box with Windows 8 Server Hyper-V. To a lot of business this can  mean a  huge improvement in their current  availability and disaster recovery situation. More than ever there is now no more reason for any company to go down or even out of business due to catastrophic data loss as all this technology is available on site, in hybrid scenarios and in the cloud with the providers.

Assigning Large Memory To Virtual Machine Fails: Event ID 3320 & 3050


We had a kind reminder recently that we shouldn’t forget to complete all steps in a Hyper-V cluster node upgrade process. The proof of a plan lies in the execution Smile. We needed to configure a virtual machine with a whooping 50GB of memory for an experiment. No sweat, we have plenty of memory in those new cluster nodes. But when trying to do so it failed with a rather obscure error in System Center Virtual Machine Manager 2008 R2

Error (12711)

VMM cannot complete the WMI operation on server hypervhost01.lab.test because of error: [MSCluster_Resource.Name="Virtual Machine MYSERVER"] The group or resource is not in the correct state to perform the requested operation.

(The group or resource is not in the correct state to perform the requested operation (0x139F))

Recommended Action

Resolve the issue and then try the operation again.

image

One option we considered was that SCVMM2008R2 didn’t want to assign that much memory as one of the old host was still a member of the cluster and “only” has 48GB of RAM. But nothing that advanced was going on here. Looking at the logs found the culprit pretty fast: lack of disk space.

We saw following errors in the Microsoft-Windows-Hyper-V-Worker-Admin event log:

Log Name:      Microsoft-Windows-Hyper-V-Worker-Admin
Source:        Microsoft-Windows-Hyper-V-Worker
Date:          17/08/2011 10:30:36
Event ID:      3050
Task Category: None
Level:         Error
Keywords:     
User:          NETWORK SERVICE
Computer:      hypervhost01.lab.test
Description:
‘MYSERVER’ could not initialize memory: There is not enough space on the disk. (0×80070070). (Virtual machine ID DEDEFFD1-7A32-4654-835D-ACE32EEB60EE)

Log Name:      Microsoft-Windows-Hyper-V-Worker-Admin
Source:        Microsoft-Windows-Hyper-V-Worker
Date:          17/08/2011 10:30:36
Event ID:      3320
Task Category: None
Level:         Error
Keywords:     
User:          NETWORK SERVICE
Computer:      hypervhost01.lab.test
Description:
‘MYSERVER’ failed to create memory contents file ‘C:\ClusterStorage\Volume1\MYSERVER\Virtual Machines\DEDEFFD1-7A32-4654-835D-ACE32EEB60EE\DEDEFFD1-7A32-4654-835D-ACE32EEB60EE.bin’ of size 50003 MB. (Virtual machine ID DEDEFFD1-7A32-4654-835D-ACE32EEB60EE)

Sure enough a smaller amount of memory, 40GB, less than the remaining disk space on the CSV did work. That made me remember we still needed to expand the LUNS on the SAN to provide for the storage space to store the large BIN files associated with these kinds of large memory configurations. Can you say "luxury problems"? The BIN file contains the memory of a virtual machine or snapshot that is in a saved state. Now you need to know that the BIN file actually requires the same disk space as the amount of physical memory assigned to a virtual machine. That means it can require a lot of room. Under "normal" conditions these don’t get this big and we provide a reasonable buffer of free space on the LUNS anyway for performance reasons, growth etc. But this was a bit more than that buffer could master.

As it was stated in the planning that we needed to expand the LUNS a bit to be able to deal with this kind of memory hogs this meant that the storage to do so was available and the LUN wasn’t maxed out yet. If not, we would have been in a bit of a pickle.

So there you go a real life example of what Aidan Finn warns about when using dynamic memory. Also see KB 2504962 “Dynamic Memory allocation in a Virtual Machine does not change although there is available memory on the host” which discusses the scenario where dynamic memory allocation seems not to work due to lack of disk space. Don’t forget about your disk space requirements for the bin files when using virtual machines with this much memory assigned. They tend to consume considerable chunks of your storage space. And even if you don’t forget about it in your planning, please don’t forget the execute every step of the plan Winking smile

Introducing 10Gbps & Integrating It into Your Network Infrastructure (Part 4/4)


This is a 4th post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

In my blog post “Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)” in a series of thoughts on 10Gbps and Hyper-V networking a discussion on NIC teaming brought up the subject of 10Gbps for virtual machine networks. This means our switches will probably no longer exist in isolation unless those virtual Machines don’t ever need to talk to anything outside what’s connected to those switches. This is very unlikely. That means we need to start thinking and talking about integrating the 10Gbps switches in our network infrastructure. So we’re entering the network engineers their turf again and we’ll need to address some of their concerns. But this is not bad news as they’ll help us prevent some bad scenarios.

Optimizing the use of your 10Gbps switches

As not everyone runs clusters big enough, or enough smaller clusters, to warrant an isolated network approach for just cluster networking. As a result you might want to put some of the remaining 10Gbps ports to work for virtual machine traffic. We’ve already pointed out that your virtual machines will not only want to talk amongst themselves (it’s a cluster and private/internal networks tend to defeat the purpose of a cluster, it just doesn’t make any sense as than they are limited within a node) but need to talk to other servers on the network, both physical and virtual ones. So you have to hook up your 10Gbps switches from the previous example to the rest of the network. Now there are some scenarios where you can keep the virtual machine networks isolated as well within a cluster. In your POC lab for example where you are running a small 100% virtualized test domain on a cluster in a separate management domain but these are not the predominant use case.

But you don’t only have to have to integrate with the rest of your network, you may very well want to! You’ve seen 10Gbps in action for CSV and Live Migration and you got a taste for 10Gbps now, you’re hooked and dream of moving each and every VM network to 10Gbps as well. And while your add it your management network and such as well. This is nothing different from the time you first got hold of 1Gbps networking kit in a 100 Mbps world. Speed is addictive, once you’re hooked you crave for more Smile

How to achieve this? You could do this by replacing the existing 1Gbps switches. That takes money, no question about it. But think ahead, 10Gbps will be common place in a couple of years time (read prices will drop even more). The servers with 10Gbps LOM cards are here or will be here very soon with any major vendor. For Dell this means that the LOM NICs will be like mezzanine cards and you decide whether to plug in 10Gbps SPF+ or Ethernet jacks. When you opt to replace some current 1Gbps switches with 10gbps ones you don’t have to throw them away. What we did at one location is recuperate the 1Gbps switches for out of band remote access (ILO/DRAC cards) that in today’s servers also run at 1Gbps speeds. Their older 100Mbps switches where taken out of service. No emotional attachment here. You could also use them to give some departments or branch offices 1gbps to the desktop if they don’t have that yet.

When you have ports left over on the now isolated 10Gbps switches and you don’t have any additional hosts arriving in the near future requiring CSV & LM networking you might as well use those free ports. If you still need extra ports you can always add more 10Gbps switches. But whatever the case, this means up linking those cluster network 10Gbps switches to the rest of the network. We already mentioned in a previous post the network people might have some concerns that need to be addressed and rightly so.

Protect the Network against Loops & Storms

The last thing you want to do is bring down your entire production network with a loop and a resulting broadcast storm. You also don’t want the otherwise rather useful spanning tree protocol, locking out part of your network ruining your sweet cluster setup or have traffic specifically intended for your 10Gbps network routed over a 1Gbps network instead.

So let us discuss some of the ways in which we can prevent all these bad things from happening. Now mind you, I’m far from an expert network engineer so to all CCIE holders stumbling on to this blog post, please forgive me my prosaic network insights. Also keep in mind that this is not a networking or switch configuration course. That would lead us astray a bit too far and it is very dependent on your exact network layout, needs, brand and model of switches etc.

As discussed in blog post Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4) you need a LAG between your switches as the traffic for the VLANs serving heartbeat, CSV, Live Migration, virtual machines but now also perhaps the host management and optional backup network must flow between the switches. As long as you have only two switches that have a LAG between them or that are stacked you have not much risk of creating a loop on the network. Unless you uplink two ports directly with a network cable. Yes, that happens, I once witnessed a loop/broadcast storm caused by someone who was “tidying up” spare CAT5E cables buy plugging all ends up into free port switches. Don’t ask. Lesson learned: disable every switch port not in use.

Now once you uplink those two or more 10Gbps switches to your other switches in a redundant way you have a loop. That’s where the Spanning Tree protocol comes in. Without going into detail this prevents loops by blocking the redundant paths. If the operational path becomes unavailable a new path is established to keep network traffic flowing. There are some variations in STP. One of them is Rapid Spanning Tree Protocol (RSTP) that does the same job as STP but a lot faster. Think a couple of seconds to establish a path versus 30 seconds or so. That’s a nice improvement over the early days. Another one that is very handy is the Multiple Spanning Tree Protocol (MSTP). The sweet thing about the latter is that you have blocking per VLANs and in the case of Hyper-V or storage networks this can come in quite handy.

Think about it. Apart from preventing loops, which are very, very bad, you also like to make sure that the network traffic doesn’t travel along unnecessary long paths or over links that are not suited for its needs. Imagine the Live Migration traffic between two nodes on different 10Gbps switches travelling over the 1Gbps uplinks to the 1Gbps switches because the STP blocked the 10Gbps LAG to prevent a loop. You might be saturating the 1Gbps infrastructure and that’s not good.

I said MSTP could be very handy, let’s address this. You only need the uplink to the rest of the network for the host management and virtual machine traffic. The heartbeat, CSV and Live Migration traffic also stops flowing when the LAG between the two 10Gbps switches is blocked by the RSTP. This is because RSTP works on the LAG level for all VLANs travelling across that LAG and doesn’t discriminate between VLAN. MSTP is smarter and only blocks the required VLANS. In this case the host management and virtual machines VLANS as these are the only ones travelling across the link to the rest of the network.

We’ll illustrate this with some picture based on our previous scenarios. In this example we have the cluster network going to the 10Gbps switches over non teamed NICs. The same goes for the virtual machine traffic but the NICs are teamed, and the host management NICS. Let’s first show the normal situation.

 clip_image002

Now look at a situation where RSTP blocks the purple LAG. Please do note that if the other network switches are not 10Gbps that traffic for the virtual machines would be travelling over more hops and at 1Gbps. This should be avoided, but if it does happens, MSTP would prevent an even worse scenario. Now if you would define the VLANS for cluster network traffic on those (orange) uplink LAGs you can use RSTP with a high cost but in the event that RSTP blocks the purple LAG you’d be sending all heartbeat, CSV and Live Migration traffic over those main switches. That could saturate them. It’s your choice.

clip_image004

In the picture below MSTP saves the day providing loop free network connectivity even if spanning tree for some reasons needs to block the LAG between the two 10Gbps switches. MSTP would save your cluster network traffic connectivity as those VLAN are not defined on the orange LAG uplinks and MSTP prevents loops by blocking VLAN IDs in LAGs not by blocking entire LAGs

clip_image006

To conclude I’ll also mention a more “star like” approach in up linking switches. This has a couple of benefits especially when you use stackable switches to link up to. They provide the best bandwidth available for upstream connections and they provide good redundancy because you can uplink the lag to separate switches in the stack. There is no possibility for a loop this way and you have great performance on top. What’s not to like?

clip_image008

Well we’ve shown that each network setup has optimal, preferred network traffic paths. We can enforce these by proper LAG & STP configuration. Other, less optimal, paths can become active to provide resiliency of our network. Such a situation must be addressed as soon as possible and should be considered running on “emergency backup”. You can avoid such events except for the most extreme situations by configuring the RSTP/MSTP costs for the LAG correctly and by using multiple inter switch links in every LAG. This does not only provide for extra bandwidth but also protects against cable or port failure.

Conclusion

And there you have it, over a couple of blog posts I’ve taken you on a journey through considerations about not only using 10Gbps in your Hyper-V cluster environments, but also about cluster networks considerations on a whole. Some notes from the field so to speak. As I told you this was not a deployment or best practices guide. The major aim was to think out loud, share thoughts and ideas. There are many ways to get the job done and it all depends on your needs an existing environment. If you don’t have a network engineer on hand and you can’t do this yourself; you might be ready by now to get one of those business ready configurations for your Hyper-V clustering. Things can get pretty complex quite fast. And we haven’t even touched on storage design, management etc.. The purpose of these blog was to think about how Hyper-V clustering networks function and behave and to investigate what is possible. When you’re new to all this but need to make the jump into virtualization with both feet (and you really do) a lot help is available. Most hardware vendors have fast tracks, reference architectures that have a list of components to order to build a Hyper-V cluster and more often than not they or a partner will come set it all up for you. This reduces both risk and time to production. I hope that if you don’t have a green field scenario but want to start taking advantage of 10Gbps networking; this has given you some food for thought.

I’ll try to share some real life experiences,what improvements we actually see, with 10Gbps speeds in a future blog post.