SMB Direct over RoCE Demo – Hosts & Switches Configuration Example


As mentioned in Where SMB Direct, RoCE, RDMA & DCB fit into the stack this post’s only function is to give you an overview of the configurations used in the demo blogs/videos. First we’ll configure one Windows Server 2012 R2 host. I hope it’s clear this needs to be done on ALL hosts involved. The NICs we’re configuring are the 2 RDMA capable 10GbE NICs we’ll use for CSV traffic, live migration and our simulated backup traffic. These are Mellanox ConnectX-3 RoCE cards we hook up to a DCB capable switch. The commands needed are below and the explanation is in the comments. Do note that the choice of the 2 policies, priorities and minimum bandwidths are for this demo. It will depend on your environment what’s needed.

#Install DCB on the hosts
Install-WindowsFeature Data-Center-Bridging
#Mellanox/Windows RoCE drivers don't support DCBx (yet?), disable it.
Set-NetQosDcbxSetting -Willing $False
#Make sure RDMA is enable on the NIC (should be by default)
Enable-NetAdapterRdma –Name RDMA-NIC1
Enable-NetAdapterRdma –Name RDMA-NIC2
#Start with a clean slate
Remove-NetQosTrafficClass -confirm:$False
Remove-NetQosPolicy -confirm:$False

#Tag the RDMA NIC with the VLAN chosen for PFC network
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC-1" -RegistryKeyword "VlanID" -RegistryValue 110
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC-2" -RegistryKeyword "VlanID" -RegistryValue 120

#SMB Direct traffic to port 445 is tagged with priority 4
New-NetQosPolicy "SMBDIRECT" -netDirectPortMatchCondition 445 -PriorityValue8021Action 4
#Anything else goes into the "default" bucket with priority tag 1 🙂
New-NetQosPolicy "DEFAULT" -default  -PriorityValue8021Action 1

#Enable PFC (lossless) on the priority of the SMB Direct traffic.
Enable-NetQosFlowControl -Priority 4
#Disable PFC on the other traffic (TCP/IP, we don't need that to be lossless)
Disable-NetQosFlowControl 0,1,2,3,5,6,7

#Enable QoS on the RDMA interface
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC1"
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC2"

#Set the minimum bandwidth for SMB Direct traffic to 90% (ETS, optional)
#No need to do this for the other priorities as all those not configured
#explicitly goes in to default with the remaining bandwith.
New-NetQoSTrafficClass "SMBDirect" -Priority 4 -Bandwidth 90 -Algorithm ETS

We also show you in general how to setup the switch. Don’t sweat the exact syntax and way of getting it done. It differs between switch vendors and models (we used DELL Force10 S4810 and PowerConnect 8100 / N4000 series switches), it’s all very alike and yet very specific. The important thing is that you see how what you do on the switches maps to what you did on the hosts.

!Disable 802.3x flow control (global pause)- doesn't mix with DCB/PFC
workinghardinit#configure
workinghardinit(conf)#interface range tengigabitethernet 0/0 -47 
workinghardinit(conf-if-range-te-0/0-47)#no flowcontrol rx on tx on
workinghardinit(conf-if-range-te-0/0-47)# exit
workinghardinit(conf)# interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48-52)#no flowcontrol rx on tx off
workinghardinit(conf-if-range-fo-0/48-52)#exit

!Enable DCB & Configure VLANs
workinghardinit(conf)#service-class dynamic dot1p
workinghardinit(conf)#dcb enable
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config
workinghardinit#reload

!We use a <> VLAN per subnet
workinghardinit#configure
workinghardinit(conf)#interface vlan 110
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit(conf)#interface vlan 120
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit (conf-if-vl-vlan-id*)#exit


!Create & configure DCB Map Policy
workinghardinit(conf)#dcb-map SMBDIRECT
workinghardinit(conf-dcbmap-profile-name*)#priority-group 0 bandwidth 90 pfc on 
workinghardinit(conf-dcbmap-profile-name*)#priority-group 1 bandwidth 10 pfc off 
workinghardinit(conf-dcbmap-profile-name*)#priority-pgid 1 1 1 1 0 1 1 1
workinghardinit(conf-dcb-profile-name*)#exit 

!Apply DCB map to the switch ports & uplinks
workinghardinit(conf)#interface range ten 0/047
workinghardinit(conf-if-range-te-0/0-47)# dcb-map SMBDIRECT 
workinghardinit(conf-if-range-te-0/0-47)#exit
workinghardinit(conf)#interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48,fo-0/52)# dcb-map SMBDIRECT
workinghardinit(conf-if-range-fo-0/48,fo-0/52)#exit
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config 

With the hosts and the switches configured we’re ready for the demos in the next two blog posts. We’ll show Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS) in action with some tips on how to test this yourselves.

Advertisements

SMB Direct with DCB, PFC, ETS … How do I know it works?!


A question that comes up over time, again and again, is how do you know SMB Direct is working. The question stems from a nagging feeling that configuring DCB is a bit of playing wizard’s apprentice and we might not completely know what we’re doing, i.e. lack of experience.

image

Many have suspected me of brewing up DCB configurations in a dark corner of the data center where no one else dares venture. But those are unsubstantiated rumors. But in coming blog posts we’ll address how to configure it end to end and we’ll show how to find out if it’s really working and how to test that.

Finding out if it really works, testing and monitoring isn’t magic. It boils down to using tools you know. Performance counters for RDMA Activity and SMB direct are natively available in Windows. Use them!The NIC vendors also provide very detailed counters, those are excellent and of great value when testing and confirming things work as they should. The latter is very important. Because after people are satisfied SMB Direct works they want to know if DCB is configured correctly. Does PFC work, are pause frames being send and received? Is it really lossless?  Does ETS really kick in when needed, do I get the minimum bandwidth I configured? These are very valid questions people struggle with. But the answer eludes many, almost like the question if the refrigerator light really goes out when you close the door.

It’s hard to do deep down in the network packets … that often requires a very specialized skillset and experience with packet analyzers etc. Nothing most of you can’t learn but often this is not a priority. But with some creativity and the performance counters on windows provided by the NIC vendors and the statistics counters on the switches you can demonstrate that both PFC & ETS doe work and kick in.

So in upcoming blogs & videos I’ll demonstrate the configuring SMB Direct over RoCE leveraging 2 parts of DCB:

  • PFC (Priority Flow Control) – mandatory for SMB Direct over RoCE
  • ETS (Enhanced Transmission Selection) – optional but I advise you to leveraged it for SMB Direct over RoCE

Actually, when doing true converged, no matter what route you go, QoS is not really optional any more.

The biggest challenge is to get people to wrap their heads around the concepts and it’s behavior. Once you do that you’ll understand how and why to configure it. It took me time and effort, there’s no way around it, but it’s well worth the effort.

Look, DCB is not 100% fully matured or perfect especially in large scale environments over > 2 or 3 hops. Frak, while I love tinkering, testing and playing with this stuff I have never been a “QoS first person”. If I can I thrown resources at the problem (CPU cycles; memory, bandwidth, …). QoS is like a gun. You only draw it when you must use it and than you’d better do it right otherwise you don’t touch it, bar for practice/training/ education. While perfection is not of this world and improvements are being worked on (ECN) it does work and deliver. How many of you had a large scale > 2 hops , > 20 switches deployment with FC, FCoE or iSCSI to worry about? So can it deliver what you need today in most scenarios? Yes! Can I fix the short comings of any random technologies? No. Can I leverage current technologies with great success despite this? Yes! So can you. There is a reason I get hired and paid. Trust me it’s not my looks, my bed side manner or charismatic appearance Winking smile.

Side note 1: I’m cannot possibly provide a switch configuration guide in a step by step fashion as the details vary by vendor, they can also be switch model/type specific and it all depends on your environment & needs. So no I cannot and will not attempt to write a bunch of these. This would be way too much work and way too expensive (time, hardware etc.), so unless I’m paid very generously to do so, you’re out of luck. It might be cheaper to hire me or to come to the free community sessions, presentations, ATE evenings and study up.

Optimizing Backups: PowerShell Script To Move All Virtual Machines On A Cluster Shared Volume To The Node Owing That CSV


When you are optimizing the number of snapshots to be taken for backups or are dealing with storage vendors software that leveraged their hardware VSS provider you some times encounter some requirements that are at odds with virtual machine mobility and dynamic optimization.

For example when backing up multiple virtual machines leveraging a single CSV snapshot you’ll find that:

  • Some SAN vendor software requires that the virtual machines in that job are owned by the same host or the backup will fail.
  • Backup software can also require that all virtual machines are running on the same node when you want them to be be protected using a single CSV snap shot. The better ones don’t let the backup job fail, they just create multiple snapshots when needed but that’s less efficient and potentially makes you run into issues with your hardware VSS provider.

image

VEEAM B&R v8 in action … 8 SQL Server VMs with multiple disks on the same CSV being backed up by a single hardware VSS writer snapshot (DELL Compellent 6.5.20 / Replay Manager 7.5) and an off host proxy Organizing & orchestrating backups requires some effort, but can lead to great results.

Normally when designing your cluster you balance things out a well as you can. That helps out to reduce the needs for constant dynamic optimizations. You also make sure that if at all possible you keep all files related to a single VM together on the same CSV.

Naturally you’ll have drift. If not you have a very stable environment of are not leveraging the capabilities of your Hyper-V cluster. Mobility, dynamic optimization, high to continuous availability are what we want and we don’t block that to serve the backups. We try to help out to backups as much a possible however. A good design does this.

If you’re not running a backup every 15 minutes in a very dynamic environment you can deal with this by live migrating resources to where they need to be in order to optimize backups.

Here’s a little PowerShell snippet that will live migrate all virtual machines on the same CSV to the owner node of that CSV. You can run this as a script prior to the backups starting or you can run it as a weekly scheduled task to prevent the drift from the ideal situation for your backups becoming to huge requiring more VSS snapshots or even failing backups. The exact approach depends on the storage vendors and/or backup software you use in combination with the needs and capabilities of your environment.

cls

$Cluster = Get-Cluster
$AllCSV = Get-ClusterSharedVolume -Cluster $Cluster

ForEach ($CSV in $AllCSV)
{
    write-output "$($CSV.Name) is owned by $($CSV.OWnernode.Name)"
    
    #We grab the friendly name of the CSV
    $CSVVolumeInfo = $CSV | Select -Expand SharedVolumeInfo
    $CSVPath = ($CSVVolumeInfo).FriendlyVolumeName

    #We deal with the  being and escape character for string parsing.
    $FixedCSVPath = $CSVPath -replace '\', '\'

    #We grab all VMs that who's owner node is different from the CSV we're working with
    #From those we grab the ones that are located on the CSV we're working with
      $VMsToMove = Get-ClusterGroup | ? {($_.GroupType –eq 'VirtualMachine') -and ( $_.OwnerNode -ne $CSV.OWnernode.Name)} | Get-VM | Where-object {($_.path -match $FixedCSVPath)} 
     
    ForEach ($VM in $VMsToMove)

    {
        write-output "`tThe VM $($VM.Name) located on $CSVPath is not running on host $($CSV.OwnerNode.Name) who owns that CSV"
        write-output "`tbut on $($VM.Computername). It will be live migrated."
        #Live migrate that VM off to the Node that owns the CSV it resides on
        Move-ClusterVirtualMachineRole -Name $VM.Name -MigrationType Live -Node $CSV.OWnernode.Name
    }

Now there is a lot more to discuss, i.e. what and how to optimize for virtual machines that are clustered. For optimal redundancy you’ll have those running on different nodes and CSVs. But even beyond that, you might have the clustered VMs running on different cluster, which is the failure domain.  But I get the remark my blogs are wordy and verbose so … that’s for another time Winking smile

SMB Direct With RoCE in a Mixed Switches Environment


I’ve been setting up a number of Hyper-V clusters with  Mellanox ConnectX3 Pro dual port 10Gbps Ethernet cards. These Mellanox cards provide a nice amount of queues (128) for DVMQ and also give us RDMA/SMB Direct capabilities for CSV & live migration traffic.

Mixed Switches Environments

Now RoCE and DCB is a learning curve for all of us and not for the faint of heart. DCB configuration is non trivial, certainly not across multiple hops and different switches. Some say it’s to be avoided or can’t be done.

You can only get away with a single pair of (uniform) switches in smaller deployments. On top of that I’m seeing more and more different types of switches being used to optimize value, so it’s not just a lab exercise to do this. Combine this with the fact that DCB is an unavoidable technology in networking, unless it get’s replaced with something better and easier, and you might as well try and learn. So I did.

Well right now I’m successfully seeing RoCE traffic going across cluster nodes spread over different racks in different rows at excellent speeds. The core switches are DELL Force10 S4810 and the rack switches are PowerConnect 8132Fs. By borrowing an approach from spine/leave designs this setup delivers bandwidth where they need it a a price point they can afford. They don’t need more expensive switches for the rack or the core as these do support DCB and give the port count needed at the best price point.  This isn’t supposed to be the top in non blocking network design. Nope but what’s available & affordable today in you hands is better than perfection tomorrow. On top of that this is a functional learning experience for all involved.

We see some pause frames being sent once in a while and this doesn’t impact speed that very much. It does guarantee lossless traffic which is what we need for RoCE. When we live migrate 300GB worth of memory across the nodes in the different racks we get great results. It varies a bit depending on the load the switches & switch ports are under but that’s to be expected.

Now tests have shown us that we can live migrate just as fast with non RDMA 10Gbps as we can with RDMA leveraging “only” Multichannel. So why even bother? The name of the game low latency and preserving CPU cycles for SQL Server or storage traffic over SMB3. Why? We can just buy more CPUs/Cores. Great, easy & fast right? But then with SQL licensing comes into play and it becomes very expensive. Also storage scenarios under heavy load are not where you want to drop packets.

Will this matter in your environment? Great question! It depends on your environment. Sometimes RDMA is needed/warranted, sometimes it isn’t. But the Mellanox cards are price competitive and why not test and learn right? That’s time well spent and prepares you for the future.

But what if it goes wrong … ah well if the nodes fail to connect over RDAM you still have Multichannel and if the DCB stuff turns out not to be what you need or can handle, turn it of and you’ll be good.

RoCE stuff to test: Routing

Some claim it can’t be done reliably. But hey they said that for non uniform switch environments too Winking smile. So will it all fall apart and will we need to standardize on iWarp in the future?  Maybe, but isn’t DCB the technology used for lossless, high performance environments (FCoE but also iSCSI) so why would not iWarp not need it. Sure it works without it quite well. So does iSCSI right, up to a point? I see these comments a lot more form virtualization admins that have a hard time doing DCB (I’m one so I do sympathize) than I see it from hard core network engineers. As I have RoCE cards and they have become routable now with the latest firmware and drivers I’d love to try and see if I can make RoCE v2 or Routable RoCE work over different types of switches but unless some one is going to sponsor the hardware I can’t even start doing that. Anyway, lossless is the name of the game whether it’s iWarp or RoCE. Who know what we’ll be doing in 5 years? 100Gbps iWarp & iSCSI both covered by DCB vNext while FC, FCoE, Infiniband & RoCE have fallen into oblivion? We’ll see.

VEEAM Invests in Faster & More Efficient Data Protection With Backup & Replication 8


Ever more data to protect without breaking the systems or the bank

One of my major concerns today in IT, weather it is on premises or in the cloud, is the cost, time, reliability and feasibility of backup and restores. This true for most of us. Due to the environments in which I deliver my services my main issue with backups is the quantity of data. The amount of data is staggering and growth is not showing a downward trend.

The big four: CPU, Memory, Network & Storage

Over the years we have seen a vast increase in compute, memory, network and storage capabilities and pricing. CPUs are up to 18 cores per socket as I write this. DDR4 memory is here and the cost is relatively low. We have affordable 10Gbps networking to throw at the problem as well or in some case 8 to 16Gbps Fibre Channel. So when it comes to CPU, memory and network we’re pretty well served.

Storage is evolving as well and we’re getting ever bigger and, if you have the budget that is, faster storage arrays in different flavors. But it remains a challenge. First of all to get the right amount of IOPS and storage capacity at an affordable price point is a balancing act. Secondly when dealing with backups we need to manage the source IOPS & latency against the target. But that’s not all, while you might want to squeeze every last IOPS & 1ms latency out of your backup target you can’t carelessly do that to your source storage. If you do, this might constitute a Denial Of Service attack against your applications and services. Even today storage QoS is either non existent, in it’s infancy or at best limited to particular workloads on storage solutions.

The force multiplier: Backup software capabilities & approaches

If you’ve made sure the above 4 resources are not your killer bottle neck the backup software, methods algorithms and the approach used will be either your biggest problem or you best friends. You need your backup software to be:

  • Capable
  • Scalable
  • Fast
  • Configurable
  • Scale Out

There are some challenging environments out there. To deal with this backup software should be able to leverage the wealth of capabilities compute, network, memory & storage are offering to protect large amounts of data reliable and fast. This should be done smart and in an operationally supportable manner. VEEAM has been working on this for a long time and they keep getting better at this with every release and it allows for scale out designs in regards to backups targets.

VEEAM Backup & Replication 8.0

There are many improvements in v8 but a couple stand out.

image

Consistency groups (Hyper-V)

Backup jobs can execute more than one VM backup task simultaneously from the same volume snapshot with “Allow Processing of Multiple VMs with a single volume snapshot”.

image

This means you can reduce the number of snapshots significantly where in the past you needed a volume snapshot per VM. VEEAM limits the the maximum amount of VMs you can backup per snapshot to 4 when using software VSS and to eight with hardware VSS. They do this because under heavy load VSS/CSV sometimes has issues. This number can be tweaked to fit your needs (no all environments are created equally) with 2 registry values under HKLM\SOFTWARE\Veeam\Veeam Backup and Replication key:

  • MaxVmCountOnHvSoftSnapshot (DWORD)
  • MaxVmCountOnHvHardSnapshot (DWORD) registry values

Reducing the number of snapshots to be taken is good as it saves resources, speeds up things & as VSS can be finicky, not needing more than absolutely necessary is a good thing.

Backup I/O Control.

Another improvement is backup I/O Control which delivers capability to dynamically adjust the number of backup tasks based on IOPS latency. Under Options you’ll find a new Tabbed sheet, I/O Control. It contains the parallel processing option that used to be under “Advanced” tab in Veeam B&R 7.

image

The idea is to move to a more “policy driven” approach for handling the load backups can put on the storage. Until now we’d configure a number of X amounts of tasks to run against the source storage in order to keep IOPS/Latency in check. But this is very static and in a dynamic / elastic “cloud” world this isn’t very flexible nor is it feasible to keep tuned to the best number for the current workload.

I/O Control let’s you set limits on how much latency is acceptable for your data stores. Removing or adding VMs to the data store won’t invalidate your carefully set number of tasks allowed as it’s now the latency that’s used to dynamically tune that number for you.

I/O control has two settings:

 “Stop assigning new tasks to datastore at: X ms” :VEEAM looks at the latency (IOPS) before assigning a proxy (backup target) to a virtual disk or won’t launch the task until the load has dropped.  This prevents the depletion of IOPS by launching to many backups.

“Throttle I/O of existing tasks at: Y ms”: This will throttle the IO of already running  backup jobs when needed due to some application workloads in the VMs running on the source storage kicking in. The backups will be throttled so they’ll take longer but they won’t kill the performance of the applications while they are running.

These two setting allow for the dynamic and on the fly tweaking of the number of backups tasks running as well as their impact on the storage performance. Once you have determined what latency values are acceptable to you you’re done, VEEAM handles the tweaking for you. The default values seems to reflect industry best practices (sustained > 20 ms is considered problematic)

The below screenshot is for the backup job log and shows latency being monitoredclip_image002

With VEEMA B&R v8 Enterprise + You can even do this per data store, meaning you can optimize this per backup source. This recognizes that is no “one sizes fits all perfectly” and allows for differentiation. Yet it does so in a way that does not compromise on the simplicity of use that VEEAM offers. This sounds easy but from experience I know this isn’t. VEEAM manages to offer a great balance between simplicity and functionality for companies of all sizes.

Select “Configure”

image

In the “Datastore Latency Settings” you can add one, more or all data store you are protecting with VEEAM. This allows for differentiation when you have CSV that are used for SQL Server VMs versus stateless web servers of or other workloads that are not storage I/O intensive.

image

Select the datastore (in our case the CSV volumes in Hyper-V Cluster)

image

By selecting the desired datastore and clicking “Edit”  you can individually adjust the settings for that datastore.

image

Conclusion

It looks like we have some great additional capabilities in an already very good solution. I’ll be using these new capabilities in real life scenarios to see how these work out for us and optimize the backups of the virtualized environment under my care. Hardware VSS Providers, SANs, CSV’s normally need some tweaking and care to make them run well, so that’s what we’ll be doing.

Golden Nuggets: Windows Server 2012 R2 Failover Cluster CSV Placement Policy


Some enhancements only become truly evident to people when they see them in action. For many features this means something need to go wrong before they kick in. Others are more visible during normal operations. This is the case with the CSV enhancements in Windows Server 2012 R2 Failover Clustering.

One golden nugget here is the CSV placement policy (which really shines in combination with SOFS/Storage Spaces). This will spread ownership of the CSV amongst the cluster nodes to ensure a balanced distribution. In a failover cluster, one node is the “coordinator node” (owner) for a CSV. The coordinator node owns the physical disk resource that is associated with a logical unit (LUN). All I/O operations for the File System on that LUN are are through the coordinator node. In previous versions there is no automatic rebalancing of coordinator node assignment. This means that all LUNs could potentially be owned by the same node. In storage spaces & SOFS scenarios becomes even more important.

The benefits

  • It helps all nodes carry their share of the workload as it load balances the disk I/O.
  • Failovers of CSV owners are potentially quicker and more predictable/consistent as an even distribution ensures that no one node owns a disproportionate number of CSVs.
  • When losing storage access the number of CSVs that are in redirected mode is potentially less as they are evenly distributed. In an unbalanced cluster it could be for all of them in a worse case scenario.
  • When using SOFS with Storage Spaces it makes sure the Storage Spaces Ownership is distributed fairly.

When does it happen

  • Each time a node leaves or joins the cluster. This means you don’t need to intervene manually or via PowerShell to get an even distribution. This goes for both exiting nodes as when adding a new node. The new node will get a CSV assigned if there is any on surplus on one of the existing nodes.
  • The process also works when you start a failover cluster when it has shut down.

When customers see this in action (it’s most obvious when then add a node as then they are normally watching) they generally smile as the cluster does it job getting  the best possible results out of their hardware.

Copy Cluster Roles Hyper-V Cluster Migration Fails at Final Step with error Virtual Machine Configuration ‘VM01’ failed to register the virtual machine with the virtual machine service


I was working on a migration of a nice two node Windows Server 2012 Hyper-V cluster to Windows Server 2012 R2. The cluster consist out of 2 DELL R610 servers and a DELL  MD3200 shared SAS disk array for the shared storage. It runs all the virtual machines with infrastructure roles etc. It’s a Cluster In A Box like set up. This has been doing just fine for 18 months but the need for features in Windows Server 2012 R2 became too much to resists. As the hardware needs to be recuperated and we have a maintenance windows we use the copy cluster roles scenario that we have used so many times before with great success. It’s the Perform an in-place migration involving only two servers scenario documented on TechNet and as described in one of my previous blogs Migrating a Hyper-V Cluster to Windows 2012 R2 for your convenience.

Virtual Machine Configuration ‘VM01’ failed to register the virtual machine with the virtual machine service

As the source host was running on Windows Server 2012 we could have done the live migration scenario but the down time would be minimal and there is a maintenance window. So we chose this path.

So we performed a good health check. of the source cluster and made sure we had no snapshots left hanging around. Yes it’s supported now for this migration scenario but I like to have as few moving parts as possible during a migration.

It all went smooth like silk. After shutting down the VMs on the source cluster node, bringing the CSV off line (and un-presenting the LUN from the source node for good measure), we present that LUN to the target host. We brought the CSV on line and when that was completed successfully we were ready to bring the virtual machines on line and that failed …

Log Name:      Microsoft-Windows-Hyper-V-High-Availability-Admin
Source:        Microsoft-Windows-Hyper-V-High-Availability
Date:          4/02/2014 19:26:41
Event ID:      21102
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      VM01.domain.be
Description:
‘Virtual Machine Configuration VM01’ failed to register the virtual machine with the virtual machine management service.

image

image

 

Let’s dive into the other event logs. On the host the application security and system event log are squeaky clean. The Hyper-V event logs are pretty empty or clean to except for these events in the Hyper-V-VMMS Admin log.

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          4/02/2014 19:26:40
Event ID:      13000
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      VM01.domain.be
Description:
User ‘NT AUTHORITY\SYSTEM’ failed to create external configuration store at ‘C:\ClusterStorage\HyperVStorage\VM01’: The trust relationship between this workstation and the primary domain failed.. (0x800706FD)

 

image

Bingo. It must be the fact that no domain controller is available. It’s completely self contained cluster and both domain controller virtual machines are highly available and reside on the CSV. Now the CSV does come on line without a DC since Windows Server 2012 so that’s not the issue. it’s the process of registering the VMs that fails without a DC in an Active Directory environment.

Getting passed this issue

There are multiple ways to resolve this and move ahead with our cluster migration. As the environment is still fully functional on the source cluster I just removed a DC virtual machine from high availability on the cluster. I shut it down and exported it. I than copied it over to the node of the new cluster  (we’re going to nuke the source host afterwards and install W2K12R2, so we moved it to the new host where it could stay) where I put it on local storage and imported it. For this is used the “Register the virtual machine in-place option”. I did not make it high available.

image

After verifying that we could ping the DC and it was up and running well we tried the final phase of the migration again. It went as smooth as we have come to expect!

Other options would have been to host the DC virtual machine on a laptop or other server. If you could no longer get to the the DC for export & import or heck even a shared nothing migration depending on your environment can help you out of this pickle. A restore from backup would also work. But here in that 2 node all in one cluster our approach was fast and efficient.

So there you go. Tip to remember. Virtualizing domain controllers is fully supported, no worries there but you need to make sure that if you have a dependency on a DC you don’t have the DC depending on that dependency. It’s chicken an egg thing.