Copy Cluster Roles Hyper-V Cluster Migration Fails at Final Step with error 'Virtual Machine Configuration VM01 failed to register the virtual machine with the virtual machine service'


I was working on a migration of a nice two node Windows Server 2012 Hyper-V cluster to Windows Server 2012 R2. The cluster consists of 2 DELL R610 servers and a DELL MD3200 shared SAS disk array for the shared storage. It runs all the virtual machines with infrastructure roles etc. It's a Cluster In A Box like setup. This has been doing just fine for 18 months, but the need for features in Windows Server 2012 R2 became too much to resist. As the hardware needs to be recuperated and we have a maintenance window, we used the copy cluster roles scenario that has served us well many times before. It's the "Perform an in-place migration involving only two servers" scenario documented on TechNet and described, for your convenience, in one of my previous blogs: Migrating a Hyper-V Cluster to Windows 2012 R2.

Virtual Machine Configuration 'VM01' failed to register the virtual machine with the virtual machine service

As the source host was running Windows Server 2012 we could have done the live migration scenario, but the downtime would be minimal and there was a maintenance window anyway. So we chose this path.

So we performed a good health check of the source cluster and made sure we had no snapshots left hanging around. Yes, they're supported now for this migration scenario, but I like to have as few moving parts as possible during a migration.

It all went smooth as silk. After shutting down the VMs on the source cluster node, taking the CSV offline (and un-presenting the LUN from the source node for good measure), we presented that LUN to the target host. We brought the CSV online and when that completed successfully we were ready to bring the virtual machines online, and that failed …

Log Name:      Microsoft-Windows-Hyper-V-High-Availability-Admin
Source:        Microsoft-Windows-Hyper-V-High-Availability
Date:          4/02/2014 19:26:41
Event ID:      21102
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      VM01.domain.be
Description:
'Virtual Machine Configuration VM01' failed to register the virtual machine with the virtual machine management service.


Let's dive into the other event logs. On the host, the application, security and system event logs are squeaky clean. The Hyper-V event logs are pretty empty or clean too, except for these events in the Hyper-V-VMMS Admin log.

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          4/02/2014 19:26:40
Event ID:      13000
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      VM01.domain.be
Description:
User 'NT AUTHORITY\SYSTEM' failed to create external configuration store at 'C:\ClusterStorage\HyperVStorage\VM01': The trust relationship between this workstation and the primary domain failed. (0x800706FD)

 


Bingo. It must be the fact that no domain controller is available. It's a completely self-contained cluster and both domain controller virtual machines are highly available and reside on the CSV. Now, the CSV does come online without a DC since Windows Server 2012, so that's not the issue. It's the process of registering the VMs that fails without a DC in an Active Directory environment.

Getting past this issue

There are multiple ways to resolve this and move ahead with our cluster migration. As the environment was still fully functional on the source cluster I just removed a DC virtual machine from high availability on that cluster. I shut it down and exported it. I then copied it over to the node of the new cluster (we're going to nuke the source host afterwards and install W2K12R2, so we moved it to the new host where it could stay), where I put it on local storage and imported it. For this I used the "Register the virtual machine in-place" option. I did not make it highly available. A rough PowerShell outline of these steps is sketched below.
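
For those who prefer to script those steps, this is a minimal sketch using the Hyper-V PowerShell module. The VM name and paths are assumptions for illustration; in our case removing the VM from high availability was done in Failover Cluster Manager before exporting.

# On the source node: with the DC VM removed from the cluster, shut it down and export it.
Stop-VM -Name "DC01"
Export-VM -Name "DC01" -Path "D:\Export"

# Copy D:\Export\DC01 to local storage on the new host, then register it in place
# by pointing Import-VM at the exported configuration file (no copy, keeps the VM ID).
$config = Get-ChildItem "E:\LocalVMs\DC01\Virtual Machines" -Filter *.xml | Select-Object -First 1
Import-VM -Path $config.FullName

# Deliberately not made highly available: this DC must be able to start without the cluster.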


After verifying that the DC was up and running well and that we could ping it, we tried the final phase of the migration again. It went as smoothly as we have come to expect!

Other options would have been to host the DC virtual machine on a laptop or another server. If you could no longer get to the DC for an export & import, then even a shared nothing migration, depending on your environment, can help you out of this pickle. A restore from backup would also work. But here, in that two node all-in-one cluster, our approach was fast and efficient.

So there you go. Tip to remember: virtualizing domain controllers is fully supported, no worries there, but you need to make sure that if you have a dependency on a DC, the DC itself doesn't depend on that very dependency. It's a chicken and egg thing.

RDMA Over RoCE With DCB Requires Tagged Non Default VLANs


It’s DCB That Requires This

For those of you who are experimenting with the RoCE variant of RDMA for SMB Direct in Windows Server 2012 (R2), make sure you have a VLAN tag in your configuration if this is more than a simple RDMA setup over two NICs. The moment you get DCB with PFC & ETS involved you'll need non-default tagged VLANs. Do note that PFC alone is good enough; ETS is strictly speaking not a requirement, but I'd consider doing it if you can.

With Enhanced Transmission Selection (ETS) the network traffic type is classified using the priority value in the VLAN tag of the Ethernet frame. The priority value is the Priority Code Point (PCP), which is described in the IEEE 802.1Q specification and uses a 3-bit field in the VLAN tag with eight possible priority values (0 to 7).

Priority-based Flow Control (PFC) allows you to individually pause priorities of tagged traffic and helps provide lossless or "no drop" behavior for a certain priority at the receiving port. As above, each frame transmitted by a sending port is tagged with a priority value (0 to 7) in the VLAN tag. So for the traffic pause and resume functionality to work we need a VLAN tag to carry the priority value.
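
To make that concrete, this is roughly what the host side looks like with the built-in DCB cmdlets in Windows Server 2012 (R2). It's a sketch only: the NIC name "RoCE-NIC1", VLAN ID 110, priority 3 and the 50% ETS reservation are assumptions for the example, the advanced property keyword for the VLAN differs per NIC vendor, and your switch ports must be configured to match.

# Tag the RoCE NIC with a non-default VLAN so its frames carry a PCP priority value.
Set-NetAdapterAdvancedProperty -Name "RoCE-NIC1" -RegistryKeyword "VlanID" -RegistryValue 110

# Classify SMB Direct traffic into priority 3.
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

# Enable PFC for priority 3 only; leave flow control off for the other priorities.
Enable-NetQosFlowControl -Priority 3
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

# Optional but recommended: reserve bandwidth for the SMB priority with ETS.
New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS

# Apply the DCB settings to the NIC.
Enable-NetAdapterQos -Name "RoCE-NIC1"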

Does It Work Without?

But you'll tell me that, as you may be lacking a DCB capable switch for lab purposes, you used a direct cable between your two RoCE NICs. And guess what, RoCE might indeed have worked for you without a VLAN tag. You can test & get a feel for what RoCE/RDMA can do for you with just the NICs. But as there is no switch involved you're not using DCB for PFC/ETS, and without that the need for the tagged VLAN isn't there. Also see http://workinghardinit.wordpress.com/2013/05/03/smb-direct-roce-does-not-work-without-dcbpfc/.

So there you go. Design your RoCE/RDMA network based on DCB with PFC (and ETS) and not just on tests with a direct cable, or you might miss a few details that are quite important. Happy testing!

Reverting the Forest & Domain Functional Levels in Windows Server 2008 R2, 2012, 2012 R2


Since Windows Server 2008 R2, and now with Windows Server 2012 (R2), you can roll back the domain and forest functional levels under certain conditions. This was not possible with previous versions of Windows; in those cases you would have to revert via a restore from backup. Yup, pretty hefty, so raising functional levels has to be done with care.

Now, this isn't a free fire zone; there are some conditions, as listed in the table below.

[Table: domain & forest functional level rollback paths and the conditions under which they are supported]

So you cannot have advanced features like the AD Recycle Bin enabled in some conditions. Enabling the Recycle Bin is irreversible, so you cannot revert the forest functional level of your environment to a level that does not support the AD Recycle Bin once it has been enabled. Today that means you can go from Windows Server 2012 (R2) back no further than Windows Server 2008 R2.

You also need Enterprise Admin rights to do so, which I hope you'll understand. It's also a Windows PowerShell only feature (Set-ADDomainMode / Set-ADForestMode).

I used this information recently during an upgrade of a Windows Server 2008 R2 domain to Windows Server 2012 where they wanted to raise the domain and forest functional levels. They had a Forest Trust between the (now) Windows Server 2012 forest/domain and another Windows Server 2008 R2 forest/domain, and they had enabled the Recycle Bin while still at Windows 2008 R2. They wanted to know if they would have issues with the trust and, if so, whether they could revert the levels in that case.

Well, I could put their minds at ease. Look at the table. Yes, you can go back to the Windows 2008 R2 forest functional level, as that's a version that also supports the AD Recycle Bin, so it doesn't matter that it's enabled. And no, the forest trust capability is not affected by the forest functional level in this case, as all you need there is a minimum level of Windows Server 2003. Forest trusts are available from the Windows Server 2003 forest functional level onwards; at the Windows 2000 forest functional level they are not. That means you can create them between forests at different functional levels as long as none of them is lower than Windows Server 2003. In this case the lowest is Windows Server 2008 R2, so again, not an issue.

How? Very simple:

Set-ADDomainMode -Identity mydomain.com -DomainMode Windows2008R2Domain

Set-ADForestMode -Identity mydomain.com -ForestMode Windows2008R2Forest
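
If you want to check where you stand before and after reverting, a quick sketch like this (run on a domain joined machine with the ActiveDirectory module loaded) shows the current levels and whether the Recycle Bin optional feature is enabled:

# Current domain and forest functional levels
(Get-ADDomain).DomainMode
(Get-ADForest).ForestMode

# Is the AD Recycle Bin enabled? EnabledScopes is non-empty when it is.
Get-ADOptionalFeature -Filter 'Name -like "Recycle Bin Feature"' | Select-Object Name, EnabledScopes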

Take a look at these TechNet resources, Understanding Active Directory Domain Services (AD DS) Functional Levels and Set-ADDomainMode, for more information.

Windows Server 2012 R2 Cluster Reset Recent Events With PowerShell


I blogged before about the fact that since Windows Server 2012 we have the ability to reset the recent events shown, so that the state of the cluster is squeaky clean with no warnings or errors. You can read up on this here: Windows Server 2012 Cluster Reset Recent Events Feature.

You can also do this in PowerShell like in the example below:

#Connect to cluster & get current RecentEventsResetTime value
$MyCluster = Get-Cluster -Name "W2K12R2RTM"
$MyCluster.RecentEventsResetTime

#Reset recent events
$MyCluster.RecentEventsResetTime = get-date
$MyCluster.RecentEventsResetTime

As you may notice, the RecentEventsResetTime is displayed in UTC when read from the cluster after connecting to it. Right after you set it, it displays the time respecting the time zone you're in, right until you connect to the cluster again. We demonstrate this in the 2 screenshots below (I'm at GMT+1).

[Screenshots: RecentEventsResetTime shown in local time right after setting it, and in UTC after reconnecting to the cluster]

This comes in handy when writing test, comparison & demo scripts. Often you do things with the network that cause connectivity to be lost when the NIC gets reset (disabled/enabled) and such. Also, when something fails as part of the demo or test scripts, it's nice to start the rerun or the next part of the demo/test with a clean cluster GUI when you're showcasing stuff. Unfortunately an already open GUI doesn't refresh this setting if the reset is not done in the GUI, so you need to open a new one. For scripting you don't have this issue. EDIT: In Windows Server 2012 R2 you can use $MyCluster.Update() to reflect the new value of RecentEventsResetTime in UTC without having to reconnect to the cluster. In Windows Server 2012 this Update method isn't available, but it seems to happen automatically.
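
Put together for Windows Server 2012 R2, that looks like this minimal sketch (cluster name is an assumption; requires the FailoverClusters module):

# Reset the recent events marker and refresh the cached cluster object so the new
# RecentEventsResetTime shows up in UTC without reconnecting (Windows Server 2012 R2).
$MyCluster = Get-Cluster -Name "W2K12R2RTM"
$MyCluster.RecentEventsResetTime = Get-Date
$MyCluster.Update()
$MyCluster.RecentEventsResetTime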

Failed Live Migrations with Event ID 21502 Planned virtual machine creation failed for virtual machine 'VM Name': An existing connection was forcibly closed by the remote host. (0x80072746) Caused By Wrong Jumbo Frame Settings


OK, so Live Migration fails and you get the following error in the System event log with event ID 21502:


Planned virtual machine creation failed for virtual machine 'DidierTest01': An existing connection was forcibly closed by the remote host. (0x80072746). (Virtual Machine ID 41EF2DB-0C0A-12FE-25CB-C3330D937F27).

Failed to receive data for a Virtual Machine migration: An existing connection was forcibly closed by the remote host. (0x80072746).

There are some threads on the TechNet forums on this, like here http://social.technet.microsoft.com/Forums/en-US/805466e8-f874-4851-953f-59cdbd4f3d9f/windows-2012-hyperv-live-migration-failed-with-an-existing-connection-was-forcibly-closed-by-the, and some blog posts pointing to TCP Chimney Offload settings causing this, but those causes stem back to the Windows Server 2003 / 2008 era.

In the Hyper-V event log Microsoft-Windows-Hyper-V-VMMS-Admin you also see a series of entries related to the failed live migration pointing to the same issue:

  
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:15 AM
Event ID:      20413
Task Category: None
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
The Virtual Machine Management service initiated the live migration of virtual machine 'DidierTest01' to destination host 'SRV2' (VMID 41EF2DB-0C0A-12FE-25CB-C3330D937F27).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      22038
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
Failed to send data for a Virtual Machine migration: An existing connection was forcibly closed by the remote host. (0x80072746).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      21018
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
Planned virtual machine creation failed for virtual machine 'DidierTest01': An existing connection was forcibly closed by the remote host. (0x80072746). (Virtual Machine ID 41EF2DB-0C0A-12FE-25CB-C3330D937F27).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      22040
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      SRV1.BLOG.COM
Description:
Failed to receive data for a Virtual Machine migration: An existing connection was forcibly closed by the remote host. (0x80072746).
 
Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          10/8/2013 10:06:26 AM
Event ID:      21024
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      srv1.blog.com
Description:
Virtual machine migration operation for 'DidierTest01' failed at migration source 'SRV1'. (Virtual machine ID 41EF2DB-0C0A-12FE-25CB-C3330D937F27)

There is something wrong with the network, and if all checks out on your cluster & hosts it's time to look beyond that. Well, as it turns out, it was the Jumbo Frame setting on the CSV and LM NICs.

Those servers had been connected to a couple of DELL Force10 S4810 switches. These can handle an MTU size up to 12000, and that's how they were configured. The Mellanox NICs allow for MTU sizes up to 9614 in their Jumbo Frame property. Now, super sized jumbo frames are all cool until you attach the network cables to another switch, like a PowerConnect 8132, which has a max MTU size of 9216. At that moment your network won't do what it's supposed to and you see errors like those above. If you test via an SMB share things seem OK & standard pings don't show the issue. But some ping tests with different MTU sizes & the -f (do not fragment) switch will unmask the issue soon. Setting the Jumbo Frame size on the CSV & LM NICs to 9014 resolved the issue.
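
A simple way to unmask such a mismatch is a do-not-fragment ping between the hosts. A 9014 byte jumbo frame setting corresponds to an IP MTU of 9000, which leaves 8972 bytes for the ICMP payload (9000 minus 28 bytes of IP/ICMP headers). The address below is just an example for the CSV/LM network:

# Should succeed when jumbo frames work end to end; if a switch in the path can't
# handle the frame size, this ping simply times out.
ping 10.10.180.2 -f -l 8972

# A standard ping will look fine even when jumbo frames are broken, which is why it hides the issue.
ping 10.10.180.2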

Now, if everything matches up on the server side but not on the switches, you'll also get an event ID 21502 but with a different error message:

Event ID 21502: The Virtual Machine Management Service failed to establish a connection for a Virtual machine migration with host XXXX: A connection attempt failed because the connected party did not properly respond after a period of time, or the established connection failed because connected host has failed to respond (0x8007274C)


This is the same message you'll get for a known cause of shared nothing live migration failing, as described in this blog post by Microsoft: Shared Nothing Migration fails (0x8007274C).

So there you go. Keep an eye on those Jumbo Frame settings, especially in a mixed switch environment. They all have their own capabilities, rules & peculiarities. Make sure to test end to end and you'll be just fine.

Live Migration Can Benefit From Jumbo Frames


Does live migration benefit from Jumbo frames? This question always comes back, so I'll just blog it here again even if I have mentioned it as part of other blog posts. Yes, it does! How do I know? Because I've tested and used it with Windows Server 2008 R2, 2012 & 2012 R2. Why? Because I have a couple of mantras:

  • Assumption are the mother of all fuckups
  • Assume makes an ASS out of U and ME
  • Trust but verify

What can I say? I have been doing 10Gbps for Live Migration with Hyper-V for a while now. And let me tell you my experience with an otherwise completely optimized server (mainly BIOS performance settings): it will get you up to 20% more bandwidth use.

And thanks to Windows Server 2012 R2 supporting SMB for live migration we can very nicely visualize this with 2*10Gbps NICs, not teamed, used by live migration leveraging SMB Multichannel. On one of the 10Gbps NICs we enable Jumbo Frames, on the other one we do not. We then live migrate a large memory VM back and forth. Now you tell me which one is which.

[Screenshot: live migration bandwidth, one 10Gbps NIC with jumbo frames enabled and one without]

Now enable Jumbo frames on both 10Gbps NICs and again we live migrate the large memory VM back and forth. More bandwidth used, faster live migration.

[Screenshot: live migration bandwidth with jumbo frames enabled on both 10Gbps NICs]

I can't make it any more clear. No, jumbo frames will not kill your performance unless you have it messed up end to end. Don't worry if you have a cheaper switch where you can only enable it switch wide instead of per port. The switch is a pass-through. So unless you set messed-up sizes on the sending/receiving hosts that the switch in between can't handle, it will work, even without jumbo frames and without heaven falling down on your head. Configure it correctly, test it, and you'll see.
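
Host side this is a one-liner per NIC. A minimal sketch, assuming a live migration NIC called "LM1" and a 9014 byte setting; check what your NICs and switches actually support:

# "*JumboPacket" is the standardized advanced property keyword; valid values are vendor dependent.
Set-NetAdapterAdvancedProperty -Name "LM1" -RegistryKeyword "*JumboPacket" -RegistryValue 9014
Get-NetAdapterAdvancedProperty -Name "LM1" -RegistryKeyword "*JumboPacket"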

Upgrading Firmware Of Mellanox RoCE Cards for Final Windows Server 2012 RDMA Testing


Upgrading Mellanox Firmware

As we are preparing to roll out Windows Server 2012 R2 we are also updating the firmware of the Mellanox cards we have. At the moment of writing, the final driver & firmware for Windows Server 2012 R2 isn't out yet, but let's take a look at the process so you're ready for prime time. If you need the latest public Mellanox driver for Windows Server 2012 R2, it's here. Installing the driver is a straightforward process (upgrading servers with Mellanox drivers already in place has been an issue, however).

Mellanox provides good documentation on their site (http://www.mellanox.com/page/firmware_HCA_FW_identification and http://www.mellanox.com/page/firmware_NIC_FW_update) but for Mellanox newbies & many Windows server admins the process might be a bit more hands on than the single installer they are used to.

What do you need?

The Windows Mellanox Firmware Tools (WinMFT). This gives you all the tools you need to get the job done.

It helps us with two things: finding out the Device ID and, using that, determining the PSID (Board ID), which tells us what firmware we need to download.

The WinMFT tools are also used to burn the firmware.

Practical Tip 1: I have found that it pays to launch the installers Mellanox provides from an elevated command prompt, as otherwise UAC might trip up the clean finalization of a launched MSI. The driver installer is more sensitive to this than the firmware installer.

Practical Tip 2: If you have OEM Mellanox cards from DELL/HP/IBM … and they haven't released the new firmware yet, you can always burn your own. Please find the instructions here.

Walkthrough

I have a Windows Server 2012 R2 RTM host running and I already installed the latest beta drivers I could find on the Mellanox site. But I'm a firmware version behind. So let's fix this.


I put all the files I need in one handy spot


I launch an elevated command prompt


And from there I launch the WinMFT installer.


Just follow the instructions.


Now you're ready to determine the Device ID of your Mellanox card. From that same elevated command prompt, navigate to C:\Program Files\Mellanox\WinMFT and run mst status.

[Screenshot: mst status output with the Device ID marked in green]

Grab the Device ID (marked in green) and execute the following command:

flint -d /dev/mst/mt4099_pci_cr0 query

[Screenshot: flint query output with the Board ID (PSID) marked in yellow]

The Board ID (marked in yellow) is actually the PSID (more information here) and will tell you what firmware to download from the Mellanox site. By the way, note that this also shows you the current firmware version.

You download the firmware from http://www.mellanox.com/page/firmware_download by selecting the card you have. In my case that's a ConnectX®-3 EN PCI-Ex Network Interface Card (Ethernet Only NIC) and I use the Board ID to find my download.


All that’s left to do is burn the firmware image by executing the following command:

flint -d /dev/mst/mt4099_pci_cr0  -i C:\SysAdmin\Mellanox\Firmware\fw-ConnectX3-rel-2_30_3000-MCX312A-XCB_A2-A6-3.4.142_EN.bin burn

This requires you to confirm by typing in "y" and you can follow the process via a counter.

When done, you'll need to reboot the server in order for the new firmware to actually be used. You can verify success by running the query command again or by checking the Information tab of your card's configuration settings. As you can see, we're running 2.30.3000 now.
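
That verification is just the same flint query as before; after the reboot the FW Version line should show the new release:

flint -d /dev/mst/mt4099_pci_cr0 query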

[Screenshot: the card's Information tab showing firmware version 2.30.3000]

So here you go. You might need to do this again after October 18th 2013, but you're ready for now and all the testing you do is on the latest version of both the driver and the firmware. Happy testing!

ODX Speeds Up VHDX Creation Times On Windows Server 2012 (R2)


Some technologies you just need to see in action instead of reading about them. I have posted a video on Vimeo that shows ODX in action on Windows Server 2012 R2 and a DELL Compellent SAN running Storage Center 6.3.10 firmware that supports UNMAP & ODX. Watch the video here or on Vimeo itself for a better experience. It's a rerun of the demo scripts used in my TechNet Belux Live Meeting of this week.

We demonstrate the amazing speeds at which we can create VHDX files on both a traditional clustered disk and a Cluster Shared Volume. If you have ever tried to create a lot of fixed VHD/VHDX files, especially larger ones, then you really need to check out ODX and its potential. If you have a SAN, or are thinking about acquiring one, make sure you get this feature and be sure that it works as advertised.
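
If you want to reproduce the effect on your own ODX capable SAN, something as simple as the sketch below does the trick (path and size are assumptions). With ODX the array handles the zeroing of the fixed VHDX and creation takes seconds; without it the host writes all those zeros itself.

# Time the creation of a large fixed VHDX on an ODX capable CSV (illustrative path and size).
Measure-Command {
    New-VHD -Path "C:\ClusterStorage\Volume1\Test\Fixed500GB.vhdx" -SizeBytes 500GB -Fixed
}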

I hope you enjoy it and that it inspires you to look at where you can leverage this technology in your own environments.

Mind the UNMAP Impact On Performance In Certain Scenarios


The Problem

Recently we've been troubleshooting some weird SQL Server to file backup issues. They started failing on the clock at 06:00 AM. We checked the NICs, the switches, the drivers, the LUNs, HBAs, … but all was well. We considered overstressed buffers or spanning tree issues as the root cause, but the clockwork regularity of it all was weird. We tried playing with some timeout parameters, but with little to no avail. Until the moment it hit me: the file deletions that clean up the old backups! We had enabled UNMAP on the SAN recently.

Take a look at the screenshot below and note the deletion times underlined in red. That's with UNMAP enabled. Above that is with UNMAP disabled. The backup jobs failed while waiting for the deletion process to complete.

[Screenshot: file deletion times with UNMAP enabled (underlined in red) versus with UNMAP disabled]

This is a non-issue if your backup target is running something prior to Windows Server 2012, where UNMAP isn't used; from Windows Server 2012 on, delete notifications are enabled by default. I know about the potential performance impact of UNMAP when deleting or moving larger files due to the space reclamation kicking in. This is described here, Plan and Deploy Thin Provisioning, under the heading "Consider space reclamation and potential performance impact". But as I'm quite used to talking about many, many terabytes of data I kind of forgot to think of 500 to 600GB of files as "big". But it seemed a likely suspect, so we tested certain scenarios and bingo!

Solutions

  1. Disable the file-delete notification that triggers real-time space reclamation. Set the registry value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem\DisableDeleteNotification to 1 (see the sketch after this list).

    Note that this setting is host wide, so it applies to all LUNs. Perhaps that server has many other roles or needs that could benefit from UNMAP; if not, this is not an issue. It is however very efficient in avoiding the problem. You can still use the Defragment and Optimize Drives tool to perform space reclamation on demand or on a scheduled basis.

  2. Create LUNs that will have high delete/write deltas in a short time frame as fully provisioned LUNs (aka thick LUNs). As you do this per LUN and not on the host, it allows for more fine-grained action than disabling UNMAP. It makes no sense to have UNMAP reclaim the free space that deleting data created when you'll just be filling that space up again in the next 24 hours, in an endless cycle. Backup targets are a perfect example of this. This avoids the entire UNMAP cycle, which you won't mind as it doesn't make much sense here, and it fixes your issue. The drawback is that you can't do this for existing volumes, so it has some overhead & downtime involved depending on the SAN solution you use. It also means that you have to convince your storage admins to give you fully provisioned LUNs, which might or might not be easy depending on how things are organized.
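
A minimal sketch of option 1, run in an elevated prompt on the backup target host; fsutil exposes the same DisableDeleteNotification value as the registry path above:

# Check whether delete notifications (real-time UNMAP) are enabled (0 = enabled, 1 = disabled).
fsutil behavior query DisableDeleteNotify

# Disable file-delete notification host wide; deletes no longer trigger real-time space reclamation.
fsutil behavior set DisableDeleteNotify 1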

Conclusion

UNMAP has many benefits in both the physical and the virtual layer. As with all technologies you have to understand its capabilities, requirements, benefits and drawbacks. Without this you might run into trouble.

Adventures In RDMA – The RoCE Path Over DCB To Windows Server 2012 R2 SMB 3.0 Glory


Prologue

On a gloomy day, it was dark, grey and cold, we gave battle with RoCE & DCB (PFC/ETS). The fight was a long one, the battlefield uncharted and we had only our veteran attitude towards adversity to guide us through the switch configurations. It seemed that no man had gone that far to the edges of the Windows Server 2012 empire. And when it came to RoCE & DCB meets Didier, I needed to show it that it had been conquered, and I was reminded of a quote in Gladiator:

Quintus: People RoCE/DCB configs should know when they are conquered.
Maximus: Would you, Quintus? Would I?


After many, many lonely & unsuccessful hours dealing with Performance monitor, switch configurations, reloads, firmware, drivers & Windows we got results:

… “it’s working” … “holey s* look at those numbers” …

On that dark day in a scarcely illuminated room, in the faint glare of the monitors, even the CLI of the switches in PuTTY felt like a grim cold place. But all that changed as the impressive results brightened up the day and made all efforts seem worthwhile. "Didier Victor" I thought as I looked away from the screen, "Once more."

But it has been a hard won victory. And should you fight this battle? Well, let's discuss this a bit now that we've got your attention. RDMA is a learning process for many of us and neither Infiniband, iWARP nor RoCE is the one that needs to win at this game. It's you, via the knowledge you'll gain working with RDMA technologies.

SMB Direct or SMB over RDMA comes in flavors

Infiniband (Mellanox)

That's been here for a while. It has a high cost associated with it (depends on where you come from) and also a psychological barrier to it. Try discussing buying 10Gbps versus Infiniband with semi-technical managerial types. You'll know what I mean.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-2/ConnectX-3 using InfiniBand – Step by Step

iWARP (Chelsio / Intel)

RDMA but it’s TCP/IP offloaded to the card. It can leverage DCB but doesn’t require it.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Chelsio T4 cards using iWARP – Step by Step

RoCE (Mellanox)

"Infiniband over Ethernet" > so you "NEED" (no, not a real hard requirement) DCB with PFC/ETS (DCBx can be handy) for it to work best. No need for Congestion Notification as that's for TCP/IP, but it could be nice with iWARP (see above). Do note that you'll need to configure your switches for DCB and that's highly dependent on the vendor & even the type of switch.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-3 using 10GbE/40GbE RoCE – Step by Step

Here's an older overview of the RDMA flavors' pros & cons:

Please see Jose Barreto’s excellent work on explaining SMB 3.0 over RDMA in his presentations at SNIA, TechEd and on his blog.

While I have heard of two people in my network working with Infiniband for SMB Direct and Windows Server 2012 (R2), most of us are doing 10Gbps. Pricing for Infiniband has a bad reputation. Not because Infiniband is super costly compared to 10/40Gbps (I'm told most people who ask for quotes are positively surprised) but when you can't afford a Porsche you're not shopping for a Ferrari either. Especially not when a mid-size sedan will serve all of your needs above and beyond the call of duty. On top of that you might have bought all that nice "converged network ready" 10Gbps gear some years ago. Some of us may be working towards 40Gbps, but most are 10Gbps shops; my 40Gbps is "limited" to the interlinks & uplinks. Meaning that we either go for iWARP or RoCE.

RoCE or iWARP

Which one is best of those two? Well, the line is drawn between vendors. RoCE today equals Mellanox (yes, the Infiniband vendor; RoCE is sometimes called "Infiniband on layer 4 over Ethernet layer 2") and iWARP means Chelsio or Intel (their cards look a bit long in the tooth however).

You'll find comparisons by both vendors claiming superiority for varied reasons. Here's the Mellanox side http://www.mellanox.com/pdf/whitepapers/WP_RoCE_vs_iWARP.pdf & here's Chelsio's take http://www.chelsio.com/roce/ & http://www.moderntech.com.hk/sites/default/files/whitepaper/V09_iWAR_Summary_WP_0.pdf. It's good to look at your needs and map them. But I cannot declare a winner. I did notice that at least one vendor of SOFS/CiB solutions uses iWARP. Is that a statement? And if so, about what? Price? Ease of use? Performance/cost?

What I do find is that Chelsio is really hacking into RoCE, as you can see here http://www.chelsio.com/wp-content/uploads/2011/05/RoCE-The-Grand-Experiment1.pdf, http://www.chelsio.com/roce-whitepaper/, http://www.chelsio.com/wp-content/uploads/2011/05/RoCE-FAQ-1204121.pdf. So that begs the question: are they right, or are they scared of RoCE, as the Infiniband boys are out to eat their lunch?

My take on this for now

iWARP is way easier to get started with, that's for sure. RoCE is firmware sensitive (NIC, switches) and driver sensitive (NIC). Configuring your switches (DCB) is usually followed by rebooting that switch, so you might not do that so easily in production, and depending on where in the stack those switches live you really need Force10 VLT, Cisco vPC, Arista MLAG or independent redundant switches to get away with it. RoCE loves green field. Stacking, I hear you say? I don't like stacking in that spot of the stack, as firmware updates will make you suffer through a single point of failure.

Disclaimer: RoCE in itself does not DEMAND/REQUIRE DCB, but the consensus is that it will work better, especially under heavy load. Whether SMB Direct over RoCE requires DCB is another question. For all practical purposes I'm working from the prerequisite that it does for a production environment. But as you can do RoCE RDMA between two NICs with no DCB switch in between, this indicates that the hard requirement for DCB is not there. Mind you, not using DCB might not be smart in regards to QoS & error handling (no TCP/IP goodness handling this for you). But I'm no expert on this subject. Paul Grun however is, and he's involved with RoCE at https://www.openfabrics.org/component/search/?searchword=Paul+grun&ordering=&searchphrase=all. They tend to know their stuff. Read some of the comments below this article and you'll learn a lot: http://www.hpcwire.com/hpcwire/2010-04-22/roce_an_ethernet-infiniband_love_story.html. But PFC isn't Walhalla either and some claim you can just forget about it and build non-blocking networks. I guess you could, if your pockets are deep enough. And you might go a very long way without the need for RDMA. Many do … and when you talk to some network people & vendors they can't agree either, as everyone is on the same learning curve but from a different perspective. There is no one size fits all & it all depends.

iWARP doesn't require DCB, so you can get away with cheaper switches. Or not-so-cheap switches that don't support DCB (choose wisely). So "cheaper switches" is probably true on the low end. But even very economically priced switches from DELL have good DCB support, while some other, more expensive, vendors don't.

DCB is uncharted terrain for SMB Direct purposes & new to many of us. So if you want to do RDMA the easy way … go iWARP. As said, the use of DCB for PFC/ETS is not mandatory in that case; you'll get great results and it's easy. Mind you, you'll still be dabbling with DCB if you want to do lossless magic in the switches. Why, you say? Well, that "converged network" story makes it kind of interesting to do so, and PFC, DCBx/TLV are generic and can be leveraged for other things than iSCSI or FCoE. And for all practical purposes SMB 3.0 with SMB Direct is a storage protocol since Windows Server 2012 made it so (CSV). Or do you do DCB for iSCSI/FCoE & iWARP for SMB Direct? After all, there are only 2 lossless queues to be had. But hey, how many do you need? Choices, choices and no vast pool of experienced practitioners yet.

iWARP routes; it's not bound by a single Ethernet broadcast domain. That could be useful info depending on your environment & needs. I'll note that I leverage RDMA for east-west traffic, not north-south, and as such this is not an issue for me. The time that I do "Shared Nothing Live Migration" from on premise to the cloud has not arrived yet.

The Mellanox cards in my neck of the woods were 35% cheaper than the Chelsio ones (SFP+).

What about scalability? "iWARP doesn't scale that well" is stated left and right, but I think that might often be based on older information. Chelsio makes a strong case for iWARP scalability, especially when it comes to long distances, multiple hops & routing.

Again, your mileage may vary. But for "the smaller environments" that want to leverage RDMA with SMB 3.0, I'd say that iWARP is the easiest path to go & will do just fine. Now if you're already into lossless Ethernet for iSCSI or working with FCoE, you might have all the hardware you need & the experience to deal with DCB. The latter might not always be true however. Most people have lossless Ethernet for iSCSI or FCoE set up by the vendor or by consultants who use well defined step by step guides. These do not exist for the RoCE variant of SMB 3.0 over RDMA.

The case for RoCE can be made as well. Some claim that a high volume of connections consumes memory when using iWARP, and that TCP's flow and reliability controls are less suited for large-scale datacenters & cloud deployments due to performance issues. And where iWARP does not know multicast, RoCE does, and that could be important to you.

So why did I or still do RoCE?

So why did I walk the walk? Basically because just talking the talk isn't enough. We considered it an investment in our education. DCB is not going away (the abstraction isn't there yet and won't be for a while) and we need to gain knowledge of it to both handle it and make informed decisions. By the way, once you go lossless you might leverage DCB/PFC with iWARP as well, just like you do for iSCSI to make it lossless (leveraging DCBx/TLV). Keep in mind that DCB is key in converged networking and as such deserves your attention. That's why I chose not to avoid it but gave battle. DCB is all over the place when it comes to converged networking (iSCSI, FCoE), so we need to learn the good, the bad and the ugly. Until that day that, perhaps, the hardware stack is so good, powerful & has so much bandwidth that TCP/IP never needs its built-in protection for packet loss. Hmmmmmm, I remember people saying that about 10Gbps, but then they wanted to send everything over 2*10Gbps pipes and it became an issue again, didn't it?

It's early days yet, but you have to give credit to Microsoft for getting RDMA/DCB on the radar screen of more of the world's virtualization & storage admins than ever before. It's not a well established segment yet and it will be interesting to see how this all turns out. I do know that now that I've figured out a thing or two about RoCE, I won't be intimidated & won't make choices out of fear. And do remember that if you have plenty of idle CPU cycles & 10Gbps you might not even need RDMA. The value for me and my employers is the knowledge gained. DCB has its role to play but we'll leverage iWARP or RoCE without a preference. Today you have two choices: RoCE is the newer one while iWARP has been around longer, and both have avid proponents it seems.

I know one thing. If you need or want RDMA in any existing 10Gbps environment with minimal effort & no risk to the existing switch infrastructure, you'll use iWARP, it seems.

Epilogue

You sit there staring at a truckload of VMs, with 120GB of memory assigned in total, being evacuated in +/- 70 seconds while doing a Shared Nothing Live Migration between the same hosts, without consuming CPU load … and with DCB for SMB 3.0 running on your switches … Yes!
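
If you want to convince yourself SMB Direct is really carrying that live migration traffic, a quick check on the hosts looks something like this minimal sketch (assumes RDMA capable NICs and an active SMB connection between the hosts):

# Are the NICs RDMA enabled and operational?
Get-NetAdapterRdma

# Do the SMB multichannel connections to the other host report RDMA capability?
Get-SmbMultichannelConnection

# Live RDMA traffic shows up under the "RDMA Activity" performance counters on RDMA capable NICs.
Get-Counter "\RDMA Activity(*)\RDMA Inbound Bytes/sec","\RDMA Activity(*)\RDMA Outbound Bytes/sec"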


Remember, "What we do in life, echoes in eternity." You might think now that I'm a bit nutty, but I assure you that in my quest to find someone with hands-on experience configuring DCB on switches for SMB Direct with RoCE, I had to turn to myself, as no one seems to have done it. I'll be sharing more info on our setup and configurations in the future. Once you wrap your head around the concepts, you understand why things are done and how. Therein lies the value for me.