SMB Direct With RoCE in a Mixed Switches Environment

I’ve been setting up a number of Hyper-V clusters with  Mellanox ConnectX3 Pro dual port 10Gbps Ethernet cards. These Mellanox cards provide a nice amount of queues (128) for DVMQ and also give us RDMA/SMB Direct capabilities for CSV & live migration traffic.

Mixed Switches Environments

Now RoCE and DCB is a learning curve for all of us and not for the faint of heart. DCB configuration is non trivial, certainly not across multiple hops and different switches. Some say it’s to be avoided or can’t be done.

You can only get away with a single pair of (uniform) switches in smaller deployments. On top of that I’m seeing more and more different types of switches being used to optimize value, so it’s not just a lab exercise to do this. Combine this with the fact that DCB is an unavoidable technology in networking, unless it get’s replaced with something better and easier, and you might as well try and learn. So I did.

Well right now I’m successfully seeing RoCE traffic going across cluster nodes spread over different racks in different rows at excellent speeds. The core switches are DELL Force10 S4810 and the rack switches are PowerConnect 8132Fs. By borrowing an approach from spine/leave designs this setup delivers bandwidth where they need it a a price point they can afford. They don’t need more expensive switches for the rack or the core as these do support DCB and give the port count needed at the best price point.  This isn’t supposed to be the top in non blocking network design. Nope but what’s available & affordable today in you hands is better than perfection tomorrow. On top of that this is a functional learning experience for all involved.

We see some pause frames being sent once in a while and this doesn’t impact speed that very much. It does guarantee lossless traffic which is what we need for RoCE. When we live migrate 300GB worth of memory across the nodes in the different racks we get great results. It varies a bit depending on the load the switches & switch ports are under but that’s to be expected.

Now tests have shown us that we can live migrate just as fast with non RDMA 10Gbps as we can with RDMA leveraging “only” Multichannel. So why even bother? The name of the game low latency and preserving CPU cycles for SQL Server or storage traffic over SMB3. Why? We can just buy more CPUs/Cores. Great, easy & fast right? But then with SQL licensing comes into play and it becomes very expensive. Also storage scenarios under heavy load are not where you want to drop packets.

Will this matter in your environment? Great question! It depends on your environment. Sometimes RDMA is needed/warranted, sometimes it isn’t. But the Mellanox cards are price competitive and why not test and learn right? That’s time well spent and prepares you for the future.

But what if it goes wrong … ah well if the nodes fail to connect over RDAM you still have Multichannel and if the DCB stuff turns out not to be what you need or can handle, turn it of and you’ll be good.

RoCE stuff to test: Routing

Some claim it can’t be done reliably. But hey they said that for non uniform switch environments too Winking smile. So will it all fall apart and will we need to standardize on iWarp in the future?  Maybe, but isn’t DCB the technology used for lossless, high performance environments (FCoE but also iSCSI) so why would not iWarp not need it. Sure it works without it quite well. So does iSCSI right, up to a point? I see these comments a lot more form virtualization admins that have a hard time doing DCB (I’m one so I do sympathize) than I see it from hard core network engineers. As I have RoCE cards and they have become routable now with the latest firmware and drivers I’d love to try and see if I can make RoCE v2 or Routable RoCE work over different types of switches but unless some one is going to sponsor the hardware I can’t even start doing that. Anyway, lossless is the name of the game whether it’s iWarp or RoCE. Who know what we’ll be doing in 5 years? 100Gbps iWarp & iSCSI both covered by DCB vNext while FC, FCoE, Infiniband & RoCE have fallen into oblivion? We’ll see.

More Tips On Dealing With Removing Short File Names When Migrating To a SMB3 Transparent Failover File Server Cluster

You might have read my blog posts on the capabilities and the process of migrating to a Transparent Failover File Server. If not, here they are:

These are a good read with some advice from real world experience and in this post I’ll offer some more tips. I’ve discussed the need to disable and get rid of short file names in my blog and offered other tips to prepare for your migration and get your file share LUNs in tip top, modern shape. But what if you run into short file name issues where you can seem to get rid of them?

Well here’s 3 more things to check:

1) Get rid of the shadow copies used for Previous Versions

The reason you’d better get rid of them is that they can also contain short files names & way to long path or file names. We don’t want them to ruin the party so we remove them all by disabling shadow copies on the LUNs to be copied. We can enable them again once the LUN is up and running in the new file cluster.

2) The logs indicate there are short file names you don’t have access to

If the NFTS permissions on the folder & file structure are OK you should not have to much problems bar some files being locked by being in use. Rerunning the fsutil command prior to migrating with the server service stopped will prevent any connectivity and use of file shares by people ignoring the request to log of or shut down their clients or automated jobs that otherwise keep accessing them.

But you might still get some indications in the log file(s) that state you can remove certain file names.


There is the good old trick of running your command under SYSTEM. That those the job! That helps get rid of short file name instances of folders where you normally don’t get access to. If system has rights you’ll be fine whether it’s a system folder or not.To do this the Sysinternals tools come in handy once again. You can launch a command prompt running under the NT AUTHORITY\SYSTEM account using psexec.exe by running the following from a elevated command prompt:

psexec -i -s cmd.exe or psexec  -s cmd.exe


The-s switch runs the remote process in the System account. Psexec temporarily installs a service "psexec running psexesvc.exe" on the remote computer (or locally if that’s what you doing) which is removed when the app or process that’s running is closed. It’s obvious now I hope why you need an elevated command prompt to run this command.

Now should you do this by default? Nope. Just when you need to and as always have a realistic backup plan, a way to recover when things go south.

3) Anti virus sometime prevents the removal of short file names

Disable Anti-Virus, sometimes it holds a temporary entry in the registry for the file involved. At least that’s what I’ve seen as a transient issue in some of the large number of logs I gathered. Yeah, I ran a lot of fsutil against large NTFS volumes. What can I say. Due diligence pays off!

4) Run ChkDsk

Just make sure the volume is healthy and no repairs are needed. If your migrating from and older file server there might be outstanding issues and a check disk on volumes with lot’s of files take time. Some of the ones I’ve dealt with had more that 2 million files on a 2TB LUN and it it can take 24 hours. Fun when you have 10 LUNs :-/

Concluding My Summit, Conference & Community Engagements for 2014

After Redmond (MVP Global Summit 2014), which was a great experience I flew to Berlin to attend and speak at the Microsoft Technical Summit 2014 on “What’s New In Windows Server 2012 R2 Clustering”. Germany has a seriously engaged ITPro & Dev scene, that’s for sure, and the session room was packed! Afterwards some interesting questions popped up in the hallways. That’s great as question really make us think about technologies and solutions from other view points and perspectives.


After Berlin I was off to Experts Live 2014 in Ede (The Netherlands) where I presented on “The capable & Scalable Cloud OS”. The talk went well and I had a great crowd attending with whom I had some great chats after the session.


That concluded the third leg of my international road tour where I invest in myself, the community & the people I work with. Never ever stop learning Smile. Normally this also concludes my traveling schedule for 2014 unless I’m needed/requested somewhere to help out. Being an MVP is about sharing in the community. The only way to prosper is to share the knowledge, experience and the wealth. It provides for a healthy ecosystem from which we all reap the benefits. This should be promoted and facilitated. There is too much expertise & knowledge not being leveraged due to the fact it’s economically unfeasible, and that’s a waste when people are screaming for IT skills. In a war for talent, any waste is surely very counter productive?

Golden Nuggets: Windows Server 2012 R2 Failover Cluster CSV Placement Policy

Some enhancements only become truly evident to people when they see them in action. For many features this means something need to go wrong before they kick in. Others are more visible during normal operations. This is the case with the CSV enhancements in Windows Server 2012 R2 Failover Clustering.

One golden nugget here is the CSV placement policy (which really shines in combination with SOFS/Storage Spaces). This will spread ownership of the CSV amongst the cluster nodes to ensure a balanced distribution. In a failover cluster, one node is the “coordinator node” (owner) for a CSV. The coordinator node owns the physical disk resource that is associated with a logical unit (LUN). All I/O operations for the File System on that LUN are are through the coordinator node. In previous versions there is no automatic rebalancing of coordinator node assignment. This means that all LUNs could potentially be owned by the same node. In storage spaces & SOFS scenarios becomes even more important.

The benefits

  • It helps all nodes carry their share of the workload as it load balances the disk I/O.
  • Failovers of CSV owners are potentially quicker and more predictable/consistent as an even distribution ensures that no one node owns a disproportionate number of CSVs.
  • When losing storage access the number of CSVs that are in redirected mode is potentially less as they are evenly distributed. In an unbalanced cluster it could be for all of them in a worse case scenario.
  • When using SOFS with Storage Spaces it makes sure the Storage Spaces Ownership is distributed fairly.

When does it happen

  • Each time a node leaves or joins the cluster. This means you don’t need to intervene manually or via PowerShell to get an even distribution. This goes for both exiting nodes as when adding a new node. The new node will get a CSV assigned if there is any on surplus on one of the existing nodes.
  • The process also works when you start a failover cluster when it has shut down.

When customers see this in action (it’s most obvious when then add a node as then they are normally watching) they generally smile as the cluster does it job getting  the best possible results out of their hardware.

Windows Server 2012 R2 Clustering brings improved CSV diagnosability

Cluster Shared Volumes (CSV) have can go into different redirected access modes for several reasons. Now a lot of people get (or got) worried about seeing “redirected access” in the GUI. Most of the time however this is due to normal operations such as backups or maintenance (defragmentation) not only losing disk access.

To remediate unneeded troubleshooting, sometimes leading to real issues, calls to MSFT support and so forth it was hidden from the Failover Clustering GUI in Windows 2012 R2. OK, so goal achieved but how do we now troubleshoot and view redirected access that might indicate the presence of real issues? The answer to that is the Get-ClusterSharedVolumeState PowerShell cmdlet. It displays the state of the CSVs on a per node basis for a cluster. You’ll see the type of the IO (Direct, File System Redirected and Block Redirected), if it’s completely unavailable  as well as the reason.

This is what the output looks like on a two node cluster where node A has lost it’s storage path or paths (MPIO) to the CSV. You’ll see that both CSV are in redirected access. Not only that but you can see what type (block redirected) and why (no disk connectivity).


Pretty neat and clear. I love this functionality by the way and It’s why I’m leveraging 10Gbps Ethernet extensively to make sure that CSV traffic get’s the bandwidth & latency to handle what it has to. If you realize it leverages SMB3 which provides SMB Multichannel and SMB Direct you know it will get the job done for you in your time of need.

While this is happening in the GUI you’ll see this


Nothing is going on … it would seem so a bit of monitoring and alerting would be of use here. The good news is finding out what’s up is very straight forward now.

Now there is still a case where you’ll see that the CSV is in redirected access mode and that when you’ve put it in there yourself via the GUI


or via PowerShell for maintenance reasons.


As you can see the Icon has change to a networked disk one and it states “Redirected Access”. With Get-ClusterSharedVolumeState the output looks like this.


You’ll always see warnings in the event logs.


So monitor those with SCOM or another tool that suits your taste and you’ll be in good shape to react when it’s needed and you now know how to find out what’s going on.

First experiences with a rolling cluster upgrade of a lab Hyper-V Cluster (Technical Preview)


In vNext we have gotten a long awaited  & very welcome new capability: rolling cluster upgrades. Which for the Hyper-V roles is a 100% zero down time experience. The only step that will require some down time is the upgrade of the virtual machine configuration files to vNext (version 5 to 6) as the VM has to be shut down for this.

How to

The process for a rolling upgrade is so straight forward I’ll just give you a quick bullet list of the first part of the process:

  • Evacuate the workload from the cluster node you’re going to upgrade
  • Evict the node to upgrade to vNext from the cluster
  • Upgrade (no in place upgrade supported but in your lab you can get away with it)
  • Add the upgraded node to the cluster
  • Rinse & repeat until all nodes have been upgraded (that can take a while with larger clusters)

Please note that all actions you administration you do on a cluster in mixed mode should be done from a node running vNext or a system running Windows 10 with the vNext RSAT installed.

Once you’ve upgraded all nodes in the cluster, the situation you’re in now is basically that you’re running a Windows Server vNext Hyper-V cluster in cluster functional level 8 (W2K12R2) and the next step is to upgrade to 9, which is vNext, no there no 10 yet in server Winking smile

You do this by executing the Update-ClusterFunctionalLevel cmdlet. This is an online process.  Again, do this from a node running vNext or a system running Windows 10 with the vNext RSAT installed. Note that this is where you’re willing to commit to the vNext level for the cluster. That’s where you want to go but you get to decide when. When you’ve do this you can’t go back to W2K12R2. It’s a matter of fact that as long as you’re running cluster functional level 8, you can reverse the entire process. Talk about having options! I like having options, just ask Carsten Rachfahl (@hypervserver), he’ll tell you it’s one of my mantras.


When this goes well you can just easily check the cluster functional level as follows:


When this is done you can do the upgrade of the VM configuration by running the Update-VMConfigurationVersion cmdlet. This is an off line process where the VMs you’re updating have to be shut down. You can do this for just one VM, all or anything in between. This is when you decided you’re committing to all the goodness vNext brings you.  But the fact that you have some time before you need to do it means you can  easily get those machine to run smoothly on a W2K12R2 cluster in case you need to roll back. Remember, options are good!

Doing so updates VM version from 5 to 6 and enables new Hyper-V features (hit F5 a lot or reopen Hyper-V Manager to see the value change.



Note: If in the lab you’re running some VMs on a cluster node are not highly available (i.e. they’re not clustered) they cannot be updated until the cluster functional level has been upgraded to version 9.

Defragmenting your CSV Windows 2012 R2 Style with Raxco Perfect Disk 13 SP2

When it comes to defragmenting CSV it seemed we took a step back when it comes to support from 3rd party vendors. While Windows provides for a great toolset to defragment a CSV it seemed to have disappeared form 3r party vendor software. Even from the really good Raxco Perfect disk. They did have support for this with Windows 2008 R2 and I even mentioned that in a blog.

If you need information on how to defragment a CSV in Windows 2012 R2, look no further.There is an absolutely fantastic blog post on the subject How to Run ChkDsk and Defrag on Cluster Shared Volumes in Windows Server 2012 R2, by Subhasish Bhattacharya one of the program managers in the Clustering and High Availability product group. He’s a great guy to talk shop to by the way if you ever get the opportunity to do so. One bizarre thing is that this must be the only place where PowerShell (Repair-ClusterSharedVolume cmdlet) is depreciated in lieu of chkdsk.

3rd party wise the release of Raxco Perfect Disk 13 SP2 brought back support for defragmenting CSV.


I don’t know why it took them so long but the support is here now. It looks like they struggled to get the CSVFS (the way CSV are now done since Windows Server 2012) supported. Whilst add it, they threw in support for ReFS by the way. This is the first time I’ve ever seen this. Any way it’s here and that’s good because I have a hard time accepting that any product (whatever it does) supports Hyper-V if it can’t handle CSV, not if you want to be taken seriously anyway. No CSV support equals = do not buy list in my book.

Here’s a screenshot of Perfect disk defragmenting away. One of the CSV LUNs in my lab is a SSD and the other a HDD.


Notice that in Global Settings you can tweak the behavior when defragmenting optimization of various drive types, including CSVFS but you just have to leave the default on unless you like manual labor or love PowerShell that much you can’t forgo any opportunity to use it Winking smile


Perfect disk cannot detect what kind of disks you have behind the CSV LUN so you might want to change the optimization method if you’re running SSD instead of HHD.


I’d love for Raxco to comment on this or point to some guidance.

What would also be beneficial to a lot of customers is guidance on defragmentation on the different auto-tiering storage arrays. That would make for a fine discussion I think.