Virtual Receive Side Scaling (vRSS) In Windows Server 2012 R2 Hyper-V


What is it?

One of the cool new features that takes scalability in Windows Server 2012 R2 Hyper-V to a new level is virtual Receive Side Scaling (vRSS). Receive Side Scaling (RSS) over SR-IOV has been supported since Windows Server 2012, but it’s best suited for specialized environments that require the best possible speeds at the lowest possible latencies. While SR-IOV is great for performance, it’s not as flexible. For example, you can’t team SR-IOV NICs, so if you need redundancy you’ll need to do guest NIC Teaming.

vRSS is supported on the VM network path (vNIC, vSwitch, pNIC) and allows VMs to scale better under heavier network loads. Without RSS support in the guest, a single logical CPU (core 0) has to deal with all the network interrupts. vRSS avoids this bottleneck by spreading network traffic among multiple VM processors, which is great news for data copy heavy environments.

What do you need?

Nothing special. It works with any NIC that supports VMQ, and that’s just about every 10Gbps NIC you can buy or already possess, so no investment is needed. It’s basically the DVMQ capability on the VMQ-capable host NIC that allows vRSS to be exposed inside the VM over the vSwitch. To take advantage of vRSS, VMs must be configured to use multiple cores and they must support RSS, so turn it on in the vNIC configuration in the guest OS. And don’t try to use a home PC 1Gbps card.

image

vRSS is enabled automatically when the VM uses RSS on the VM network path. The other good news is that this works over NIC Teaming, so you don’t have to do in-guest NIC Teaming.

What does it look like?

Without SR-IOV it was a serious challenge to push that 10Gbps vNIC to its limit, because all the interrupt handling was dealt with by a single CPU core. Here’s what a VM’s processor looks like under a sustained network load without vRSS. Not too shabby, but we want more.

image
As you can see, the incoming network traffic has to be dealt with by good old vCore 0. While DVMQ allows multiple processors on the host to deal with the interrupts for the VMs, it still means that you have a single core per VM. That one core is possibly a limiting factor (if you can get the network throughput and storage IO, that is). vRSS deals with this limitation. Look at the throughput we got copying a lot of data to the VM below leveraging vRSS. Yeah, that’s 8.5Gbps inside of a VM. Sweet. I’m sure I can get to 10Gbps …

image

Some ODX Fun With Windows Server 2012 R2 And A Dell Compellent SAN


I’m playing with and examining some of the ODX capabilities of our SANs (Dell Compellent) at the moment. It all seems pretty impressive in the demos. But how does that behave in real life on our gear? How impressive is ODX? Well, pretty darn impressive actually. And as with all great power, it needs to be wielded carefully, with insight and thought.

Let’s create some fixed virtual disks: 10 * 50GB vhdx and 10 * 475GB vhdx. We run a simple quick PowerShell script:

image

You see this correctly, it’s 41.5088855 seconds. Let’s round up to 42 seconds. That’s 20 fixed VHDX files, 10 of 50GB and 10 of 475GB, in 42 seconds. That’s a total of 5.12TB of VHDX files.
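As a quick sanity check on those numbers, here is a small Python sketch that recomputes the total capacity and the implied provisioning rate. It assumes binary units (1TB = 1024GB) and uses the unrounded time quoted above. The variable names are made up for this illustration. Note that with ODX the array does the work, so this rate is not data actually flowing through the host.

```python
# Sizes and timing as quoted in the post; binary units are an assumption.
small_disks_gb = 10 * 50      # ten 50GB fixed VHDX files
large_disks_gb = 10 * 475     # ten 475GB fixed VHDX files
elapsed_s = 41.5088855        # measured run time of the script

total_gb = small_disks_gb + large_disks_gb
total_tb = total_gb / 1024

# Capacity provisioned per second (offloaded to the SAN by ODX,
# so this is not traffic moving through the Hyper-V host).
rate_gb_per_s = total_gb / elapsed_s

print(f"total: {total_gb} GB (about {total_tb:.1f} TB)")
print(f"implied provisioning rate: {rate_gb_per_s:.0f} GB/s")
```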

image

Compared to creating a single 5TB vhdx file this isn’t too shabby, as that gets done in 26 seconds!

You can only dream of the kind of scenarios this kind of power enables. Woooot!!!

House Keeping In The Cluster Aware Updating GUI


When you work in an environment with multiple clusters and some of them are replaced, destroyed etc., you’ll end up with stale clusters in the “Recent Clusters” list of the Cluster Aware Updating GUI. In the example below the red entry (had to obfuscate, sorry) is a cluster that no longer exists, but its name is very similar to that of a new one that was created to fix a naming standard error. So we’d like to get rid of such entries to prevent mistakes and to avoid cluttering up the GUI with irrelevant information.

image

The “Recent Clusters” list is tied to your user profile, so it can end up polluted with stale entries for clusters that no longer exist. To clean them out you can dive into the registry and navigate to:

HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\ClusterAwareUpdating\ClusterMRU

image

Simply delete the entries whose values refer to old clusters that no longer exist.

Close the Cluster Aware Updating GUI if it’s still open and reopen it. You’ll see that the stale entries for the no longer existing clusters are gone from “Recent Clusters”.

image

Live Migration over NIC Team in Switch Independent Mode With Dynamic Load Balancing & Compression in Windows Server 2012 R2


In a previous blog post, Live Migration over NIC Team in Switch Independent Mode With Dynamic Load Balancing & TCP/IP in Windows Server 2012 R2, we looked at what Dynamic load balancing mode in NIC teaming can do for us, especially in a switch independent configuration, as until now there was no possibility to leverage the complete bandwidth provided by the NIC team when migrating between only 2 nodes. In that blog post we used TCP/IP. Now we’ll configure Compression and see what that does for us.

So we set up a NIC team in switch independent mode with Dynamic load balancing. It’s identical to the one used for the tests with TCP/IP.

Compression basically slashes the live migration times in half, at a cost: CPU cycles. And again, with Dynamic load balancing we can now also use all members of a NIC team for live migration, even in switch independent mode. Live migrating 6 VMs with 9GB of memory simultaneously took 12-14 seconds.
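To put “slashes in half” into numbers, the following Python sketch compares the 12-14 seconds measured here with the 28-30 seconds from the TCP/IP post. It is only a rough illustration and rests on assumptions: each of the 6 VMs has 9GB assigned, GB means GiB, and all assigned memory is transferred exactly once. The “effective” rate is therefore nominal. With compression it can exceed what the wire carries, because fewer bytes are actually sent.

```python
GIB_BITS = 8 * 1024**3  # bits in one GiB; "GB" of VM memory is assumed binary


def effective_gbps(memory_gib: float, seconds: float) -> float:
    """Nominal memory moved per second, in decimal Gbit/s."""
    return memory_gib * GIB_BITS / seconds / 1e9


total_memory_gib = 6 * 9  # six VMs, assumed 9GB each

tcpip_s = (28, 30)        # fastest, slowest run from the TCP/IP post
compression_s = (12, 14)  # fastest, slowest run from this post

for label, (fast, slow) in (("TCP/IP", tcpip_s), ("Compression", compression_s)):
    print(f"{label}: {effective_gbps(total_memory_gib, slow):.1f} to "
          f"{effective_gbps(total_memory_gib, fast):.1f} Gbit/s effective")

# Worst case: fastest TCP/IP run against slowest compressed run, and vice versa.
speedup_low = tcpip_s[0] / compression_s[1]
speedup_high = tcpip_s[1] / compression_s[0]
print(f"speedup from compression: {speedup_low:.1f}x to {speedup_high:.1f}x")
```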

image

Take a look at the screenshot above. You see 6 VMs coming in to the host where these counters are collected, and after that you see them being live migrated away from the host. As we have plenty of idle cycles in this test lab, they get used, both when the host is the target and when it is the source of the VMs being live migrated. You can also see that a lot less bandwidth is needed to achieve a faster live migration experience (compared to TCP/IP).

By the looks of it, the extra bandwidth will help out when we have less CPU and vice versa. This is the case for both a single NIC and teamed NICs. Do note that you cannot combine compression with Multichannel. That means that the only scenario allowing multiple NICs to be used with compression is NIC teaming. When you have a bunch of free 1Gbps NICs in surplus this might get things moving for you!

Interesting stuff. I’m really looking forward to the moment we can run production loads on these configurations …

Live Migration over NIC Team in Switch Independent Mode With Dynamic Load Balancing & TCP/IP in Windows Server 2012 R2


As you can imagine I was quite interested in seeing what the new Dynamic load balancing mode in NIC teaming can do for us. Especially in a switch independent configuration as until now there was no possibility to leverage the complete bandwidth provided by the NIC team when migrating between only 2 nodes.

So we set up a NIC team in switch independent mode with Dynamic load balancing. Here’s a screenshot of the NIC team setup. LM is the NIC team I’m using for some live migration testing.

image

For these tests we used TCP/IP to do the live migrations. I’ll be sharing the Compression & Multichannel performance option results in a later blog post and do some comparisons. But for now I can inform you that with Dynamic load balancing we can now also use all members of a NIC team for live migration, even in switch independent mode. I’m a fan of switch independent mode. Now possibly even more. Live migrating 6 VMs with 9GB of memory simultaneously took 28-30 seconds.

image

image

The CPU load is not very low, but RSS does its job and spreads it out.

image

image

Now the beauty of all this is that it had no negative impact due to out of order packets. For one, a single live migration sticks to a single team member. Here’s a screenshot of a single VM live migrated over a NIC Team with Dynamic load balancing.

image

image

As you can see there will not be out of order packets in this case.

Secondly, the Dynamic load balancing mode is based on “flowlets”. This means that the impact of out of order delivery and reordering of TCP/IP packets is minimal.

I also refer you to the following article: Dynamic Load Balancing Without Packet Reordering. The conclusion is quite interesting:

We have introduced the concept of flowlet-switching and developed an algorithm which utilizes flowlets in traffic splitting. Our work reveals several interesting conclusions. First, highly accurate traffic splitting can be implemented with little to no impact on TCP packet reordering and with negligible state overhead. Next, flowlets can be used to make load balancing more responsive, and thus help enable a new generation of real-time adaptive traffic engineering. Finally, the existence and usefulness of flowlets show that TCP burstiness is not necessarily a bad thing, and can in fact be used advantageously.
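To make the flowlet idea a bit more tangible, here is a toy Python sketch of flowlet switching. It is not Microsoft’s NIC teaming implementation, and the class name, the “least bytes sent” path choice and the 50 ms gap threshold are all assumptions made for illustration. The point it shows is general: a flow only moves to another team member after an idle gap, so packets within one burst stay on one path and cannot overtake each other.

```python
from dataclasses import dataclass, field


@dataclass
class FlowletBalancer:
    """Toy flowlet switcher: a flow may change path only after an idle gap."""
    paths: list
    gap_threshold: float                         # idle seconds that end a flowlet (assumed)
    _state: dict = field(default_factory=dict)   # flow id -> (path, last seen time)
    _bytes: dict = field(default_factory=dict)   # path -> bytes sent so far

    def __post_init__(self):
        self._bytes = {p: 0 for p in self.paths}

    def route(self, flow_id, timestamp, size):
        path, last_seen = self._state.get(flow_id, (None, None))
        new_flowlet = path is None or timestamp - last_seen > self.gap_threshold
        if new_flowlet:
            # Safe moment to rebalance: pick the path with the fewest bytes so far.
            path = min(self.paths, key=lambda p: self._bytes[p])
        self._state[flow_id] = (path, timestamp)
        self._bytes[path] += size
        return path


lb = FlowletBalancer(paths=["nic1", "nic2"], gap_threshold=0.05)
burst = [lb.route("flow-a", t, 1500) for t in (0.00, 0.01, 0.02)]  # one burst, one path
other = lb.route("flow-b", 0.02, 1500)       # new flow goes to the idle NIC
after_gap = lb.route("flow-a", 0.50, 1500)   # idle gap passed, flow-a may move
print(burst, other, after_gap)
```

In this run the three packets of the first burst all stay on the same NIC, and the flow is only moved once it has been silent for longer than the threshold.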

And now, as a show closer, let’s do live migrations between both hosts in both directions.

image

Speed, people, in live migration is a thing of beauty. Microsoft is really providing us with lots of options. This is good. We can use what’s available, where available, when available and make sure we get the best possible solution and performance whatever the environment and budget.

Preliminary Results With Live Migration Over RDMA Speed & Useful Number Of NICs


Introduction

With Windows Server 2012 R2 (Preview) we can leverage SMB to do Live Migrations. That means we can now offload the process to the NICs if they support RDMA, save on CPU cycles and potentially get VMs moved a lot faster without impacting the performance of running VMs on the involved hosts. Perhaps it’s even faster than over TCP/IP. Sounds great, so let’s do some testing.

  • We have a dual port 10Gbps Mellanox RDMA card (RoCE) in each host. One pair of the ports is interconnected via a direct attach cable. The other pair is connected over a Force10 S4810 switch. We’re using the in box Windows Server 2012 R2 Preview drivers for everything, as we have found other drivers not to install properly (or not at all) on this release and to cause issues.
  • We are using one VM running Windows Server 2012 RTM with upgraded Integration Services components. This VM has 4 vCPUs and 55GB of fixed memory assigned. For this purpose we had no workload running in the VM. The servers are standard DELL PowerEdge R720 kit running the Windows Server 2012 R2 Preview bits.

Results

No Performance tweaking

Live Migration over RDMA in action. Here we are using one 10Gbps RoCE RDMA NIC and moving via the NIC port that goes over the S4810 switch.

image

As you can see the entire process took 74 seconds. RDMA did not kick in until 19 seconds had passed since the start.

The CPU load remains low, which is where you’ll find the biggest benefit of RDMA with live migrations.

image

Now let’s put two RDMA RoCE ports into play and see what that does for us. We now Live Migrate the 55GB memory VM in 52-54 seconds. Not bad. Again we saw over 20 seconds pass before RDMA kicks in.

image

Again we see that CPU usage remains low. This is just a quick screenshot. On a Hyper-V node you’ll need to dive into Performance Monitor to get some real info.

image

Let’s repeat this exercise and see what happens if we move the traffic over the NIC ports that are directly attached. That will give us an indication about the configuration of the switch. Configuring RoCE DCB features like PFC/ETS is not exactly a well documented process at the moment and often I feel like a magician’s apprentice.

Once more we see that it takes about 20 seconds for RDMA to kick in and that the time rises to 79 seconds. It actually fluctuates between 74 and 79 seconds.

image

The CPU load was low again. So both paths seem to perform comparably.

Live Migrations over SMB seem to function faster using two RDMA ports, but not twice as fast. These are the preview bits so nothing definitive yet. And sorry, I cannot do 40Gbps or 56Gbps Infiniband tests. Unless you want to donate the gear and pay for the power, time & reporting.
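The “faster but not twice as fast” observation can be quantified with a few lines of Python. The overall speedup uses only the timings quoted above. The second figure is a hypothetical view that subtracts the roughly 20 second delay before RDMA kicks in, under the simplifying assumption that this delay is a fixed cost that does not scale with the number of ports. Treat it as a rough sketch, not a measurement.

```python
# Timings quoted above (seconds), before any BIOS power tweaking.
one_port_total = 74
two_ports_total = (52, 54)
startup_delay = 20  # approximate time before RDMA kicks in, per the text


def speedup(before: float, after: float) -> float:
    """How many times faster 'after' is compared to 'before'."""
    return before / after


overall = [speedup(one_port_total, t) for t in two_ports_total]

# Hypothetical: compare only the phase in which RDMA is active.
rdma_phase = [speedup(one_port_total - startup_delay, t - startup_delay)
              for t in two_ports_total]

print("overall speedup:    " + ", ".join(f"{s:.2f}x" for s in overall))
print("RDMA phase speedup: " + ", ".join(f"{s:.2f}x" for s in rdma_phase))
```

Under that assumption a good part of the gap to a 2x speedup comes from the startup delay rather than from the second port underperforming.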

Max Performance Tweaking

As my readers very well know, I tweak my nodes for best performance. The savings of energy (power, cooling) have to come from making the most out of every node and shutting them down when not needed (Dynamic Optimization/Power Optimization in System Center). I still have a standing order to take away any physical limitations possible for the business.

While Windows Server 2012 (R2) has made tremendous strides towards better use of the available bandwidth of 10Gbps pipes out of the box, I still dive into the BIOS to turn off the C/C1E states and set the CPU Power Management and Memory Frequency to Maximum performance. Have a look at the blog post Still Need To Optimizing Power Settings On DELL 12th Generation Servers For Lightning Fast Hyper-V Live Migrations? on how to do this with DELL Generation 12 Servers. It also contains a link to the guidance for older generations.

As you can imagine I was quite interested to see if the settings affect RDMA as well. So let’s have a look with these settings here:

image

One RDMA NIC used (Mellanox, RoCE, 10Gbps)

image

54 seconds for that 55GB memory (fixed) VM. We also note that the delay of 19-20 seconds before RDMA kicks in has dropped to 3-4 seconds, which is quite interesting. Basically this makes it as fast as 2 RDMA NICs without performance tweaking.

Two RDMA NICs used (Mellanox, RoCE, 10Gbps)

image

30 seconds flat, repeatedly, for that 55GB memory (fixed) VM. Again we note that the delay before RDMA kicks in has dropped from 19-20 seconds to 3-4 seconds. So this is about 45% better than without the power optimization.

What is the CPU doing during all this? Well, taking care of the VM load, not spending cycles on network interrupts. Again, this is a quick screenshot. On a Hyper-V node you’ll need to dive into Performance Monitor to get some real info.

image
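For those who want to check the “about 45%” figure, here is the arithmetic as a small Python sketch. It only uses the timings reported in this post (52-54 seconds with two untuned NICs, 54 seconds with one tuned NIC, 30 seconds with two tuned NICs). The helper function name is made up for this example.

```python
def improvement_pct(before_s: float, after_s: float) -> float:
    """Reduction in migration time as a percentage of the original time."""
    return (before_s - after_s) / before_s * 100


untuned_two_nics = (52, 54)  # seconds, two RDMA NICs without power tuning
tuned_one_nic = 54           # seconds, one RDMA NIC with power tuning
tuned_two_nics = 30          # seconds, two RDMA NICs with power tuning

for before in untuned_two_nics:
    pct = improvement_pct(before, tuned_two_nics)
    print(f"{before}s -> {tuned_two_nics}s: {pct:.1f}% less time")

# With tuning in place, adding the second NIC gets much closer to 2x.
scaling = tuned_one_nic / tuned_two_nics
print(f"one tuned NIC -> two tuned NICs: {scaling:.1f}x faster")
```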

By now you must all be eager to see how this compares against Live Migration over TCP/IP, Multichannel and with Compression. That’s material for other blogs.

Why am I doing this?

We need to get the most out of every € or $ we spend. It’s not that we don’t have any cash left, but why buy more servers & higher end gear to get better results when the answer lies in the correct configuration & better choices when designing a solution? It’s going to be a while before this knowledge becomes mainstream and widely available. Years probably, so why wait? It takes time to experiment but the results & ROI are great. Why spend another 50.000 to 100.000 Euro on servers, 10Gbps cards & switch ports if you don’t need to? Count the cost to host, power & cool them and you’ll see that this time is an investment. You could also decide to leverage the cloud, but wasting VM cycles there also costs money you have better uses for, so testing will be needed there as well.

An Early Look At Live Migration Over TCP/IP & Multichannel In Windows Server 2012 R2 Preview


Introduction

With Windows Server 2012 R2 (Preview) we can do Live Migrations over TCP/IP like before. That’s either using a single NIC or by teaming two or more NICs. We also have Compression and Multichannel. In this blog post we’ll play with TCP/IP and Multichannel.

  • We have a dual port 10Gbps Mellanox RDMA card (RoCE) in each host. But for these tests we have disabled the RDMA capabilities of these NICs. As in the RDMA blog post, one pair of the ports is interconnected via a direct attach cable. The other pair is connected over a Force10 S4810 switch. We’re using the in box Windows Server 2012 R2 Preview drivers for everything, as we have found other drivers not to install properly (or not at all) on this release and to cause issues.
  • We are using one VM running Windows Server 2012 RTM with upgraded Integration Services components. This VM has 4 vCPUs and 55GB of fixed memory assigned. For this purpose we had no workload running in the VM. The servers are standard DELL PowerEdge R720 kit running the Windows Server 2012 R2 Preview bits.

Results

No Performance tweaking

We test a Live Migration over one 10Gbps NIC. It’s fast, but I don’t like the jigsaw effect and we don’t push the bandwidth to the limit yet.

image

We can move the 55GB memory VM in about 70 seconds on average. You have a bit more CPU load here but nothing too bad. Most often the Hyper-V host has ample CPU cycles left, so this will not hinder performance. I also remember Aidan Finn’s work testing a truckload of concurrent live migrations on a host with only 1 low end CPU with 4 cores, which made it throttle the number of live migrations it would start to safeguard the workload.

image

So let’s do what we’ve always done: turn on Jumbo Frames. This helps to peak at 1.25GB/s and improves speeds (10% or more), but the jigsaw is still a bit visible. As I think we can do better, we move in the big guns and optimize our power settings as discussed in Still Need To Optimizing Power Settings On DELL 12th Generation Servers For Lightning Fast Hyper-V Live Migrations? and Optimizing Live Migrations with a 10Gbps Network in a Hyper-V Cluster. Now with C & C1E states disabled and both processor & memory optimized for performance we see this.

image

Now that’s power. We have faster Live Migrations (54 seconds on average) with top bandwidth use during the entire migration process and we see 50% better blackout times. What’s not to like here? CPU usage isn’t that bad and you’ll likely have some cycles to spare, unless you’re over 60-70% of CPU use by your VMs, and then you need to fix that anyway as you’re out of the safe zone. So, Jumbo Frames & Power Optimization are key!

Of course we’re always looking for better and more. In Live Migration terms that means speed. So let’s see what Multichannel can do for us and switch to SMB. As we have disabled RDMA on the NICs this “only” gives us Multichannel. The cool thing is, the second NIC doesn’t have Jumbo Frames enabled yet. I have always found Jumbo Frames to matter and now with Multichannel I have a very nice way of demonstrating / visualizing this. Here’s a screenshot of moving our test VM back and forth. As you can see we have one NIC with Jumbo Frames disabled and one with Jumbo Frames enabled. You don’t have to guess which one is which. Yup, Jumbo Frames do matter when you push to the limits. We are getting about 31 seconds on average here with the 55GB VM.
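To see how close 54 seconds is to the theoretical limit of a single 10Gbps NIC, here is a rough Python estimate. It assumes that 55GB means 55GiB, that all assigned memory is transferred exactly once, and it ignores protocol overhead, so the real ceiling is somewhat lower than what this sketch reports. The 1.25GB/s peak mentioned above is simply 10Gbit/s divided by 8.

```python
LINE_RATE_BYTES_PER_S = 10e9 / 8   # 10 Gbit/s = 1.25 GB/s
vm_memory_bytes = 55 * 1024**3     # 55GB of fixed memory, assumed binary

# Best case: the NIC runs at line rate for the whole migration.
ideal_s = vm_memory_bytes / LINE_RATE_BYTES_PER_S

measured = {
    "no tweaking": 70,
    "Jumbo Frames + power optimization": 54,
}
for label, seconds in measured.items():
    print(f"{label}: {seconds}s, about {ideal_s / seconds:.0%} of line rate")
print(f"ideal time at line rate: {ideal_s:.1f}s")
```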

image

Here’s the same with Jumbo Frames enabled on both NICs. And guess what, we just cut another 3 seconds off the Live Migration time. 28 seconds flat.

image

In a histogram it looks like this. That’s what maximum throughput looks like.

image

Let’s see what our CPUs are up to during all this. Some cores are rather busy dealing with the interrupts. But this is just one VM.

image

You may wonder why with 2*10Gbps you only see 2*4 CPUs doing work, while the default number of RSS queues is 8 and you’d expect 16. It’s because Multichannel defaults to 4 connections per NIC, so we get 8. This is configurable and testing will show what difference this could make and whether it’s wise to tweak. It all depends.

Sure, this is only one large memory VM, but what if we do more? Like 6 VMs with 9GB of memory. Not too bad.

image
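The core count reasoning can be written down as a one line model. This is a simplified sketch, not how Windows actually schedules anything: it assumes every SMB Multichannel connection lands on its own RSS queue and therefore on its own core. The function and parameter names are made up for the example.

```python
def busy_cores(nics: int, rss_queues_per_nic: int, connections_per_nic: int) -> int:
    """Upper bound on cores handling receive traffic in this simplified model.

    A NIC cannot spread load over more cores than it has RSS queues,
    nor over more cores than there are connections to hash onto them.
    """
    return nics * min(rss_queues_per_nic, connections_per_nic)


# The setup described above: 2 NICs, 8 RSS queues each, 4 connections per NIC.
default_setup = busy_cores(nics=2, rss_queues_per_nic=8, connections_per_nic=4)

# What you would get if the connection count were raised to match the queues.
raised_connections = busy_cores(nics=2, rss_queues_per_nic=8, connections_per_nic=8)

print(default_setup, raised_connections)
```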

image

What if that host is running 30 or 40 VMs? That adds up. Well, that’s what RDMA is for! But that’s yet another blog post.

Do keep in mind this is all just the Preview bits … MSFT does two things now until R2 is released: they kill bugs and tweak for speed. I tune my Live Migration settings in production to get the most bang for the buck, and I try to avoid dips in bandwidth like you see above. So the work is not finished yet.

Conclusion

I can conclude that all the hints & tips of the past to optimize Live Migration still hold true. Yes, you should enable Jumbo Frames and yes, you should still optimize your host for performance over power savings. That said, the times that you’d only get 16% of bandwidth usage out of a 10Gbps NIC when you optimize for power savings are long gone, ever since Windows Server 2012. But if you feel the need for (even more) speed …, then by all means go for it.

If you want to conserve energy & be environmentally sound, make the most of the least number of nodes possible and use Dynamic Optimization / Power Optimization to shut them down when not needed and fire them up to rise to the occasion.

Oh yes, test, people, test. Trust but verify and determine the best possible configuration for both your environment and needs.

Now we’ll have a look at compression … but again that’s another blog post!