DCB PFC Demo with SMB Direct over RoCE (RDMA)


In this blog post we’ll demo Priority Flow control. We’re using the demo comfit as described in SMB Direct over RoCE Demo – Hosts & Switches Configuration Example

There is also a quick video to illustrate all this on Vimeo. It’s not training course grade I know, but my time to put into these is limited.

I’m using Mellanox ConnectX-3 ethernet cards, in 2 node DELL PowerEdge R720 Hyper- cluster lab. We’ve configured the two ports for SMB Direct & set live migration to leverage them both over SMB Direct. For that purpose we tagged SMB Direct traffic with priority 4 and all other traffic with priority 1. We only made priority lossless as that’s required for RoCE and the other traffic will deal with not being lossless by virtue of being TCP/IP.

Priority Flow Control is about making traffic lossless. Well some traffic. While we’d love to live by Queens lyrics “I want it all, I want it all and I want it now” we are limited. If not so by our budgets, than most certainly by the laws of physics. To make sure we all understand what PFC does here’s a quick reminder: It tells the sending party to stop sending packets, i.e. pause a moment (in our case SMB Direct traffic) to make sure we can handle the traffic without dropping packets. As RoCE is for all practical purposes Infiniband over Ethernet and is not TCP/IP, so you don’t have the benefits of your protocol dealing with dropped packets, retransmission … meaning the fabric has to be lossless*. So no it DOES NOT tell non priority traffic to slow down or stop. If you need to tell other traffic to take a hike, you’re in ETS country 🙂

* If any switch vendor tells you to not bother with DCB and just build (read buy their switches = $$$$$) a lossless fabric (does that exist?) and rely on the brute force quality of their products to have a lossless experience … could be an interesting experiment Smile.

Note: To even be able to start SMB Direct SMB Multichannel must be enabled as this is the mechanism used to identify RDMA capabilities after which a RDMA connection is attempted. If this fails you’ll fall back to SMB Multichannel. So you will have ,network connectivity.

You want RDMA to work and be lossless. To visualize this we can turn to the switch where we leverage the counter statistics to see PFC frames being send or transmitted. A lab example from a DELL PowerConnect 8100/N4000 series below.

image

To verify that RDMA is working as it should we should also leverage the Mellanox Adapter Diagnostic and native Windows RDMA Activity counters. First of all make sure RDMA is working properly. Basically you want the error counters to be zero and stay that way.

Mellanox wise these must remain at zero (or not climb after you got it right):

  • Responder CQE Errors
  • Responder Duplicate Request Received
  • Responder Out-Of-Order Sequence Received
  • … there’s lots of them …

image

Windows RDMA Activity wise these should be zero (or not climb after you got it right):

  • RDMA completion Queue Errors
  • RDMA connection Errors
  • RDMA Failed connection attempts

image

The event logs are also your friend as issues will log entries to look out for like

PowerShell is your friend (adapt severity levels according to your need!)

Get-WinEvent -ListLog “*SMB*” | Get-WinEvent | ? { $_.Level -lt 4 -and $_. Message -like “*RDMA*” } | FL LogName, Id, TimeCreated, Level, Message

Entries like this are clear enough, it ain’t working!

The network connection failed.
Error: The I/O request was canceled.
Connection type: Rdma
Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks port 445 or 5445 can also cause this issue.
 
RDMA interfaces are available but the client failed to connect to the server over RDMA transport.
Guidance:
Both client and server have RDMA (SMB Direct) adaptors but there was a problem with the connection and the client had to fall back to using TCP/IP SMB (non-RDMA).

 

To view PFC action in Windows we rely on the Mellanox Adapter QoS Counters

image

Below you’ll see the number of  pause frames being sent & received on each port. Click on the image to enlarge.

image

An important note trying to make sense of it all: … pauze and receive frames are sent and received hop to hop. So if you see a pause frame being sent on a server NIC port you should see them being received on the switch port and not on it’s windows target you are live migrating from. The 4 pause frames sent in the screenshot above are received by the switchport as you can see from the PFC Stats for that port.

image

People, if you don’t see errors in the error counters and event viewer that’s good. If you see the PFC Pause frame counters move up a bit that’s (unless excessive) also good and normal, that PFC doing it’s job making sure the traffic is lossless. If they are zero and stay zero for ever you did not buy a lossless fabric that doesn’t need DCB, it’s more likely you DCB/PFC is not working Winking smile and you do not have a lossless fabric at all. The counters are cumulative over time so they don’t reset to zero bar resetting the NIC or a reboot.

image

When testing feel free to generate lots of traffic all over the place on the involved ports & switches this helps with seeing all this in action and verifying RDMA/PFC works as it should. I like to use ntttcp.exe to generate traffic, the most recent version will let you really put a load on 10GBps and higher NICs. Hammer that network as hard as you can Winking smile.

Again a simple video to illustrate this on Vimeo.

Advertisements

SMB Direct over RoCE Demo – Hosts & Switches Configuration Example


As mentioned in Where SMB Direct, RoCE, RDMA & DCB fit into the stack this post’s only function is to give you an overview of the configurations used in the demo blogs/videos. First we’ll configure one Windows Server 2012 R2 host. I hope it’s clear this needs to be done on ALL hosts involved. The NICs we’re configuring are the 2 RDMA capable 10GbE NICs we’ll use for CSV traffic, live migration and our simulated backup traffic. These are Mellanox ConnectX-3 RoCE cards we hook up to a DCB capable switch. The commands needed are below and the explanation is in the comments. Do note that the choice of the 2 policies, priorities and minimum bandwidths are for this demo. It will depend on your environment what’s needed.

#Install DCB on the hosts
Install-WindowsFeature Data-Center-Bridging
#Mellanox/Windows RoCE drivers don't support DCBx (yet?), disable it.
Set-NetQosDcbxSetting -Willing $False
#Make sure RDMA is enable on the NIC (should be by default)
Enable-NetAdapterRdma –Name RDMA-NIC1
Enable-NetAdapterRdma –Name RDMA-NIC2
#Start with a clean slate
Remove-NetQosTrafficClass -confirm:$False
Remove-NetQosPolicy -confirm:$False

#Tag the RDMA NIC with the VLAN chosen for PFC network
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC-1" -RegistryKeyword "VlanID" -RegistryValue 110
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC-2" -RegistryKeyword "VlanID" -RegistryValue 120

#SMB Direct traffic to port 445 is tagged with priority 4
New-NetQosPolicy "SMBDIRECT" -netDirectPortMatchCondition 445 -PriorityValue8021Action 4
#Anything else goes into the "default" bucket with priority tag 1 🙂
New-NetQosPolicy "DEFAULT" -default  -PriorityValue8021Action 1

#Enable PFC (lossless) on the priority of the SMB Direct traffic.
Enable-NetQosFlowControl -Priority 4
#Disable PFC on the other traffic (TCP/IP, we don't need that to be lossless)
Disable-NetQosFlowControl 0,1,2,3,5,6,7

#Enable QoS on the RDMA interface
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC1"
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC2"

#Set the minimum bandwidth for SMB Direct traffic to 90% (ETS, optional)
#No need to do this for the other priorities as all those not configured
#explicitly goes in to default with the remaining bandwith.
New-NetQoSTrafficClass "SMBDirect" -Priority 4 -Bandwidth 90 -Algorithm ETS

We also show you in general how to setup the switch. Don’t sweat the exact syntax and way of getting it done. It differs between switch vendors and models (we used DELL Force10 S4810 and PowerConnect 8100 / N4000 series switches), it’s all very alike and yet very specific. The important thing is that you see how what you do on the switches maps to what you did on the hosts.

!Disable 802.3x flow control (global pause)- doesn't mix with DCB/PFC
workinghardinit#configure
workinghardinit(conf)#interface range tengigabitethernet 0/0 -47 
workinghardinit(conf-if-range-te-0/0-47)#no flowcontrol rx on tx on
workinghardinit(conf-if-range-te-0/0-47)# exit
workinghardinit(conf)# interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48-52)#no flowcontrol rx on tx off
workinghardinit(conf-if-range-fo-0/48-52)#exit

!Enable DCB & Configure VLANs
workinghardinit(conf)#service-class dynamic dot1p
workinghardinit(conf)#dcb enable
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config
workinghardinit#reload

!We use a <> VLAN per subnet
workinghardinit#configure
workinghardinit(conf)#interface vlan 110
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit(conf)#interface vlan 120
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit (conf-if-vl-vlan-id*)#exit


!Create & configure DCB Map Policy
workinghardinit(conf)#dcb-map SMBDIRECT
workinghardinit(conf-dcbmap-profile-name*)#priority-group 0 bandwidth 90 pfc on 
workinghardinit(conf-dcbmap-profile-name*)#priority-group 1 bandwidth 10 pfc off 
workinghardinit(conf-dcbmap-profile-name*)#priority-pgid 1 1 1 1 0 1 1 1
workinghardinit(conf-dcb-profile-name*)#exit 

!Apply DCB map to the switch ports & uplinks
workinghardinit(conf)#interface range ten 0/047
workinghardinit(conf-if-range-te-0/0-47)# dcb-map SMBDIRECT 
workinghardinit(conf-if-range-te-0/0-47)#exit
workinghardinit(conf)#interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48,fo-0/52)# dcb-map SMBDIRECT
workinghardinit(conf-if-range-fo-0/48,fo-0/52)#exit
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config 

With the hosts and the switches configured we’re ready for the demos in the next two blog posts. We’ll show Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS) in action with some tips on how to test this yourselves.

SMB Direct with DCB, PFC, ETS … How do I know it works?!


A question that comes up over time, again and again, is how do you know SMB Direct is working. The question stems from a nagging feeling that configuring DCB is a bit of playing wizard’s apprentice and we might not completely know what we’re doing, i.e. lack of experience.

image

Many have suspected me of brewing up DCB configurations in a dark corner of the data center where no one else dares venture. But those are unsubstantiated rumors. But in coming blog posts we’ll address how to configure it end to end and we’ll show how to find out if it’s really working and how to test that.

Finding out if it really works, testing and monitoring isn’t magic. It boils down to using tools you know. Performance counters for RDMA Activity and SMB direct are natively available in Windows. Use them!The NIC vendors also provide very detailed counters, those are excellent and of great value when testing and confirming things work as they should. The latter is very important. Because after people are satisfied SMB Direct works they want to know if DCB is configured correctly. Does PFC work, are pause frames being send and received? Is it really lossless?  Does ETS really kick in when needed, do I get the minimum bandwidth I configured? These are very valid questions people struggle with. But the answer eludes many, almost like the question if the refrigerator light really goes out when you close the door.

It’s hard to do deep down in the network packets … that often requires a very specialized skillset and experience with packet analyzers etc. Nothing most of you can’t learn but often this is not a priority. But with some creativity and the performance counters on windows provided by the NIC vendors and the statistics counters on the switches you can demonstrate that both PFC & ETS doe work and kick in.

So in upcoming blogs & videos I’ll demonstrate the configuring SMB Direct over RoCE leveraging 2 parts of DCB:

  • PFC (Priority Flow Control) – mandatory for SMB Direct over RoCE
  • ETS (Enhanced Transmission Selection) – optional but I advise you to leveraged it for SMB Direct over RoCE

Actually, when doing true converged, no matter what route you go, QoS is not really optional any more.

The biggest challenge is to get people to wrap their heads around the concepts and it’s behavior. Once you do that you’ll understand how and why to configure it. It took me time and effort, there’s no way around it, but it’s well worth the effort.

Look, DCB is not 100% fully matured or perfect especially in large scale environments over > 2 or 3 hops. Frak, while I love tinkering, testing and playing with this stuff I have never been a “QoS first person”. If I can I thrown resources at the problem (CPU cycles; memory, bandwidth, …). QoS is like a gun. You only draw it when you must use it and than you’d better do it right otherwise you don’t touch it, bar for practice/training/ education. While perfection is not of this world and improvements are being worked on (ECN) it does work and deliver. How many of you had a large scale > 2 hops , > 20 switches deployment with FC, FCoE or iSCSI to worry about? So can it deliver what you need today in most scenarios? Yes! Can I fix the short comings of any random technologies? No. Can I leverage current technologies with great success despite this? Yes! So can you. There is a reason I get hired and paid. Trust me it’s not my looks, my bed side manner or charismatic appearance Winking smile.

Side note 1: I’m cannot possibly provide a switch configuration guide in a step by step fashion as the details vary by vendor, they can also be switch model/type specific and it all depends on your environment & needs. So no I cannot and will not attempt to write a bunch of these. This would be way too much work and way too expensive (time, hardware etc.), so unless I’m paid very generously to do so, you’re out of luck. It might be cheaper to hire me or to come to the free community sessions, presentations, ATE evenings and study up.

Help! Active Directory cannot read forest and domain functional level anymore or much ado about nothing


The other day I got a very worried request for support. Apparently the Forest and domain functional level of a Active Directory deployment could no longer be read. Nothing else was wrong, everything was working just fine. If I could have a quick look? So they shared the screen with me and this is what I saw.

image

And this …

image

Well that didn’t surprise me, they are supposed to be already on domain and forest functional level 2012 R2 as all their domain controllers where already on Windows 202 R2 for over a year. That error message is not right!  After being puzzled for a moment when it hit me. This was a Windows 2008 R2 host without updated tools!

Once they used the correct version of RSAT or checking on a Windows 2012 R2 host all was show to be well and the scare was over.

image

Nothing so see here, move along.

Optimizing Backups: PowerShell Script To Move All Virtual Machines On A Cluster Shared Volume To The Node Owing That CSV


When you are optimizing the number of snapshots to be taken for backups or are dealing with storage vendors software that leveraged their hardware VSS provider you some times encounter some requirements that are at odds with virtual machine mobility and dynamic optimization.

For example when backing up multiple virtual machines leveraging a single CSV snapshot you’ll find that:

  • Some SAN vendor software requires that the virtual machines in that job are owned by the same host or the backup will fail.
  • Backup software can also require that all virtual machines are running on the same node when you want them to be be protected using a single CSV snap shot. The better ones don’t let the backup job fail, they just create multiple snapshots when needed but that’s less efficient and potentially makes you run into issues with your hardware VSS provider.

image

VEEAM B&R v8 in action … 8 SQL Server VMs with multiple disks on the same CSV being backed up by a single hardware VSS writer snapshot (DELL Compellent 6.5.20 / Replay Manager 7.5) and an off host proxy Organizing & orchestrating backups requires some effort, but can lead to great results.

Normally when designing your cluster you balance things out a well as you can. That helps out to reduce the needs for constant dynamic optimizations. You also make sure that if at all possible you keep all files related to a single VM together on the same CSV.

Naturally you’ll have drift. If not you have a very stable environment of are not leveraging the capabilities of your Hyper-V cluster. Mobility, dynamic optimization, high to continuous availability are what we want and we don’t block that to serve the backups. We try to help out to backups as much a possible however. A good design does this.

If you’re not running a backup every 15 minutes in a very dynamic environment you can deal with this by live migrating resources to where they need to be in order to optimize backups.

Here’s a little PowerShell snippet that will live migrate all virtual machines on the same CSV to the owner node of that CSV. You can run this as a script prior to the backups starting or you can run it as a weekly scheduled task to prevent the drift from the ideal situation for your backups becoming to huge requiring more VSS snapshots or even failing backups. The exact approach depends on the storage vendors and/or backup software you use in combination with the needs and capabilities of your environment.

cls

$Cluster = Get-Cluster
$AllCSV = Get-ClusterSharedVolume -Cluster $Cluster

ForEach ($CSV in $AllCSV)
{
    write-output "$($CSV.Name) is owned by $($CSV.OWnernode.Name)"
    
    #We grab the friendly name of the CSV
    $CSVVolumeInfo = $CSV | Select -Expand SharedVolumeInfo
    $CSVPath = ($CSVVolumeInfo).FriendlyVolumeName

    #We deal with the  being and escape character for string parsing.
    $FixedCSVPath = $CSVPath -replace '\', '\'

    #We grab all VMs that who's owner node is different from the CSV we're working with
    #From those we grab the ones that are located on the CSV we're working with
      $VMsToMove = Get-ClusterGroup | ? {($_.GroupType –eq 'VirtualMachine') -and ( $_.OwnerNode -ne $CSV.OWnernode.Name)} | Get-VM | Where-object {($_.path -match $FixedCSVPath)} 
     
    ForEach ($VM in $VMsToMove)

    {
        write-output "`tThe VM $($VM.Name) located on $CSVPath is not running on host $($CSV.OwnerNode.Name) who owns that CSV"
        write-output "`tbut on $($VM.Computername). It will be live migrated."
        #Live migrate that VM off to the Node that owns the CSV it resides on
        Move-ClusterVirtualMachineRole -Name $VM.Name -MigrationType Live -Node $CSV.OWnernode.Name
    }

Now there is a lot more to discuss, i.e. what and how to optimize for virtual machines that are clustered. For optimal redundancy you’ll have those running on different nodes and CSVs. But even beyond that, you might have the clustered VMs running on different cluster, which is the failure domain.  But I get the remark my blogs are wordy and verbose so … that’s for another time Winking smile

In Defense of Switch Independent Teaming With Hyper-V


For many old timers (heck, that includes me) NIC teaming with LACP mode was the best of the best, at least when it comes to teaming options. Other modes often led to passive/active, less than optimal receiving network traffic aggregation. Basically, and perhaps over simplified, I could say the other options were only used if you had no other choice to get things to work. Which we did a lot … I used Intel’s different teaming modes for various reasons in the past (before we had MLAG, VLT, VPC, …). Trying to use LACP where possible was a good approach in the past in physical deployments and early virtualized environments when 1Gbps networking dominated the datacenter realm and Windows did not have native support for LBFO.

But even LACP, even in those days, had some drawbacks. It’s the most demanding form of teaming. For one it required switch stacking. This demands the same brand and type of switches and that means you have no redundancy during firmware upgrades. That’s bad, as the only way to work around that is to move all workload to another rack unit … if you even had the capability to do that! So even in days past we chose different models if teaming out of need or because of the above limitations for high availability. But the superiority of NIC teaming with LACP still stands for many and as modern switches support MLAG, VLT, etc. the drawback of stacking can be avoided. So does that mean LACP for NIC teaming is always the superior choice today?

Some argue it is and now they have found support in the documentation about Microsoft CPS system documentation about Microsoft CPS system. Look, even if Microsoft chose to use LACP in their solutions it’s based on their particular design and the needs of that design I do not concur that this is the best overall. It is however a valid & probably the choice for their specific setup. While I applaud the use of MLAG (when available to you a no or very low cost) to have all bases covered but it does not mean that LACP is the best choice for the majority of use cases with Hyper-V deployments. Microsoft actually agrees with me on this in their Windows Server 2012 R2 NIC Teaming (LBFO) Deployment and Management guide. They state that Switch Independent configuration / Dynamic distribution (or Hyper-V Port if on Hyper-V and if not on W2K12R2)  is the best possible default choice is for teaming in both native and Hyper-V environments. I concur, even if perhaps not that strong for native workloads (it depends). Exceptions to this:

  • Teaming is being performed in a VM (which should be rare),
  • Switch dependent teaming (e.g., LACP) is required by policy, or
  • Operation of a two-member Active/Standby team is required by policy.

In other words in 2 out of 3 cases the reason is a policy, not a technical superior solution …

Note that there are differences between Address Hash, Hyper-V Port mode & the new dynamic distribution modes and the latter has made things better in W2K12R2 in regards to bandwidth but you’ll need the read the white papers. Use dynamic as default, it is the best. Also note that LACP/Switch Dependent doesn’t mean you can send & receive to and from a VM over the aggregated bandwidth of all team members. Life is more complicated than that. So if that’s you’re main reason for switch dependent, and think you’re done => be ware Winking smile.

Switch Independent is also way better for optimization of VMQ. You have more queues available (sum-of-queues) and the IO path is very predictable & optimized.

If you don’t control the switches there’s a lot more cross team communication involved to set up teaming for your hosts. There’s more complexity in these configurations so more possibilities for errors or bugs. Operational ease is also a factor.

The biggest draw back could be that for receiving traffic you cannot get more than the bandwidth a single team member can deliver. That’s true but optimizing receiving traffic has it’s own demands and might not always be that great if the switch configuration isn’t that smart & capable. Do I ever miss the potential ability to aggregate incoming traffic. In real life I do not (yet) but in some configurations it could do a great job to optimize that when needed.

When using 10Gbps or higher you’ll rarely be in a situation where receiving traffic is higher than 10Gbps or higher and if you want to get that amount of traffic you really need to leverage DVMQ. And a as said switch independent teaming with port of dynamic mode gives you the most bang for the buck. as you have more queues available. This drawback is mitigated a bit by the fact that modern NICs have way larger number of queues available than they used to have. But if you have more than one VM that is eating close to 10Gbps in a non lab environment and you planning to have more than 2 of those on a host you need to start thinking about 40Gbps instead of aggregating a fistful of 10Gbps cables. Remember the golden rules a single bigger pipe is always better than a bunch of small pipes.

When using 1Gbps you’ll be at that point sooner and as 1Gbps isn’t a great fit for (Dynamic) VMQ anyway I’d say, sure give LACP a spin to try and get a bit more bandwidth but will it really matter? In native workloads it might but with a vSwith?  Modern CPUs eat 1Gbps NICs for breakfast, so I would not bother with VMQ. But when you’re tied to 1Gbps it’s probably due to budget constraints and you might not even have stackable, MLAG, VLT or other capable switches. But the arguments can be made, it depends (see Don’t tell me “It depends”! But it does!). But in any case I start saving for 10Gbps Smile

Today as the PC8100 series and the N4000 Series (budget 10Gbps switches, yes I know “budget” is relative but in the 10Gbps world, but they offer outstanding value for money), I tend to set up MLAG with two of these per rack. This means we have all options and needs covered at no extra cost and without sacrificing redundancy under any condition. However look at the needs of your VMs and the capability of your NICs before using LACP for teaming by default. The fact that switch independent works with any combination of budget switches to get redundancy doesn’t mean it’s only to be used in such scenarios. That’s a perk for those without more advanced gear, not a consolation price.

My best advise: do not over engineer it. Engineer it for the best possible solution for the environment at hand. When choosing a default it’s not about the best possible redundancy and bandwidth under certain conditions. It’s about the best possible redundancy and bandwidth under most conditions. It’s there that switch independent comes into it’s own, today more than ever!

There is one other very good, but luckily also a very rare case where LACP/Switch dependent will save you and switch independent won’t: dead switch ports, where the port becomes dysfunctional. So while switch independent protects against NIC, Switch, cable failures, here it doesn’t help you as it doesn’t know (it’s about link failures, not logical issues on a port).

For the majority of my Hyper-V deployments I do not use switch dependent / LACP. The situation where I did had to do with Windows NLB in combination with ICMP Multicast.

Note: You can do VLT, MLAG, stacking and still leverage switch independent teaming, LACP or static switch dependent is NOT mandatory even when possible.

GPS service issues resolved fast by Hyper-V & site resilience engineering


Diminished services on a GPS positioning network

The past couple of days there had been latencies negatively affecting a near real time GPS positioning service that allow the users the correct their GPS measurements in real time.

Flemish Positioning Service (FLEPOS)

That service is really handy when you’re a surveyor and it safes money by avoiding extra GIS post processing work later. It becomes essential however when you are relying on your GPS coordinates to farm automatically, fly aerial photogrammetry patterns, create mobile mapping data, build dams or railways, steer your dredging ships and maneuver ever bigger ships through harbor locks.

Flemish Positioning Service (FLEPOS)

It was clear this needed to be resolved. After checking for network issues we pretty much knew that the recently spiking CPU load was the cause. Partially due to the growth in users, more and more use cases and partially due to a new software version that definitely requires a few more CPU cycles.

The GPS positioning service is running on multiple virtual machines, on separate LUNS, on separate hosts, those hosts are on separate racks. All this is being replicated to a second data center. They have high to continuous availability with Microsoft Failover Clustering and leverage Kemp Loadmaster load balancing. Together with the operations team we moved the load away from every VM, shut it down, doubled the vCPU count and restarted the VM. Rinse and repeat until all VMS have been assigned more vCPUs.

The results where a dramatic improvement in the response times and services response times that went back to normal.

Breathing room with more vCPUs

They can move fast and efficient

All this was done fast. They have the power to decide and act to resolve such issues on our own responsibility. Now the fact that they operate in tight night team that span over bureaucracy, hierarchies and make sure that people who need to involved can communicate fats en effective (even if they are spread over different locations) makes this possible. They have a design for high availability and a vertically integrated approach to the solution stack that spans any resource (CPU, Memory, Storage, Network and Software) combined with a great app owner and rock solid operational excellence (Peopleware) to enable the Site Resilience side of the story. Fast & efficient.

I’m proud to have help design and deliver this service and I’ll be ready and willing to help design vNext of this solution in the near future. We moved it from hardware to a virtualized solution based on Hyper-V in 2008 and have not regretted one minute of it. The operational capabilities it offers are too valuable for that and banking on Hyper-V has proved to be a winner!

Would Hot Add CPU Capability have made this easier?

Yes, faster for sure Smile. The process they have now isn’t that difficult. Now would I not like hot add vCPU capabilities in Hyper-V? Yes, absolutely. I do realize however that not every application might be able to handle this without restarting making the exercise a bit of a moot point in those cases.

Why some people have not virtualized yet I do not know (try and double the CPUs on your hardware servers easily and fast without leaving the comfort of your home office). I do know how ever that you are missing out on a lot of capabilities & operational benefits.