Introducing 10Gbps & Integrating It into Your Network Infrastructure (Part 4/4)


This is the 4th post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

In my blog post “Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)”, a discussion on NIC teaming brought up the subject of 10Gbps for virtual machine networks. This means our switches will probably no longer exist in isolation, unless those virtual machines never need to talk to anything outside of what’s connected to those switches. That is very unlikely. So we need to start thinking and talking about integrating the 10Gbps switches into our network infrastructure. That means we’re entering the network engineers’ turf again and we’ll need to address some of their concerns. But this is not bad news, as they’ll help us prevent some bad scenarios.

Optimizing the use of your 10Gbps switches

Not everyone runs clusters big enough, or enough smaller clusters, to warrant an isolated network approach for cluster networking alone. As a result you might want to put some of the remaining 10Gbps ports to work for virtual machine traffic. We’ve already pointed out that your virtual machines will not only want to talk amongst themselves (it’s a cluster, and private/internal networks tend to defeat the purpose of a cluster, as they are limited to a single node) but also need to talk to other servers on the network, both physical and virtual. So you have to hook up your 10Gbps switches from the previous example to the rest of the network. Now there are some scenarios where you can keep the virtual machine networks isolated as well within a cluster, for example in a POC lab where you are running a small, 100% virtualized test domain on a cluster in a separate management domain. But that is not the predominant use case.

But you don’t just have to integrate with the rest of your network, you may very well want to! You’ve seen 10Gbps in action for CSV and Live Migration, you’ve got a taste for 10Gbps now, you’re hooked and dream of moving each and every VM network to 10Gbps as well. And while you’re at it, the management network and such too. This is no different from the time you first got hold of 1Gbps networking kit in a 100Mbps world. Speed is addictive; once you’re hooked you crave more.

How to achieve this? You could do it by replacing the existing 1Gbps switches. That takes money, no question about it. But think ahead: 10Gbps will be commonplace in a couple of years’ time (read: prices will drop even more). Servers with 10Gbps LOM cards are here, or will be here very soon, from every major vendor. For Dell this means that the LOM NICs will be like mezzanine cards and you decide whether to plug in 10Gbps SFP+ or Ethernet jacks. When you opt to replace some current 1Gbps switches with 10Gbps ones you don’t have to throw them away. What we did at one location is recuperate the 1Gbps switches for out-of-band remote access (ILO/DRAC cards), which in today’s servers also run at 1Gbps speeds. Their older 100Mbps switches were taken out of service. No emotional attachment here. You could also use them to give some departments or branch offices 1Gbps to the desktop if they don’t have that yet.

When you have ports left over on the now isolated 10Gbps switches and you don’t have any additional hosts arriving in the near future requiring CSV & LM networking, you might as well use those free ports. If you still need extra ports you can always add more 10Gbps switches. But whatever the case, this means uplinking those cluster network 10Gbps switches to the rest of the network. We already mentioned in a previous post that the network people might have some concerns that need to be addressed, and rightly so.

Protect the Network against Loops & Storms

The last thing you want to do is bring down your entire production network with a loop and a resulting broadcast storm. You also don’t want the otherwise rather useful Spanning Tree Protocol locking out part of your network and ruining your sweet cluster setup, or traffic specifically intended for your 10Gbps network being routed over a 1Gbps network instead.

So let us discuss some of the ways in which we can prevent all these bad things from happening. Now mind you, I’m far from an expert network engineer, so to all CCIE holders stumbling onto this blog post: please forgive me my prosaic network insights. Also keep in mind that this is not a networking or switch configuration course. That would lead us astray a bit too far, and it is very dependent on your exact network layout, needs, brand and model of switches, etc.

As discussed in blog post Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4), you need a LAG between your switches, as the traffic for the VLANs serving heartbeat, CSV, Live Migration and virtual machines, but now also perhaps the host management and optional backup network, must flow between the switches. As long as you have only two switches that have a LAG between them or that are stacked, you don’t have much risk of creating a loop on the network. Unless you uplink two ports directly with a network cable. Yes, that happens. I once witnessed a loop/broadcast storm caused by someone who was “tidying up” spare CAT5E cables by plugging all the loose ends into free switch ports. Don’t ask. Lesson learned: disable every switch port not in use.

Now once you uplink those two or more 10Gbps switches to your other switches in a redundant way you have a loop. That’s where the Spanning Tree Protocol comes in. Without going into detail, it prevents loops by blocking the redundant paths. If the operational path becomes unavailable, a new path is established to keep network traffic flowing. There are some variations of STP. One of them is the Rapid Spanning Tree Protocol (RSTP), which does the same job as STP but a lot faster. Think a couple of seconds to establish a path versus 30 seconds or so. That’s a nice improvement over the early days. Another one that is very handy is the Multiple Spanning Tree Protocol (MSTP). The sweet thing about the latter is that it blocks per VLAN, and in the case of Hyper-V or storage networks this can come in quite handy.
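To make the path selection tangible, here’s a minimal sketch in Python. The port costs are the IEEE 802.1t/802.1w defaults; the topology (a direct 10Gbps LAG versus a detour over two 1Gbps uplinks) is purely illustrative.

```python
# Minimal sketch: how spanning tree decides which redundant path to block.
# Port costs are the IEEE 802.1t/802.1w defaults; the topology (a direct
# 10Gbps LAG versus a detour over two 1Gbps uplinks) is illustrative.
STP_PORT_COST = {"100Mbps": 200_000, "1Gbps": 20_000, "10Gbps": 2_000}

paths = {
    "direct 10Gbps LAG": [STP_PORT_COST["10Gbps"]],
    "detour via 1Gbps core": [STP_PORT_COST["1Gbps"], STP_PORT_COST["1Gbps"]],
}

# Spanning tree keeps the lowest-cost path and blocks the redundant one(s).
for name, hops in sorted(paths.items(), key=lambda item: sum(item[1])):
    print(f"{name}: cost {sum(hops)}")
# direct 10Gbps LAG: cost 2000      -> stays forwarding
# detour via 1Gbps core: cost 40000 -> blocked
# Misconfigure the costs and the opposite happens: your cluster traffic
# ends up on the 1Gbps detour.
```

The takeaway: spanning tree picks paths by cost, not by what you intended, so verify the costs match your design.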

Think about it. Apart from preventing loops, which are very, very bad, you’d also like to make sure that the network traffic doesn’t travel along unnecessarily long paths or over links that are not suited for its needs. Imagine the Live Migration traffic between two nodes on different 10Gbps switches travelling over the 1Gbps uplinks to the 1Gbps switches because STP blocked the 10Gbps LAG to prevent a loop. You might be saturating the 1Gbps infrastructure, and that’s not good.

I said MSTP could be very handy; let’s address this. You only need the uplink to the rest of the network for the host management and virtual machine traffic. The heartbeat, CSV and Live Migration traffic also stops flowing when the LAG between the two 10Gbps switches is blocked by RSTP. This is because RSTP works at the LAG level for all VLANs travelling across that LAG and doesn’t discriminate between VLANs. MSTP is smarter and only blocks the required VLANs, in this case the host management and virtual machine VLANs, as these are the only ones travelling across the link to the rest of the network.
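A toy model of that difference, with hypothetical VLAN IDs (10 = host management, 20 = virtual machines, 30 = heartbeat/CSV, 40 = Live Migration):

```python
# Toy model: RSTP vs MSTP blocking on the inter-switch LAG.
# VLAN IDs are hypothetical: 10 = host management, 20 = VM traffic,
# 30 = heartbeat/CSV, 40 = Live Migration.
lag_vlans = {10, 20, 30, 40}   # everything trunked across the 10Gbps LAG
uplink_vlans = {10, 20}        # only these also exist on the uplinks,
                               # so only these can form a loop

# RSTP blocks the whole LAG, taking every VLAN on it down with it.
rstp_blocked = lag_vlans

# MSTP runs a spanning tree per instance (per VLAN group), so it only
# blocks the VLANs that actually have a redundant path.
mstp_blocked = lag_vlans & uplink_vlans

print("RSTP blocks:", sorted(rstp_blocked))   # [10, 20, 30, 40]
print("MSTP blocks:", sorted(mstp_blocked))   # [10, 20]; VLANs 30 & 40
                                              # keep flowing over the LAG
```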

We’ll illustrate this with some pictures based on our previous scenarios. In this example we have the cluster networks going to the 10Gbps switches over non-teamed NICs. The same goes for the virtual machine traffic, except those NICs are teamed, as are the host management NICs. Let’s first show the normal situation.

[Figure: the normal situation, with all traffic flowing over the 10Gbps switches and the LAG between them]

Now look at a situation where RSTP blocks the purple LAG. Please do note that if the other network switches are not 10Gbps, traffic for the virtual machines would be travelling over more hops and at 1Gbps. This should be avoided, but if it does happen, MSTP would prevent an even worse scenario. Now if you were to define the VLANs for cluster network traffic on those (orange) uplink LAGs you could use RSTP with a high cost, but in the event that RSTP blocks the purple LAG you’d be sending all heartbeat, CSV and Live Migration traffic over those main switches. That could saturate them. It’s your choice.

[Figure: RSTP blocks the purple LAG; virtual machine and host management traffic now travels via the orange uplinks and the main switches]

In the picture below MSTP saves the day, providing loop-free network connectivity even if spanning tree for some reason needs to block the LAG between the two 10Gbps switches. MSTP saves your cluster network traffic connectivity because those VLANs are not defined on the orange LAG uplinks, and MSTP prevents loops by blocking VLAN IDs in LAGs, not by blocking entire LAGs.

[Figure: with MSTP only the host management and virtual machine VLANs are blocked on the LAG; the cluster VLANs keep flowing across it]

To conclude I’ll also mention a more “star-like” approach to uplinking switches. This has a couple of benefits, especially when you use stackable switches to uplink to. They provide the best bandwidth available for upstream connections and they provide good redundancy because you can uplink the LAG to separate switches in the stack. There is no possibility for a loop this way and you get great performance on top. What’s not to like?

[Figure: star-like uplinks, with each LAG from the 10Gbps switches spread across separate members of the switch stack]

Well, we’ve shown that each network setup has optimal, preferred network traffic paths. We can enforce these by proper LAG & STP configuration. Other, less optimal, paths can become active to provide resiliency for our network. Such a situation must be addressed as soon as possible and should be considered running on “emergency backup”. You can avoid such events, except for the most extreme situations, by configuring the RSTP/MSTP costs for the LAGs correctly and by using multiple inter-switch links in every LAG. This not only provides extra bandwidth but also protects against cable or port failure.

Conclusion

And there you have it: over a couple of blog posts I’ve taken you on a journey through considerations about not only using 10Gbps in your Hyper-V cluster environments, but also about cluster networking considerations as a whole. Some notes from the field, so to speak. As I told you, this was not a deployment or best practices guide. The major aim was to think out loud, share thoughts and ideas. There are many ways to get the job done and it all depends on your needs and existing environment. If you don’t have a network engineer on hand and you can’t do this yourself, you might be ready by now to get one of those business ready configurations for your Hyper-V clustering. Things can get pretty complex quite fast. And we haven’t even touched on storage design, management, etc. The purpose of these blog posts was to think about how Hyper-V cluster networks function and behave and to investigate what is possible. When you’re new to all this but need to make the jump into virtualization with both feet (and you really do), a lot of help is available. Most hardware vendors have fast tracks, reference architectures that have a list of components to order to build a Hyper-V cluster, and more often than not they or a partner will come set it all up for you. This reduces both risk and time to production. I hope that if you don’t have a greenfield scenario but want to start taking advantage of 10Gbps networking, this has given you some food for thought.

I’ll try to share some real-life experiences, and what improvements we actually see with 10Gbps speeds, in a future blog post.

Experts2Experts Virtualization Conference London 2011 Selling Out Fast!


It seems a lot of Hyper-V expertise is converging on London in November 2011 for the small but brilliant Experts2Experts Virtualization Conference. I’m looking forward to learning a lot from them and listening to real-world experiences of people who deal with the technologies on a daily basis. It will also be nice to meet up with a lot of online acquaintances from the blogosphere and Twitter. The conference is selling out fast. That’s due to the quality, small scale and very economic attendance fee. So if you want to meet up with and listen to the expertise the likes of Aidan Finn, Jeff Wouters, Carsten Rachfahl, Ronnie Isherwood and hopefully Kristian Nese have to share, you’d better hurry up and register right now.

I’ll be sharing some musings on “High Performance & High availability Networks for Hyper-V Clusters”.

Perhaps we’ll meet.

Hyper-V Cluster Nodes Upgrade: Zero Down Time With Intel VT FlexMigration


Well, the oldest Hyper-V cluster nodes are 3+ years old. They’ve been running Hyper-V clusters since the RTM of Hyper-V for Windows 2008. Yes, you needed to update the “beta” version to the RTM version of Hyper-V that came later. A bit of a messy decision back then, but all in all that experience was painless.

These nodes/clusters were upgraded to W2K8R2 Hyper-V clusters very soon after that SKU went RTM, but now they have reached the end of their “Tier 1” production life. The need for more capacity (CPU, memory) was felt. Scaling out was not really an option. The cost of fiber channel cards is big enough, but fiber channel switch ports need activation licenses and the cost for those borders on legalized extortion.

So upgrading to more capable nodes was the standing order. Those nodes became DELL R810 servers. The entire node upgrade process itself is actually quite easy. You just live migrate the virtual machines over to clear a host, which you then evict from the cluster. You recuperate the fiber channel HBAs to use in the new node that you then add to the cluster. You just rinse and repeat until you’re done with all nodes. Thank you Microsoft for the easy clustering experience in Windows 2008 (R2)! Those nodes now also have 10Gbps networking kit to work with (Intel X520-DA SFP+).

If you do your homework this process works very well. The cool thing is there is not much to do on the SAN/HBA/fiber switch configuration side as you recuperate the HBAs with their World Wide Names. You just need to update some names/descriptions to represent the new nodes. The only thing to note is that the cluster validation wizard nags about inconsistencies in node configuration (service packs). That’s because the new nodes were installed with SP1 integrated, as opposed to the original ones having been upgraded to SP1, etc.

The beauty is that by sticking to Intel CPUs we could live migrate the virtual machines between nodes having Intel E5430 2.66GHz CPUs (5400-series "Harpertown") and those having the new X7560 2.27GHz CPUs (Nehalem EX “Beckton”). There was no need to use the “Allow migration to a virtual machine with a different processor” option. Intel’s investment (and ours) in VT FlexMigration is paying off, as we had a zero downtime upgrade process thanks to this.


You can read more about Intel VT FlexMigration here.

And in case you’re wondering: those PE2950 III servers are getting a second life. Believe it or not, there are software vendors that don’t have application life cycle management, virtualization support or roadmaps. So some hardware comes in handy to transplant those servers onto when needed. Yes, it’s 2011 and we’re still dealing with that crap in the cloud era. I do hope the vendors of those applications get the message, or that management cuts the rope and lets them fall.

Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)


This is the 3rd post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

As you saw in my previous blog post “Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)”, we created an isolated network for Hyper-V cluster networking needs, i.e. heartbeat, Cluster Shared Volume and Live Migration traffic. When you set up failover clustering you’re doing so to achieve some level of high availability. We did this by using 2 switches and setting up redundant paths to them, making use of the fault tolerance the cluster networks offer us. The dark side of high availability is that it always exposes the next single point of failure, and when it comes to networking that means you’ll need redundant NICs, NIC ports, cabling and switches. That’s what we’ll discuss in this blog post. All the options below are just that: options. There is never an obligation to use them everywhere and they might not be needed depending on the type of network and the business needs we’re talking about. But one thing I have learned is to build options into your solutions. You want ways and opportunities to work around issues while you fix them.

Redundant Switches

The first thing you’ll need to address is the loss of a switch. The better ones have redundant power supplies but that’s about it. So you’ll need to have (at least) two switches and make sure you have redundant connections to both switches. That implies both switches can talk to each other as they form one functional unit even when it is an isolated network as in our example.

One of the ways we can achieve this is by setting up a Link Aggregation Group (LAG) over Inter Switch Links (ISL). The LAG makes all the connections available between the switches for the VLANs you define. There are different types of LAG but one of the better ones is a LAG with LACP.

Stacking your switches might also be a solution if they support it. You might need stacking modules for that. Basically this turns two or more switches into one big switch. One switch in the stack acts as the master switch that maintains the entire stack and provides a single configuration and monitoring point. If a switch in the stack fails, the remaining switches will bypass the failed switch via the stacking modules. Depending on the quality of your network equipment you can have some disruption during the failure of the master switch, as another switch then needs to take on that role, and this can take anything between 3 seconds and a minute depending on vendor, type, firmware, etc. Network people like this. And as each switch contains the entire stack configuration it’s very easy to replace a dead switch in a stack. Just rip out the dead one, plug in the replacement one and the stack will do the rest.

We note that more people have access to switches that can handle LAGs than to stackable ones. The reason for this is that the latter tend to be pricier.

Redundant Network Cards & Ports

Now whether you’re using LAGs or stacking, the idea is that you connect your NICs to different switches for redundancy. The question is: do we need to do something with the NIC configuration to benefit from this? Do we have redundancy via a cluster-wide virtual switch or not? If not, can we use NIC teaming? Is NIC teaming always needed or a good idea? OK, let’s address some of these questions.

First of all, Hyper-V in the current Windows Server 2008 R2 SP1 version has no cluster-wide virtual switch that can provide redundancy for your virtual machine network(s). But please allow me to dream about Hyper-V 3.0. To achieve redundancy for the virtual machine networks you’ll need to turn to NIC teaming. NIC teaming has various possible configurations depending on vendor and the capabilities of the switches in use. You might be familiar with terminology like Switch Fault Tolerance (SFT), Adaptive Fault Tolerance (AFT), Link Aggregation Control Protocol (LACP), etc. Apart from all that, the biggest thing to remember is that NIC teaming support has to come from the hardware vendor(s). Microsoft doesn’t support it directly for Hyper-V, and Hyper-V gets access to a teamed NIC via the Windows operating system.

On NIC Teaming

I’m going to make a controversial statement. NIC teaming can be, and often is, a cause of issues, and it can be expensive in time to both set up and fix if it fails. Apart from a lot of misconceptions and terminology confusion with all the possible configurations, we have another issue. NIC teaming introduces complexity with drivers & software that is at least a hundredfold more likely to cause failures than today’s high quality network cards. On top of that, sometimes people forget about the proper switch configuration. Ouch!

Do a search on Hyper-V and NIC teaming and you’ll see the headaches it causes so many people. Do you need to stay away from it? Is it evil? No, I’m not saying that. Far from it, NIC teaming is great. You need to decide carefully where and when to use it and in what form. Remember, when you can handle & manage the complexity needed to achieve high availability, generally speaking you’re good to go. If complexity becomes a risk in itself, you’re on the wrong track.

Where do I stand on NIC teaming? Use it when it really provides the benefits you seek. Make sure you have the proper NICs, switches and software/drivers for what you’re planning to do. Do your research and test. I’ve done NIC teaming that went so smoothly I never would have realized the headaches it can give people. I’ve done NIC teaming where buggy software and drivers drove me crazy.

I’d like to mention security here. Some people tend to do a lot of funky, tedious configurations with VLANs in an attempt to enhance security. VLANs are not security mechanisms. They can be used in a secure implementation, but by themselves they achieve nothing. If you’re doing this via NIC teaming/VLANs, I’d like to note that once someone has access to your Hyper-V management console and/or the switches, you’re toast. Logical and physical security cannot be replaced or ignored.

NIC Teaming To Enhance Throughput

You can use NIC teaming to enhance bandwidth/throughput. If this is your major or only goal, you might not even be worried about using multiple switches. Now NIC teaming does help provide better bandwidth, sure, but nothing beats buying 10Gbps switches & NICs. Really, switches with LAGs or stacking plus NIC teaming are great, but bigger pipes are always better for raw throughput. If you need twice or quadruple the ports only for extra bandwidth, this gets expensive very fast. And if, on top of that, you need consultants because you don’t have a network engineer to set it all up just for that purpose, save your money and invest it in hardware.
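One reason bigger pipes beat teams for raw throughput: most teaming and LAG load-balancing modes hash each traffic flow to a single team member, so any single flow is still capped at the speed of one NIC. A minimal sketch of that behaviour, with made-up host names and a simplified hash (real teams hash on MAC, IP or TCP/UDP tuples depending on the mode):

```python
# Minimal sketch of hash-based load balancing in a NIC team or LAG.
# Host names are made up; crc32 stands in for the vendor's hash.
import zlib

team_members = ["NIC1", "NIC2", "NIC3", "NIC4"]  # a 4 x 1Gbps team

def pick_member(src: str, dst: str) -> str:
    """Every packet of a given flow hashes to the same team member."""
    flow_hash = zlib.crc32(f"{src}->{dst}".encode())
    return team_members[flow_hash % len(team_members)]

# One big file copy between two hosts always lands on the same NIC,
# so it is capped at 1Gbps no matter how many members the team has.
print(pick_member("hostA", "hostB"))
print(pick_member("hostA", "hostB"))  # same member again

# Only many flows to/from different peers spread across the team:
for dst in ("hostB", "hostC", "hostD", "hostE"):
    print(dst, "->", pick_member("hostA", dst))
```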

NIC Teaming For Redundancy

Do you use NIC teaming for redundancy? Yes, this is a very good reason when it fits the needs. Do you do this for all networks? No, it depends. Just for heartbeat, CSV & Live Migration traffic it might be overkill. The nature of these networks in a Hyper-V cluster is such that you don’t really need it, as they can mutually provide redundancy for each other. But what if a NIC port fails when I’m doing a live migration? Won’t that mean the live migration will fail? Yes. But once the NIC is out of the picture, Live Migration will just work over the CSV network if you set it up that way. And you’re back in business while you fix the issue. Have I seen live migration fail? Yes, sure. But it never left the VM messed up; it kept running. So you fix the issue and live migrate it again.

The same goes for the other networks. CSV should not give you worries. That traffic gets queued and sent over the next available network for CSV. Heartbeat is also not an issue. You can afford the little “downtime” until it is sent over the next available network for cluster communication. Really, a properly set up cluster doesn’t go down when a cluster network fails if you have multiple of them.

But NIC teaming could/would prevent even this ever so slight interruption, you say. It can, yes, depending on how you set it up, so not always by definition. But it’s not needed. You’re preventing something benign at great cost. Have you tested it? Is it always a lossless, completely transparent failover? Not a single packet dropped? Not one ping failed? If so, well done! At what cost and for what profit did you do it? How often do your NICs and switch ports fail? Not very often. Also remember the extra complexity and the risk of (human) configuration errors. As always, trust but verify; testing is your friend.

Paranoia Is Your Friend

If you set up NIC teaming across ports on the same NIC card (rather than separate cards) and the PCIe slot goes bad, NIC teaming won’t save you. So you need multiple network cards. On top of that, if you decide to run all networks over that one team you put all your eggs in one basket. So perhaps you might need 2 teams distributed over multiple NIC cards. Oh boy, redundancy and high availability do make for expensive setups.
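A toy failure-domain check makes the point; the card and port names are illustrative:

```python
# Toy failure-domain check for a NIC team. Card/port names are illustrative.
team_same_card = [("card1", "port1"), ("card1", "port2")]  # two ports, one card
team_two_cards = [("card1", "port1"), ("card2", "port1")]  # spread over cards

def survives_card_failure(team, failed_card):
    """True if at least one team member sits on a surviving card."""
    return any(card != failed_card for card, _ in team)

print(survives_card_failure(team_same_card, "card1"))  # False: the card/slot
                                                       # is still a SPOF
print(survives_card_failure(team_two_cards, "card1"))  # True
```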

Combine NIC Teaming & VLANs To Work Around Limited NIC Ports

This can be a good idea. As you’ll be pushing multiple networks (VLANs) over the same pipe, you want redundancy. So NIC teaming here can definitely help out. You’ll need to consider the amount of network traffic in this case as well. If you use load balancing NIC teaming you can get some extra bandwidth, but don’t expect miracles. Think about the potential for bottlenecks and QoS, and try to separate bandwidth hogs onto separate teams. And remember, bigger pipes are always better, so consider 10Gbps when you are in a bandwidth crunch.

Don’t Forget About The Switches

As a friendly reminder of what we already mentioned above: don’t forget to use different switches for uplinking the NIC ports. If you do forget, your switch is the single point of failure (SPOF). Welcome to high availability: always hitting the next SPOF and figuring out how big the risk is versus the cost in money and complexity to deal with it. Switches don’t often fail, but I’ve seen sysadmins pull out the wrong PDU cables. Yes, human error lurks in all corners in all possible variations. I know this would never happen to you, and certainly not twice, but other people are not so skillful. And for those who’d rather be lucky than good I have bad news. Luck runs out. Inevitably bad things happen to all of our systems.

Some Closing Thoughts

One rule of thumb I have is not to use NIC teaming to save money by reducing NIC cards, NIC ports, number of switches or switch ports. Use it when it serves your needs and procure adequate hardware to achieve your goals. You should do it because you have a real need to provide the absolute best availability, and then you put down the money to achieve it. If you talk the talk, you need to walk the walk. And while not the subject of this post, your Active Directory or other core infrastructure services are not single points of failure, are they?

If you do want to use it to save money or to work around a lack of NIC ports, there is nothing wrong with that, but say so and accept the risk. It’s a valid decision when you have your needs covered and are happy with what that solution provides.

When you take all of these options into consideration, where do you end up with NIC teaming and network solutions for Hyper-V clusters? You end up with the “business ready” or “reference architecture” configurations offered by DELL or HP. They weigh all pros and cons against each other and make a choice based on providing the best possible solution for the largest number of customers at acceptable costs. Is this the best for you? That could very well be. It all depends. They make pretty good configurations.

I tend to use NIC teaming only for the virtual machine networks. That’s where the biggest potential service interruption exists. In certain environments where NIC teaming was not chosen, I have mitigated that risk by providing 2 or 3 single NICs for 2 or 3 virtual networks in Hyper-V. That reduces the impact of a NIC failure to 1/3 of the virtual machines. And the fix for a broken NIC is easy: just attach the VMs to a different virtual network. You can do this while the virtual machines are running, so no shutdown is required. As an added benefit you balance the network traffic over multiple NICs.

10Gbps with NIC teaming and VLANs provides for some very nice scenarios. This is especially true if you have bandwidth-hungry applications running in a boatload of VMs. This all means that we need to start thinking and talking about integrating the 10Gbps switches into our network infrastructure. So that means we’re entering the network engineers’ turf and we’ll need to address some of their concerns. But this is not bad news, as they’ll help us prevent some bad scenarios. That will be discussed in the next blog post.

10Gbps Bragging Rights, Nothing More


Well, when you get to play with some 10Gbps network gear and experiment a little, you see some pretty nice file copy transfer times. But things are not very consistent. “It all depends”. So do experiment a lot & test things out. At a given moment you get rewarded with this:

[Screenshot: the file copy running at 61% utilization of the 10Gbps pipe]

That’s right. The most successful experiment used 61% of a 10Gbps pipe, and the file transfer speed for a 25GB VHD file was, in one word, amazing. We used jumbo frames and we disabled … (same old story). The only catch is that this kind of throughput turns the receiving server into a shell-shocked piece of hardware. During those 35 seconds it could hardly get anything else done, bar for those poor disks. Perhaps you’ll want to throttle it down a little in a production environment.
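For what it’s worth, those numbers hang together; a quick back-of-the-envelope check (assuming the 25GB is binary gigabytes):

```python
# Back-of-the-envelope check of the transfer above.
# Assumes the 25GB is binary gigabytes (GiB); with decimal GB the
# utilization comes out a few percent lower.
file_size_bits = 25 * 2**30 * 8   # 25 GiB in bits
duration_s = 35
link_bps = 10 * 10**9             # 10Gbps

throughput_bps = file_size_bits / duration_s
print(f"Throughput: {throughput_bps / 1e9:.2f} Gbps")        # ~6.14 Gbps
print(f"Link utilization: {throughput_bps / link_bps:.0%}")  # ~61%
```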

Virtualization with Hyper-V & The NUMA Tax Is Not Just About Dynamic Memory


First of all, to be able to join in this little discussion you need to know what NUMA is and does. You can read up on that on the Intel (or AMD) website, for example at http://software.intel.com/en-us/blogs/2009/03/11/learning-experience-of-numa-and-intels-next-generation-xeon-processor-i/ and http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa/. Do have a look at the following SQLskills blog post, http://www.sqlskills.com/blogs/jonathan/post/Understanding-Non-Uniform-Memory-AccessArchitectures-(NUMA).aspx, which has some great pictures to help visualize the concepts.

What Is It And Why Do We Care?

We all know that a CPU contains multiple cores today: 2, 4, 6, 8, 12, 16, etc. So in terms of a physical CPU we tend to talk about a processor that fits in a socket, and about cores for logical CPUs. When hyperthreading is enabled you double the number of logical processors seen and used. It is said that Hyper-V can handle hyperthreading, so you can leave it on. The logic being that it will never hurt performance and can help to improve it. I suggest you test it, as there was a performance bug with it once. A processor today contains its own memory controller, and access to memory from that processor is very fast. The NUMA node concept is older than multi-core processor technology, but today you can state that a NUMA node translates to one processor/socket and that all cores contained in that processor belong to the same NUMA node. Sometimes a processor contains two NUMA nodes, like the AMD 12-core processors. In the future, with the ever increasing number of cores, we’ll perhaps see even more NUMA nodes per processor. You can state that all Intel processors since Nehalem with QuickPath Interconnect and AMD processors with HyperTransport are NUMA processors. But to be sure, check with your vendors before buying. Assumptions, right?

Beyond NUMA nodes there is also a thing called processor groups, which help Windows use more than 64 logical processors (its former limit) by grouping logical processors into groups of up to 64, of which Windows handles 4, meaning in total Windows today can support 4 × 64 = 256 logical processors. Due to the fact that memory access within a NUMA node is a lot faster than between NUMA nodes, you can see where a potential performance hit is waiting to happen. I tried to create a picture of this concept below. Now you know why I don’t make my living as a graphical artist.

[Figures: schematic of NUMA nodes, each processor with its own local memory, showing fast local versus slower remote memory access]
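To put some numbers on the counting involved, a small sketch (the socket and core counts are just an example):

```python
# Small sketch of the logical processor and processor group counting.
# The socket and core counts are an illustrative example.
sockets = 4
cores_per_socket = 8
hyperthreading = True

logical_cpus = sockets * cores_per_socket * (2 if hyperthreading else 1)
print(f"Logical processors: {logical_cpus}")  # 64

# Windows groups logical processors in groups of at most 64 and, at the
# time of writing, handles 4 groups: 4 * 64 = 256 logical processors max.
GROUP_SIZE, MAX_GROUPS = 64, 4
groups_needed = -(-logical_cpus // GROUP_SIZE)  # ceiling division
print(f"Processor groups needed: {groups_needed} of max {MAX_GROUPS}")
```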


To make it very clear: NUMA is great and helps us in a lot of ways. But under certain conditions and with certain applications it can cause us to take a (serious) performance hit. And if there is anything certain to ruin a system administrator’s day, it is a brand new server with a bunch of CPUs and loads of RAM that isn’t running any better (or runs worse?) than the one you’re replacing. Current hypervisors like Hyper-V are NUMA aware, and the better server applications like SQL Server are as well. That means that under the hood they are doing their best to optimize CPU & memory usage for performance. They do a very good job actually, and you might, depending on your environment, never ever know of any issue or even of the existence of NUMA.

But even with a NUMA-aware hypervisor and NUMA-aware applications you run the risk of having to go to remote memory. The introduction of Dynamic Memory in Windows 2008 R2 SP1 even increases this likelihood, as there is a lot of memory reassigning going on. Dynamic Memory actually educated a lot of Hyper-V people on what NUMA is and what to look out for. Until Dynamic Memory came on the scene, and the evangelizing that came with it by Microsoft, it was "only" the people virtualizing SQL Server or Exchange & other big hungry applications who were very aware of NUMA, with its benefits and potential drawbacks. If you’re lucky the application is NUMA aware, but not all of them are, even the big names.

A Peek Into The Future

As it bears on this discussion, what is interesting is that leaked screenshots from Hyper-V 3.0 or vNext … have NUMA configuration options for both memory and CPU at the virtual machine level! See Numa Settings in Hyper-V 3.0 for a picture. So the times that you had to script WMI calls (see http://blogs.msdn.com/b/tvoellm/archive/2008/09/28/looking-for-that-last-once-of-performance_3f00_-then-try-affinitizing-your-vm-to-a-numa-node-.aspx) to assign a VM to a NUMA node might be over soon (speculation alert), and it seems like a natural progression from the ability to disable NUMA spanning with W2K8R2 SP1 Hyper-V in case you need to avoid NUMA issues at the Hyper-V host level. Hyper-V today is already pretty NUMA aware and as such it will try to get all memory for a virtual machine from a single NUMA node; only when that can’t be done will it span NUMA nodes. So as stated, Hyper-V with Windows Server 2008 R2 SP1 can prevent spanning from happening, as we can now disable it for a Hyper-V host. The downside is that a VM can’t get more memory even if it’s available on the host.

[Screenshot: the NUMA node spanning setting for a Hyper-V host]

A working approach to reduce possible NUMA overhead is to limit the number of CPUs to 2, as this gives the largest amount of memory to each CPU, in this case 50%. 4 CPUs only control 25% each, etc. So with more CPUs (and NUMA nodes) the risk of NUMA spanning gets bigger very fast. For memory intensive applications scaling out is the way to go. Actually you could state that we scale up the NUMA nodes per socket (lots of cores with the largest amount of directly accessible memory possible) and as such do not scale up the server. If you can, keep your virtual machines tied to a single CPU on a dual-socket server to try and prevent any indirect memory access and thus a performance hit. But that won’t always work. If you ever wondered when an 8/12/16-core CPU comes in handy, well voila … here’s a perfect case: packing as many cores as possible onto a CPU becomes very handy when you want to limit sockets to prevent NUMA issues but still need plenty of CPU cycles. This works as long as you can address large amounts of RAM per socket at fast speeds and the CPU internally isn’t cut up into too many NUMA nodes, which would be scaling out NUMA nodes within the same CPU, and we don’t want that or we’re back to a performance penalty.
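A quick sketch of that arithmetic (the server and VM sizes are hypothetical):

```python
# Quick sketch of the NUMA-spanning arithmetic.
# Server and VM sizes are hypothetical.
total_memory_gb = 128
numa_nodes = 2          # e.g. a dual-socket box with one node per socket

memory_per_node_gb = total_memory_gb / numa_nodes
print(f"Local memory per NUMA node: {memory_per_node_gb:.0f} GB")  # 64 GB

for vm_memory_gb in (16, 48, 80):
    fits = vm_memory_gb <= memory_per_node_gb
    note = "fits in one node" if fits else "must span nodes (remote memory)"
    print(f"VM with {vm_memory_gb} GB: {note}")
# With 4 sockets each node only holds 25% of total memory, so the odds
# of spanning - and the performance hit that comes with it - rise fast.
```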

Stacking The Deck

One way of stacking the deck in your favor is to keep the heavy apps on their own Hyper-V cluster. Then you can tweak it all you want to optimize for SQL Server, Exchange, etc. When you throw these virtual machines into your regular clusters, or for crying out loud onto a VDI cluster, you’re going to wreak havoc on the performance. Just like mixing server virtualization & VDI is a bad idea (don’t do it), throwing vCPU-hungry, memory-hogging servers onto those clusters just kills the performance and capacity of a perfectly good cluster. I have gotten into arguments over this, as some think one giant cluster for whatever need is better. Well no, you’ll end up micromanaging placement of VMs with very different needs on that cluster, effectively “cutting” it up into smaller “cluster parts”. Now, are separate clusters for different needs always the better approach? No, it depends. If you only have some small SQL Server needs you can get away with one nice cluster. It depends, I know, the eternal consultant’s answer, but I have to say it. I don’t want to get angry mails from managers because someone set up a 6-node cluster for a couple of SQL Server Express databases. There are also concepts called testing, proof of concept, etc. It’s called evidence based planning. Try it; it has some benefits that become very apparent when you’re going to virtualize beefy SQL Server, SharePoint and Exchange servers.

How do you even know it is happening, apart from empirical testing? Aha, excellent question! Take a look at the "Hyper-V VM Vid Numa Node" counter set and read this blog entry on the subject: http://blogs.msdn.com/b/tvoellm/archive/2008/09/29/hyper-v-performance-counters-part-five-of-many-hyper-vm-vm-vid-numa-node.aspx. And keep an eye on the event log for http://technet.microsoft.com/hi-in/library/dd582929(en-us,WS.10).aspx (for some reason there is no comparable entry for W2K8R2 on TechNet).

Conclusions

To conclude, all of the above, people, is why I’m interested in some of the latest generation of servers. Their architecture allows a processor to address twice the "normal" amount of memory when you only put dual CPUs on a quad-socket motherboard. The Dell PowerEdge R810 and the M910 have this; it’s called the FlexMem Bridge and it allows more memory to be available without a performance hit. They also allow for more memory per socket at higher speeds. If you attach a lot of directly addressable memory to one CPU you see a speed drop: a DELL R710 with 48 GB of RAM runs at 1066 MHz, but put 96 GB in there and you fall back to 800 MHz. So yes, bring on those new quad-socket motherboards with just 2 sockets used, a bunch of fast, directly accessible memory in a neat 2-unit server package with lots of space for NIC cards & FC HBAs if needed. Virtualization heaven :-) That’s what I want, so I can give my VMs running SQL Server 2008 R2 & "Denali" (when can I call it SQL Server 2012?) a bigger amount of directly accessible memory from their NUMA node. This can be especially helpful if you need to run NUMA-unaware applications like SAP or such. Testing is the way to go to know how well a NUMA-aware hypervisor and a NUMA-aware application figure out the best approach to optimize the NUMA experience together. I’m sure we’ll learn more about this as more information becomes available and as technology evolves. For now we optimize for performance with NUMA where we can, when we can, with what we have :-)

For Exchange 2010 (we even have virtualization support for DAG mailbox servers now as well) scaling out is easier, as we have all the neatly separated roles and control just about everything down to the mail client. With SQL Server applications this is often less clear. There is a varied selection of commercial and home grown applications out there and a lot of them can’t even scale out, only up. So your mileage may vary. But for resource & memory heavy applications under your control, for now, scaling out is the way to go.

Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)


This is the 2nd post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

Introduction

In this post we continue along the train of thought we set out in a previous blog post, “Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)”. Let’s say you want to set up a Hyper-V cluster for SQL Server virtualization. Your business & IT manager told you they need you to provide the best performance you can get. They followed up on that statement with a real budget, so you can buy high end servers (blades or rack) and spec them out optimally for SQL Server. You take into consideration NUMA issues, vCPU:pCPU ratios, SQL memory demands, the current 4 vCPU limit in Hyper-V, etc. By the way, this will be > 16 vCPUs with Windows Server 8, which leads me to believe the 64GB memory ceiling for virtual machines will also be broken. But for now this means that with regard to CPU & memory you’ve done all you can. That leaves only networking and IO to deal with. Now the IO is food for another & very extensive discussion, but basically you have to design that around the needs of the application(s) or you’ll be toast. The network part is what we’ll tackle here.

Without going into details, what does a Hyper-V cluster need in terms of networking?

| Who/What | Function | Traffic | Connection Type |
| --- | --- | --- | --- |
| Host Management | Hyper-V host connectivity. | Relatively low bandwidth. But don’t forget about deploying VMs or backups. | Public |
| VM Network | Provides network connectivity to the VMs. | Very dependent on the VMs using it. | Dedicated Hyper-V |
| Cluster Heartbeat | Internal cluster communication to determine the status of other cluster nodes. | Not much traffic, but it needs low latency or the cluster might think it’s in trouble due to dropped packets. OK to combine with CSV. | Private cluster network |
| Cluster Shared Volume (CSV) | For updating CSV metadata & scenarios where redirected I/O is required. | Mostly idle. When in redirected I/O it demands high bandwidth & low latency. | Private cluster network |
| Live Migration | Used to transfer running VMs from one cluster node to another. | Mostly idle. When live migrating it demands high bandwidth & low latency. | Private cluster network |

Host Management: It is fine to leave this on 1Gbps, unless you have a need to deploy massive amounts of VMs or your backups are consuming all bandwidth. If so, consider dedicated NICs for those roles and/or 10Gbps. Also note that you might be able to leverage your SAN for virtual machine deployment/backups.

VM Network: Use multiple “single” NICs or NIC teams to spread both the load and the risk. Remember that you can lose the host management or CSV network of a node without affecting your virtual machine connectivity, but not the virtual machine network(s). So don’t put all your eggs in one basket and do consider multiple NICs and NIC teaming. Do remember that there are other bottlenecks than bandwidth to a virtual machine running apps, so don’t go completely overboard, as there is no single magic bullet here for virtual machine performance. 2 or 3 will do perfectly fine. What about backups in the guest? Yes, that’s an extra burden, but there are better solutions than that, and if you hit a bandwidth issue with guest based backups it’s time to investigate those seriously. As you will see in this series I’m not a miser with NIC ports, but there’s no need to have one for every 2 virtual machines. If you have really high bandwidth needs consider 10Gbps, not a truckload of NIC ports.

Heartbeat: Due to its mostly moderate needs it is often combined with the CSV traffic.

Cluster Shared Volume (CSV): Well, you have the metadata updates for the cluster shared volumes. But that’s not all. You also have redirected access when you’re doing backups, when defragmenting your CSV storage, or when the storage paths are unavailable. So go for 10Gbps when you can, especially since this is your backup path for Live Migration traffic!

Side Note: Don’t say that redirected access over the CSV network will never happen when you have redundant storage paths. We’ve seen it happen in an environment with dual FC HBA cards, dual SAN controllers and the works. Redirected access saved our service availability during that event! What happened exactly and how it all ties together is a long and complicated story, but in essence an arbitrated loop management module went haywire and caused a loop; the root cause of this was a defective disk. When that event was over, one of the controllers went nuts, decided this wasn’t his cup of tea and called it a day. Guess what? Some servers could not fail over to the other controller as something went wrong in the internal workings of the SAN itself; dual HBAs didn’t help here. How did our services stay available? Thanks to redirected access. It was at 1Gbps speeds so that hurt a little, but we kept ‘m running. Our vendor worked through this with us, but things were pretty bad and it was pucker time. However, this is one example where we kept our services running for 24 hours (whilst working at the issue with the vendor) via redirected access. The bad thing was we needed to take the spare controller offline & restart both to get the replacement controller recognized. Yes, a complete shutdown of the cluster nodes to restart both SAN controllers. I still remember the mail I sent and the call I made to management that I was shutting down the business for 30 minutes. But it was not because of Hyper-V; quite the opposite, it helped us out a lot!

Also note that when you run software VSS based backups or disk defragmentation on your CSV storage you’ll be running in redirected access mode. Also see “Some Feedback On How To Defrag A Hyper-V R2 Cluster Shared Volume” at http://workinghardinit.wordpress.com/2011/06/02/some-feedback-on-how-to-defrag-a-hyper-v-r2-cluster-shared-volume/.

Live Migration: The bigger and better the pipe the faster Live Migration gets done. With high density or resource (memory) intensive servers this becomes a lot more important. Think of SQL Server, Exchange consuming 16, 24, 32 or more GB of memory. So do consider 10Gbps.

iSCSI: As we are using Fiber Channel in our SAN we did not include iSCSI in the networking needs table above. Now I do want to draw your attention to the need for iSCSI in the virtual machines themselves. This is needed for clustering within the virtual machines. Today this is almost a requirement, as clustering in the guest becomes more and more important. You’ll need at least two NIC ports in production for this, if possible on two separate cards for ultimate redundancy. Now as a best practice we won’t share the iSCSI NICs between the hosts and the guests. I do this in the lab but won’t have it in production. So that could mean at least two more NIC ports. With 10Gbps you’ll have ample performance, but depending on your IO needs you might want 4 ports if you’re using 1Gbps, so those NIC numbers are rising fast (see the quick tally after the table below).

| What | Function | Traffic | Connection Type |
| --- | --- | --- | --- |
| iSCSI guest | Virtual machine shared storage. | High bandwidth need; low latency is required to get good I/O. | Dedicated to Hyper-V |
| iSCSI host | Host shared storage. | High bandwidth need; low latency is required to get good I/O. | Excluded from the cluster, dedicated to the host. |
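Here’s that quick tally of NIC ports per host under the assumptions used in this series; the counts are illustrative, so adjust them to your own design:

```python
# Quick tally of NIC ports per host. Counts are illustrative and follow
# the assumptions in this post; adjust to your own design.
ports = {
    "host management": 1,
    "VM networks": 2,       # two single NICs or one team
    "CSV + heartbeat": 1,
    "Live Migration": 1,
    "iSCSI host": 2,        # redundant paths
    "iSCSI guests": 2,      # not shared with the host
}
total = sum(ports.values())
print(f"{total} NIC ports per host")  # 9, before any teaming for redundancy
```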

What to move to 10Gbps?

Cool, you think, let’s throw some 10Gbps NICs & switches into our network. After that, depending on the rest of your network equipment & components, your virtual machines might be able to talk to other virtual and physical servers on the network at speeds up to 10Gbps, or at least 1Gbps. I kind of hope that none of you are running 100Mbps in your server racks today. And last but not least, with your 10Gbps network you’ll be able to get the best performance for your CSV and Live Migration traffic. Life is good!

Until your network engineer hears about your plans. All of a sudden it’s not so cool anymore. You certainly woke the network people up! They’re nervous now they have seen all the double (redundancy) lines you’ve drawn on your copy of the schema representing the rack/server room network. They start mumbling things about redundancy, loops, RSTP, MSTP, LAG, stacking and a boatload of other acronyms that sound like you’ve heard ‘m before but can’t quite place. They also talk about doom and gloom scenarios that might very well bring down the network. So unless you are the network admin you should dust off your communication skills and get them on board. For your sake I hope they’re not the kind of engineers who state that most network problems can be solved by removing the servers and applications that ruin the nirvana of their network design. If so, they’ll be very wary of that “virtual switch” you’re talking about as well.

The Easy Way Out – A Dedicated CSV & Live Migration Network

Let’s say that you need a lot more time to get a fully integrated solution for the 10Gbps network architecture figured out and set up. But your manager states you need to improve the Live Migration and other cluster network speeds today. What are your options? Based on the above information your boss is right: the networks that will benefit the most from a move to 10Gbps are CSV and Live Migration (and Heartbeat, which piggybacks along with CSV). Now you have to remember that those cluster networks (subnets/VLANs) are for the heartbeat, CSV and Live Migration cluster traffic only. So basically the only requirements you have are that these run on separate subnets/VLANs (to present them as distinct networks to your failover cluster) and that every node of the cluster can communicate over those subnets/VLANs. This means that you can leave the switches for those networks completely isolated from the rest of the network, as shown in the picture below. I used some very common and often used DELL PowerConnect switches (5424, 6248, 8024F) in some scenario drawings for this blog series. They could make that 8024F an unbeatable price/quality deal if they would make it stackable. The sweet thing about stackable switches is that you can do active-active NIC teaming across switches rather than active-passive. I never went that way as I’m waiting to see what virtual switch innovations Hyper-V 3.0 will bring us. You see, I’m a little cheap after all.

But naturally, feel free to think about these scenarios with your preferred ProCurve, CISCO, Juniper, NetGear … switches in mind.

[Figure: two 10Gbps switches dedicated to the CSV & Live Migration networks, completely isolated from the rest of the network]

Suddenly things are cool again. The network people get time to figure out an integrated & complete long term solution and you can provide your nodes with 10Gbps for cluster-only traffic. Buy a couple of 10Gbps switches & NICs and you’re on your way. Is this a good idea? I can’t make that call for you. I just provide some ideas. You decide.

The Case For Physically Isolating Them

Now you might wonder if this isn’t very wasteful in resources. Well, not necessarily. If your cluster is big enough, let’s say 12-16 nodes, or if you have a couple of clusters (4 clusters with 5 nodes, for example), this might not be overly expensive. Unless you’re on a converged network, you do (I hope) the same for your storage networks, isolate them that is. You have to when you’re using fiber, and you’d better do it when using iSCSI. It provides for the best performance and less complex switch configurations. Remember, I mentioned that high availability requires some complexity. Try to keep that complexity as low as possible, and when you introduce complexity, make sure you can manage it. This serves two purposes. One is making sure that the complexity doesn’t ruin your high availability, and two is that you’ll be happy you did it when it comes to troubleshooting and fixing issues.

Now you might say that this ruins the concept of converged networks. Academically this is true, but when you are filling up ports on switches for a single purpose there is no room for anything else anyway. Don’t lose sight of the aim of a converged network: to have the ability to use the same hardware/technology, when possible, for multiple needs. This gives you options and capabilities where and when needed. It’s not about always using all technology and protocols on each and every switch. Don’t forget that you’ll also need to address QoS/performance on a converged network per type of traffic. There is also the fact that in brownfield scenarios you’re dealing with replacing a part of the infrastructure, and this example is a good way to get 10Gbps where needed without making any change to the existing network infrastructure. This reduces risk and impact. As a matter of fact, if you plan this right you can do this without service interruption. That means going node by node (maintenance mode, evacuate all VMs), moving the CSV network first for example, and only then the Live Migration network. You’re leveraging the ability of the cluster networks to take on each other’s role here to achieve this.

Another good reason to physically isolate the networks is security. There was an exploit for manipulating VMs during live migrations in 2008 (http://www.eecs.umich.edu/techreports/cse/2007/CSE-TR-539-07.pdf). You can protect against this via very careful switch configuration and VLAN design. But isolating the switches is very easy, clean and effective as well. Overkill? I don’t know, but perhaps not if you do work for intelligence agencies.

Ethernet Out-of-Band (OOB) Port For Management

Don’t forget that you still need to be able to manage those switches, but today, in this class of equipment, you get an Ethernet Out-of-Band (OOB) port for that. This one you can safely uplink to your regular management network. So if you really don’t need communication with the rest of the network, you have no functional reason not to isolate them.

Money, Cost? No, Value!

Still, you think, isn’t this very expensive? Well, look at the purpose. Manageable complexity, high availability, and your management stated to eliminate, where possible, any limitation on performance and approved the budget for it all. Put this into perspective. The SQL Server Datacenter editions running on these clusters, combined with the cost of development & maintenance of the databases and applications relying on this infrastructure, put those extra € spent on a couple of switches really into perspective. On top of that you’re not wasting those switches. When the network people get their plans finished they’ll be integrated into the final solution, if still needed and possible. Don’t forget that you might use all ports for just cluster traffic depending on the number of hosts you have! So even without integrating them into the rest of the network, you’re still getting very solid results. On top of that, sometimes you get to build solutions where budget is not the first, last and only concern. Sweet! I do know some people who’ll call me a money wasting nutcase. But get real: when you’re building highly available, high performing failover clusters and you’re in a discussion about the cost of a couple of NIC ports, and you are going to adjust your design over that, perhaps you have a sponsorship issue. Put it into perspective. Hyper-V clusters are not a competition where the one who uses the fewest NIC ports/cards and switch ports/switches wins. That’s why it hurts when I see designs like this claiming victory:

[Figure: a minimal design that skimps on NICs, NIC ports and switch ports]

What I want to see is more like this:

[Figure: a design with redundant NICs, NIC ports and switches per network]

But that will never fit into a blade design! Really? Have you seen blades like the DELL M910? It’s a beast, comparable to the R810. It was the first blade I really felt like buying. Cisco also entered that market with guns drawn and is pushing HP to keep performing. So again, put the NIC/switch and NIC port : switch port count into perspective against what you’re trying to achieve. To quote Anton Ego: “… you know what I’m craving? A little perspective, that’s it. I’d like some fresh, clear, well-seasoned perspective.”