I Can’t Afford 10GBps For Hyper-V And Other Lies


You’re wrong

There, I said it. Sure you can. Don’t think you need to be a big data center to make this happen. You just need to think and work outside the box a bit and when you’re not a large enterprise, that’s a bit more easy to do. Don’t do it like a big name brand, traditionalist partner would do it (strip & refit the entire structural cabling in the server room, high end gear with big margins everywhere). You’re going for maximum results & value, not sales margins and bonuses.

I would even say you can’t afford to stay on 1Gbps much longer or you’ll be dealing with the fall out of being stuck in the past. Really some of us are already look at > 10Gbps connections to the servers, actually. You need to move from 1Gbps or you’ll be micro managing a way around issues sucking all the fun out of your work with ever diminishing results and rising costs for both you and the business.

Give your Windows Server 2012R2 Hyper-V environment the bandwidth it needs to shine and make the company some money. If all you want to do is to spent as little money as possible I’m not quite sure what your goal is? Either you need it or you don’t.  I’m convinced we need it. So we must get it. Do what it takes. Let me show you one way to get what you need.

Sounds great what do I do?

Take heart, be brave and of good courage! Combine it with skills, knowledge & experience to deliver a 10Gbps infrastructure as part of ongoing maintenance & projects. I just have to emphasize that some skills are indeed needed, pure guts alone won’t do it.

First of all you need to realize that you do not need to rip and replace your existing network infrastructure. That’s very hard to get approval for, takes too much time and rapidly becomes very expensive in both dollars and efforts. Also, to be honest, quiet often you don’t have that kind of pull. I for one certainly do not. And if I’d try to do that way it takes way too many meetings, diplomacy, politics, ITIL, ITML & Change Approval Board actions to make it happen. This adds to the cost even more, both in time and money. So leave what you have in place, for this exercise we assume it’s working fine but you can’t afford to have wait for many hours while all host drains in 6 node cluster and you need to drain all of them to add memory. So we have a need (OK you’ll need a better business case than this but don’t make to big a deal of it or you’ll draw unwanted attention) and we’ve taking away the fear factor of fork lift replacing the existing network which is a big risk & cost.

So how do I go about it?

Start out as part of regular upgrades, replacement or new deployments. The money is their for those projects. Make sure to add some networking budget and leverage other projects need to support the networking needs.

Get a starter budget for a POC of some sort, it will get your started to acquire some more essential missing  bits.

By reasonably cheap switches of reasonable port count that do all you need. If they’re readily available in a frame work contract, great. You can get it as part of the normal procedures. But if you want to nock another 6% to 8% of the cost order them directly from the vendor. Cut out the middle man.

Buy some gear as part of your normal refresh cycle. Adapt that cycle life time a bit to suit your needs where possible. Funding for operation maintenance & replacement should already be in place right?

Negotiate hard with your vendor. Listen, just like in the storage world, the network world has arrived at a point where they’re not going to be making tons of money just because they are essential. They have lots of competition and it’s only increasing. There are deals to be made and if you chose the right hardware it’s gear that won’t lock you into proprietary cabling, SPF+ modules and such. Or not to much anyway Smile.

Design options and choices

Small but effective

If you’re really on minimal budget just introduce redundant (independent) stand alone 10Gbps switches for the East-West traffic that only runs between the nodes in the data center. CSV, Live Migration, backup. You don’t even need to hook it up to the network for data traffic, you only need to be able to remotely manage it and that’s what they invented Out Off Band (OOB) ports for. See also an old post of mine Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4). In the smallest cheapest scenario I use just 2 independent switches. In the other scenario build a 2 node spine and the leaf. In my examples I use DELL network gear. But use whatever works best for your needs and your environment. Just don’t go the “nobody ever got fired for buying XXX” route, that’s fear, not courage! Use cheaper NetGear switches if that fits your needs. Your call, see my  recent blog post on this 10Gbps Cheap & Without Risk In Even The Smallest Environments.

Medium sized excellence

First of all a disclaimer: medium sized isn’t a standardized way of measuring businesses and their IT needs. There will be large differences depending on you neck of the woods Smile.

Build your 10Gbps infrastructure the way you want it and aim it to grow to where it might evolve. Keep it simple and shallow. Go wide where you need to. Use the Spine/Leaf design as a basis, even if what you’re building is smaller than what it’s normally used for. Borrow the concept. All 10Gbps traffic, will be moving within that Spine/Leaf setup. Only client server traffic will be going out side of it and it’s a small part of all traffic. This is how you get VM mobility, great network speeds in the server room avoiding the existing core to become a bandwidth bottleneck.

You might even consider doing Infiniband where the cost/Gbps is very attractive and it will serve you well for a long time. But it can be a hard sell as it’s “another technology”.

Don’t panic, you don’t need to buy a bunch of Nexus 7000’s  or Force10 Z9000 to do this in your moderately sized server room. In medium sized environment I try to follow the “Spine/Leaf” concept even if it’s not true ECMP/CLOSS, it’s the principle. For the spine choose the switches that fit your size, environment & growth. I’ve used the Force10 S4810 with great success and you can negotiate hard on the price. The reasons I went for the higher priced Force10 S4810 are:

  • It’s the spine so I need best performance in that layer so that’s where I spend my money.
  • I wanted VLT, stacking is a big no no here. With VLT I can do firmware upgrades without down time.
  • It scales out reasonably by leveraging eVLT if ever needed.

For the ToR switches I normally go with PowerConnect 81XX F series or the N40XXF series, which is the current model. These provide great value for money and I can negotiate hard on price here while still getting 10Gbps with the features I need. I don’t need VLT as we do switch independent NIC teaming with Windows. That gives me the best scalability wit DVMQ & vRSS and allows for firmware upgrades without any network down time in the rack. I do sacrifice true redundant LACP within the rack but for the few times I might really need to have that I could go cross racks & still maintain a rack a failure domain as the ToRs are redundant. I avoid stacking, it’s a single point of failure during firmware upgrades and I don’t like that. Sure I can could leverage the rack a domain of failure to work around that but that’s not very practical for ordinary routine maintenance. The N40XXF also give me the DCB capabilities I need for SMB Direct.

Hook it up to the normal core switch of the existing network, for just the client/server.(North/South) traffic. I make sure that any VLANs used for CSV, live migration, can’t even reach that part of the network.  Even data traffic (between virtual machines, physical servers) goes East-West within your Spine/Leave and never goes out anyway unless you did something really weird and bad.

As said, you can scale out VLT using eVLT that creates a port channel between 2 VLT domains. That’s nice. So in a medium sized business you’re pretty save in growth. If you grow beyond this, we’ll be talking about a way larger deployment anyway and true ECMP/CLOS and that’s not the scale I’m dealing with where. For most medium sized business or small ones with bigger needs this will do the job. ECMP/CLOS Spine/leaf actually requires layer 3 in the design and as you might have noticed I kind if avoid that. Again, to get to a good solution today instead of a real good solution next year which won’t happen because real good is risky and expensive. Words they don’t like to hear above your pay grade.

The picture below is just for illustration of the concept. Basically I normally have only one VLT domain and have two 10Gbps switches per rack. This gives me racks as failure domains and it allows me to forgo a lot of extra structural cabling work to neatly provide connectivity form the switches  to the server racks .image

You have a  scalable, capable & affordable 10Gbps or better infrastructure that will run any workload in style.. After testing you simply start new deployments in the Spine/Leaf and slowly mover over existing workloads. If you do all this as part of upgrades it won’t cause any downtime due to the network being renewed. Just by upgrading or replacing current workloads.

The layer 3 core in the picture above is the uplink to your existing network and you don’t touch that. Just let if run until there nothing left in there and you can clean it up or take it out. Easy transition. The core can be left in place or replaces when needed due to age or capabilities.

To keep things extra affordable

While today the issues with (structural) 10Gbps copper CAT6A and NICs/Switches seem solved, when I started doing 10Gbps fibre cabling of Copper Twinax Direct Attach was the only way to go. 10GBaseT wasn’t an option yet and I still love the flexibility of fibre, it consumes less space and weighs less then CAT6A. Fibre also fits easily in existing cable infrastructure. Less hassle. But CAT6A will work fine today, no worries.

If you decide to do fibre, buy OM3, you can get decent, affordable cabling on line. Order it as consumable supplies.

Spend some time on the internet and find the SFP+ that works with your switches to save a significant amount of money. Yup some vendor switches work with compatible non OEM branded SPF+ modules. Order them as consumable supplies, but buy some first to TEST! Save money but do it smart, don’t be silly.

For patch cabling 10Gbps Copper Twinax Direct Attach works great for short ranges and isn’t expensive, but the length is limited and they get thicker & more sturdy and thus unwieldy by length. It does have it’s place and I use them where appropriate.

Isn’t this dangerous?

Nope. Technology wise is perfectly sound and nothing new. Project wise it delivers results, fast, effective and without breaking the bank. Functionally you now have all the bandwidth you need to stop worrying and micromanaging stuff to work around those pesky bandwidth issues and focus on better ways of doing things. You’ve given yourself options & possibilities. Yay!

Perhaps the approach to achieve this isn’t very conventional. I disagree. Look, anyone who’s been running projects & delivering results knows the world isn’t that black and white. We’ve been doing 10Gbps for 4 years now this way and with (repeated) great success while others have to wait for the 1Gbps structural cabling to be replaced some day in the future … probably by 10Gbps copper in a 100Gbps world by the time it happens. You have to get the job done. Do you want results, improvements, progress and success or just avoid risk and cover your ass? Well then, choose & just make it happen. Remember the business demands everything at the speed of light, delivered yesterday at no cost with 99.999% uptime.  So this approach is what they want, albeit perhaps not what they say.

10Gbps Cheap & Without Risk In Even The Smallest Environments


Over the last 18 months cheaper, commodity, small port count, but high quality 10Gbps switches have become available. NetGear is a prime example. This means 10Gbps networking is within reach for even the smallest deployments.

Size is an often used measure for technological needs like storage, networking and compute but in many cases it’s way too blunt of a tool. A lot of smaller environments in specialized niches need more capable storage  and networking capacities than their size would lead you to believe. The “Enterprise level” cost associated with the earlier SPF+ based swithes was an obstacle especially since the minimum port count lies around 24 ports, so with switch redundancy this already means 2 *24 ports.  Then there’s the cost of vendor branded SPF+ modules. But that could be offset with Copper Twinax Direct Attach cabling (which have their sweet spots for use) or finding functional cheaper non branded SFP+ modules. But all that isn’t an issue anymore. Today 10GBase-T card & switches are readily available and ready for prime time. The issues with power consumption and heat have been dealt with.

While vendors like DELL have done some amazing work to bring affordable 10Gbps switches to the market it remained a obstacle for many small environments. Now with the cheaper copper based, low port count switches it’s become a lot easier to introduce 10Gbps while taking away the biggest operational pains.

  • You can start with a lower number of 10Gbps ports (8-12) instead of  a minimum of 24.
  • No need for expensive vendor branded SPF+ modules.
  • Copper cabling (CAT6A) is relatively cheap for use in a rack or between two racks and for this kind of environment using patch lead cables isn’t an issue
  • Power consumption and heat challenges of copper 10Gbps has been addressed.

8port10Gbps

So even for the smallest setups where people would love to get 10Gbps for live migrations, hypervisor host backups and/or the virtual network it can be done now. If you introduce these for just CSV, live migration, storage or backup networks you can even avoid having to integrate them into the data network. This makes it easier, non disruptive & the isolation helps puts minds at easy about potential impacts of extra traffic and misconfigurations. Still you take away the heavy loads that might be disrupting your 1Gbps network, making things well again without needing further investments.

So go ahead, take the step and enjoy the benefits that 10Gbps bring to your (virtual) environment. Even medium sized shops can use this as a show case while they prepare for a 10Gbps upgrade for the server room or data center in the years to come.

Still Need To Optimizing Power Settings On DELL 12th Generation Servers For Lightning Fast Hyper-V Live Migrations?


Do you remember my blog from 2011 on optimizing some system settings to get way better Live Migration performance with 10Gbps NICs?  It’s over here Optimizing Live Migrations with a 10Gbps Network in a Hyper-V Cluster. This advice still holds true, but the power optimization settings & interaction between DELL Generation 12 Server and with Windows Server 2012 has improved significantly. Where with Windows Server 2008 R2 we could hardly get above 16% bandwidth consumption out of the box with Live Migration over a 10Gbps NIC today this just works fine.

Don’t believe me?image You do now? A cool Winking smile

For overall peak system performance you might want to adjust your Windows configuration settings to run the High Performance preferred power plan, if that’s needed.image

You might do no longer need to dive into the BIOS. Of cause if you have issues because your hardware isn’t that intelligent and/or are still running Windows 2008 R2 you do want to there. As when it comes to speed we want it all and we want it now Smile and than you still want to dive into the BIOS and tweak it even on the DELL 12th Generation hardware. Test & confirm I’d say but you should notice a difference, all be it smaller than with Windows Server 2008 R2.

Well let’s revisit this again as we are now no longer working with Generation 10 or 11 servers with an “aged” BIOS. Now we have decommissioned the Generation 10 server,  upgrade the BIOS of our Generation 11 Servers and acquired Generation 12 servers. We also no use UEFI for our Hyper-V host installations. The time has come to become familiar with those and the benefits they bring. It also future proofs our host installations.

So where and how do I change the power configuration settings now? Let’s walk through one together. Reboot your server and during the boot cycle hit F2 to enter System Setup.image

Select System BIOSimage

Click on System Profile Settingsimage

The settings you want to adapt are:

  • CPU Power Management should be on Maximum Performance
  • Setting Memory Frequency to “Maximum Performance”
  • C1E states should be disabled
  • C states should be disabled

image

That’s it. The below configuration has optimized your power settings on a DELL Generation 12 server like the R720.image

When don, click “Back” and than Finish. A warning will pop up and you need to confirm you want to safe your changes. Click “Yes” if you indeed want to do this.image

You’ll get a nice confirmation that your settings have been saved. Click “OK” and then click Finish.image

Confirm that you want to exit and reboot by clicking yes and voila, when the server comes back on it will be running a full speed at the cost of more power consumption, extra generated heat and cooling.image

Remember, if you don’t need to run at full power, don’t. And if you consider using  Dynamic Optimization and Power Optimization in System Center Virtual Machine Manager 2012. Save a penguin!

Complete VM Mobility Across The Data Center with SMB 3.0, RDMA, Multichannel & Windows Server 2012 (R2)


Introduction

The moment I figured out that Storage Live Migration (in certain scenarios) and Shared Nothing Live Migration leverage SMB 3.0 and as such Multichannel and RDMA in Windows Server 2012 I was hooked. I just couldn’t let go of the concept of leveraging RDMA for those scenarios.  Let me show you the value of my current favorite network design for some demanding Hyper-V environments. I was challenged a couple of time on the cost/port of this design which is, when you really think of it, a very myopic way of calculating TCO/ROI. Really it is. And this week at TechEd North America 2013 Microsoft announced that all types of Live Migrations support Multichannel & RDMA (next to compression) in Windows Server 2012 R2.  Watch that in action at minute 39 over here at Understanding the Hyper-V over SMB Scenario, Configurations, and End-to-End Performance. You should have seen the smile on my face when I heard that one! Yes standard Live Migration now uses multiple NIC (no teaming) and RDMA for lightning fast  VM mobility & storage traffic. People you will hit the speed boundaries of DDR3 memory with this! The TCO/ROI of our plans just became even better, just watch the session.

So why might I use more than two 10Gbps NIC ports in a team with converged networking for Hyper-V in Windows 2012? It’s a great solution for sure and a combined bandwidth of 2*10Gbps is more than what a lot of people have right now and it can handle a serious workload. So don’t get me wrong, I like that solution. But sometimes more is asked and warranted depending on your environment.

The reason for this is shown in the picture below. Today there is no more limit on the VM mobility within a data center. This will only become more common in the future.

image

This is not just a wet dream of virtualization engineers, it serves some very real needs. Of cause it does. Otherwise I would not spend the money. It consumes extra 10Gbps ports on the network switches that need to be redundant as well and you need to have 10Gbps RDMA capable cards and DCB capable switches.  So why this investment? Well I’m designing for very flexible and dynamic environments that have certain demands laid down by the business. Let’s have a look at those.

The Road to Continuous Availability

All maintenance operations, troubleshooting and even upgraded/migrations should be done with minimal impact to the business. This means that we need to build for high to continuous availability where practical and make sure performance doesn’t suffer too much, not noticeably anyway. That’s where the capability to live migrate virtual machines of a host, clustered or not, rapidly and efficiently with a minimal impact to the workload on the hosts involved comes into play.

Dynamics Environments won’t tolerate downtime

We also want to leverage our resources where and when they are needed the most. And the infrastructure for the above can also be leveraged for that. Storage live migration and even Shared Nothing Live Migration can be used to place virtual machine workloads where they are getting the resources they need. You could see this as (dynamically) optimizing the workload both within and across clusters or amongst standalone Hyper-V nodes. This could be to a SSD only storage array or a smaller but very powerful node or cluster in regards to CPU, memory and Disk IO. This can be useful in those scenarios where scientific applications, number crunching or IOPS intesive  software or the like needs them but only for certain times and not permanently.

Future proofing for future storage designs

Maybe you’re an old time fiber channel user or iSCSI rules your current data center and Windows Server 2012 has not changed that. But that doesn’t mean it will not come. The option of using a Scale Out File Server and leverage SMB 3.0 file shares to providing storage for Hyper-V deployments is a very attractive one in many aspects. And if you build the network as I’m doing you’re ready to switch to SMB 3.0 without missing a heart beat. If you were to deplete the bandwidth x number of 10Gbps can offer, no worries you’ll either use 40Gbps and up or Infiniband. If you don’t want to go there … well since you just dumped iSCSI or FC you have room for some more 10Gbps ports Smile

Future proofing performance demands

Solutions tend to stay in place longer than envisioned and if you need some long levity and a stable, standard way of doing networking, here it is. It’s not the most economical way of doing things but it’s not as cost prohibitive as you think. Recently I was confronted again with some of the insanities of enterprise IT. A couple of network architects costing a hefty daily rate stated that 1Gbps is only for the data center and not the desktop while even arguing about the cost of some fiber cable versus RJ45 (CAT5E). Well let’s look beyond the North – South traffic and the cost of aggregating band all the way up the stack with shall we? Let me tell you that the money spent on such advisers can buy you in 10Gbps capabilities in the server room or data center (and some 1Gbps for the desktops to go) if you shop around and negotiate well. This one size fits all and the ridiculous economies of scale “to make it affordable” argument in big central IT are not always the best fit in helping the customers. Think  a little bit outside of the box please and don’t say no out of habit or laziness!

Conclusion

In some future blog post(s) we’ll take a look at what such a network design might look like and why. There is no one size fits all but there are not to many permutations either. In our latest efforts we had been specifically looking into making sure that a single rack failure would not bring down a cluster. So when thinking of the rack as a failure domain we need to spread the cluster nodes across multiple racks in different rows. That means we need the network to provide the connectivity & capability to support this, but more on that later.

SMB Direct RoCE Does Not Work Without DCB/PFC


Introduction

SMB Direct RoCE Does Not Work Without DCB/PFC. “Yes”, you say, “we know, this is well documented. Thank you.” but before you sign of hear me out.

Recently I plugged to RoCE cards into some test servers and linked them to a couple of 10Gbps switches. I did some quick large file copy testing and to my big surprise RDMA kicked in with stellar performance even before I had installed the DCB feature, let alone configure it. So what’s the deal here. Does it work without DCB? Does the card fail back to iWarp? Highly unlikely. I was expecting it to fall back to plain vanilla 10Gbps and not being used at all but it was. A short shout out to Jose Barreto to discuss this helped clarify this.

DCB/PFC is a requirement RoCE

The more busy the network gets the faster the performance will drop. Now in our test scenario we had two servers  for a total of 4 RoCE ports on the network consisting of a beefy 48 port 10Gbps switches. So we didn’t see the negative results of this here.

DCB (Data Center Bridging) and Priority Flow Control are considered a requirement for any kind of RoCE deployment. RDMA with RoCE operates at the Ethernet layer. That means there is no overhead from TCP/IP, which is great for performance. This is the reason you want to use RDMA actually. It also means it’s left on it’s own to deal with Ethernet-level collisions and errors. For that it needs DCB/PFC other wise you’ll run into performance issues due to a ton of retries at the higher network layers.

The reason that iWarp doesn’t require DCB/PCF is that it works at the TCP/IP level also offloaded by using a TCP/IP stack on the NIC instead of the OS. So errors are handled by TCP/IP at a cost: iWarp results in the same benefits as RoCE but it doesn’t scale that well. Not that iWarp performance is lousy, far form! Mind you, for bandwidth management reasons,you’d be better of using DCB or some form of QoS as well.

Conclusion

So no, not configuring  DCB on your servers and the switches isn’t an option, but apparently it isn’t blocked either so beware of this. It might appear to be working fine but it’s a bad idea. Also don’t think it defaults back to iWarp mode, it doesn’t, as one card does one thing not both. There is no shortcut. RoCE RDMA does not work error free out of the box so you do have the install the DCB feature and configure it together with the switches.

SMB 3.0 Multichannel Auto Configuration In Action With RDMA / SMB Direct


Most of you might remember this slide by Jose Barreto on SMB Multichannel  Auto Configuration in one of his many presentations:image

  • Auto configuration looks at NIC type/speed => Same NICs are used for RDMA/Multichannel (doesn’t mix 10Gbps/1Gbps, RDMA/non-RDMA)
  • Let the algorithms work before you decide to intervene
  • Choose adapters wisely for their function

You can fine tune things if and when needed (only do this when this is really the case) but let’s look at this feature in action.

So let’s look at this in real life. For this test we have 2 * X520 DA 10Gbps ports using 10.10.180.8X/24 IP addresses and 2 * Mellanox  10Gbps RDMA adaptors with 10.10.180.9X/24 IP addresses. No teaming involved just multiple NIC ports. Do not that these IP addresses are on different subnet than the LAN of the servers. Basically only the servers can communicate over them, they don’t have a gateway, no DNS servers and are as such not registered in DNS either (live is easy for simple file sharing).

image

Let’s try and copy a 50Gbps fixed VHDX file from server1 to server2 using the DNS name of the target host (pixelated), meaning it will resolve to that host via DNS and use the LAN IP address 10.10.100.92/16 (the host name is greyed out). In the below screenshot you see that the two RDMA capable cards are put into action. The servers are not using  the 1Gbps LAN connection. Multichannel looked at the options:

  • A 1Gbps RSS capable Link
  • Two 10Gbps RSS capable Links
  • Two 10Gbps RDMA capable links

Multichannel concluded the RDMA card is the best one available and as we have two of those it use both. In other words it works just like described.

image

Even if we try to bypass DNS and we copy the files explicitly via the IP address (10.10.180.84)  assigned to the Intel X520 DA cards Multichannel intelligence detects that it has two better cards  that provide RDMA available and as you can see it uses the same NICs  as in the demo before.  Nifty isn’t it Smile

 image

If you want to see the other NICs in action we can disable the Mellanox card and than Multichannel will choose the two X520 DA cards. That’s fine for testing but in real life you need a better solution when you need to manually define what NICs can be used. This is done using PowerShell Smile (take a look at Jose Barrto’s blog The basics of SMB PowerShell, a feature of Windows Server 2012 and SMB 3.0  for more info).

New-SmbMultichannelConstraint –ServerName SERVER2 –InterfaceAlias “SLOT 6 Port 1”, “SLOT 6 Port 2”

This tells a server it can only use these two NICs which in this example are the two Intel X520 DA 10Gbps cards to access Server2. So basically you configure/tell the client what to use for SMB 3.0 traffic to a certain server. Note the difference in send/receive traffic between RDMA/Native 10Gbps.

On Server1, the client you see this:

image

On Server2, the server you see this:

image

Which is indeed the constraint set up as we can verify with:

Get-SmbMultichannelConstraint

image

We’re done playing so let’s clean up all the constraints:

Get-SmbMultichannelConstraint | Remove-SmbMultichannelConstraint

image

Seeing this technology it’s now up to the storage industry to provide the needed  capacity and IOPS I a lot more affordable way. Storage Spaces have knocked on your door, that was the wake up call Winking smile. In an environment where we throw lots of data around we just love SMB 3.0

DELL PowerConnect 8024F Is Now Stackable


A colleague pointed me the latest firmware update (4.2.0.4) for the DELL PowerConnect 8024F switches. As I was reading the release notes one item in particular caught my attention. The PowerConnect 8024/8024f/M8024-k switches are now stackable. You can put up to 6 switches in one stack using the regular front ports (SFP+). You might remember form a previous blog post on 10Gbps, Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4), where I mentioned that we got a great deal on those switches. I also mentioned that the only thing lacking in these switches and what would make this the best 10Gbps switch when comparing value for money is the ability to stack them. I quote myself:

“They could make that 8024F an unbeatable price/quality deal if they would make them stackable.”

I’ve been called visionary before but I won’t go into that that insider joke right now Winking smile. Now it’s for sure that not just my little blog post that made this update happen but it is a nice New Year’s gift. More features & options with hardware you already own is always nice. So I guess a lot of people have made the same observation, both customers & DELL themselves. You could just “smell” by the available command & configuration that this switch could be made stackable and they did.

Is Ethernet based stacking perfect? No (there is very little perfection in this world). The biggest drawback, if you need that feature,  is the fact that you can hot plug the stacking links. But for all other practical purposes it’s a nice deal. Why? Well, now that these switches supports Ethernet based stacking you will be able to choose more types of NIC Teaming to use for your servers. That means those teaming configurations that are dependent on stacking, such as for active-active NIC Teaming across two switches to be more precise. I find this pretty good news.

You all know I’m very enthusiastic to use the NIC Teaming build into Windows 8 and I will use it where and when I can. But there will be for many years to come a lot of Windows 2008 R2 systems to support and install. So it’s always good to see your hardware vendors improving their gear to give you more options. For the pricing I got on the 8024F in the last project and the needs of the solution we could deal with not being able to stack. Stacking via Ethernet using other switches was way more expensive, not even to mention the ones using stacking module ones (real premium pricing). So we got the best deal for our needs.

For 10Gbps switches stacking over Ethernet give you up to 80Gbps with a maximum of 8 uplinks so bandwidth is not as much a concern. With 1Gbps switches it is, which makes stacking modules the only way to go there I think. If you need massive bandwidth and you probably do. The drawback, as with all forms of inter switch links (a LAG for example) is that this method means you’re losing ports for other purposes. But you need to look at your needs and do the math. I think buying with investment protection is good but don’t always buy in preparation for the time you’ll become a fortune 500 company. That takes a while and in the mean while you’ll be very well served anyway.

Another related feature that’s new is Nonstop Forwarding (NSF). NSF allows the forwarding plane of stack units to continue forwarding packets even while the control and management planes restart. This could be a power failure, some hardware of software error or even an upgrade. This feature is common to all stackable switches as far as I know and is needed. Not that ‘m saying the redundant loop in stack is bad or overkill, far from it, but that takes care of other scenarios that NSF is designed to handle.

Carsten Rachfahl interviews me on 10Gbps networking with Hyper-V


I was in Frankfurt Germany last week at The Experts Conference 2011 Europe (http://www.theexpertsconference.com/europe/2011/). I met up with a lot of great IT professionals from the on line community as you can read in a previous blog post . The video interview that was mentioned in that post is now on line. It’s a very good quality one and Carsten Rachfahl made an excellent interviewer who manages to make the entire process both educational and enjoyable.

The results are available for all to view right here on Carsten’s site (in German) where you can also find his other blog posts http://www.hyper-v-server.de/videos/videointerview-mit-didier-van-hoye-zu-hyper-v-und-10gbit/

Optimizing Live Migrations with a 10Gbps Network in a Hyper-V Cluster


Introduction

You’ll find the following recommendations on line about optimizing Live Migrations:

  1. Use bigger pipes (10Gbps is better than 1Gbps)
  2. Enable Jumbo Frames
  3. Up the Receive Buffer to 8192 (Exchange 2010 virtualization recommendation for Live Migration)

As we’ve been building Hyper-V Cluster since the early betas let me share some experiences with this. For the curious I used Intel® Ethernet X520 SFP+ Direct Attach Server Adapters & DELL PowerConnect 8024F 10Gbps switches for my testing. See my blog posts on considerations about the use of 10Gbps in Hyper-V clusters here:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

Bigger pipes are better

On bigger pipes I can only say that if you can afford them and need them you should get them. End of discussion.

Jumbo frames rock

Jumbo frames help out a lot (+/- 20 %), especially with the larger memory virtual machines.

The golden nugget

So far so good, but there is one golden nugget of information I want to share. There is little trip wire that can prevent you from getting your optimal performance. Advanced power settings in the BIOS. If you read my blogs you might have come across a blog post Consider CPU Power Optimization Versus Performance When Virtualizing and I encourage you to go and read that post as it holds a lot of good info but also is very relevant to this post. Because we have yet another reason to make sure your BIOS is set right to achieve a decent return on investment in quality hardware.

In our experience those power saving settings, the C states and the C1 states are also not very helpful when it comes to Live Migration & such. I got from a meager 20% bandwidth use all the way up to 35-45% at best with jumbo frames enabled and the power settings set to ”Full Power”. A lot better but still not very impressive.

clip_image002

Now go ahead and disable the C states AND the C1E state to achieve 55% to 65%.

clip_image004

Now the speed of a live migration varies greatly between virtual machines that are idle or running a full load, both CPU & memory wise. It also depends on the load the host you’re migrating from and to, but this impact is less when you disable those advance CPU power settings.

Look at the following screen shots

clip_image005

A SQL Server with 50GB of RAM being live migrated over 10Gbps. Jumbo frames enabled, Power Settings optimized but with C1E & C States enabled.

clip_image006

A SQL Server with 50GB of RAM being live migrated over 10Gbps. Jumbo frames enabled, Power Settings optimized but with C1E & C States disabled.

The live migration of this virtual SQL Server takes between 74-78 seconds. Not bad!

image

By the way these settings also help with 1Gbps but there is isn’t as spectacular. You use 99% instead of 75-80% of you bandwidth. And improvement yes, but not on the same scale as with 10Gbps for speeding up Live Migrations.

As you can see in this post on the TechNet support groups, this seems to be a common occurrence. It’s not just me who’s seeing things: Live Migration on 10GbE only 16%. even Dell chimed in there confirming these findings in their labs.

Receive Buffer

There is one setting that’s been advised for Exchange 2010 virtualization with Hyper-V that I have not seen improve speeds and that’s upping the Receive Buffer 8192. You can read this in Best Practices for Virtualizing Exchange Server 2010 with Windows Server® 2008 R2 Hyper V™. In some cases I tested this even reduces the results, especially when you have C1E & C states enabled. It is also a confusing recommendation as they state to set the Receive Buffer to 8192 .This value however is dependent on the NIC type and driver so you might only be able to set it to 4096 or so. The guidance should state to set it as high a possible but I have not seen any benefits. Do mind that I did not test this with a Hyper-V cluster running a virtualized Exchange 2010 guest. Your mileage may vary. Trust but verify is the age old adagio. Also keep in mind I’m running 10Gbps, so the effect of this setting might be not be what it could do for a 1Gbps network, but on the whole I’m not convinced. If you implement all other recommendations you’ll saturate a 1Gbps already.

What does this mean?

The sad news is that in virtual environments or other high performance configurations the penguins have to give way to performance. I wish it was different but unfortunately it isn’t.

By the way, this is vendor agnostic. You’ll see this with HP, DELL, CISCO in all form factors whether they are tower, rack or blade servers. The main thing you need to make sure is that the BIOS allows you to disable the C States en power settings. Not all vendors/BIOS version allow for this I read so make sure you check this. Some CISCO blades have annoying on this front, ruining the performance of VDI projects with less than optimal CPU performance but they have released an updated BIOS now to fix this.

Look, it makes no sense saving on power if it means you’ll by more servers to compensate for the lack of performance per unit. In my honest opinion a lot of all the hardware optimizations are awesome but they still have a long way to go in making sure it doesn’t incur such a hit even on performance. Right sizing servers in number & type of CPU, power supplies etc. still seems the best way to avoid waiting energy and money. Buying more power than needed and counting on the power consumption optimizations to reduce operating cost can be a good idea to protecting your investment for expected future increases in resource demand within the service life of your hardware. On average that is 3 to 5 years depending on the environment & needs.

Conclusion

Three things are needed for lightning fast Live Migrations:

  1. Bandwidth. Hence the 10Gbps network. There is no substitute for bigger pipes.
  2. Jumbo Frames. Configure them right & you’ll reap the benefits
  3. Disable C1E& C states. Also Configure your servers power options for maximum performance.
  4. I have not been able to confirm the receive buffer has a big impact on Live Migration speed or does any good at all. Test this to find out if it works for you

Remember that you’ll be able to do multiple Live Migrations in parallel with Windows 8. So a 10Gbps pipe will be used at full capacity then. Being able to use more networks for Live Migration will only increase the capability to evacuate a host fast or to move virtual machines for load balancing across a cluster. If you look at the RDMA, infiniband, 40/100Gbps evolutions becoming available in the next 12 to 36 months 10Gbps will become a lot more mainstream while at the same time the options for network connectivity will become more diversified. 10Gbps prices are dropping but for the moment they do remain high enough to keep people away.

Introducing 10Gbps & Integrating It into Your Network Infrastructure (Part 4/4)


This is a 4th post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

In my blog post “Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)” in a series of thoughts on 10Gbps and Hyper-V networking a discussion on NIC teaming brought up the subject of 10Gbps for virtual machine networks. This means our switches will probably no longer exist in isolation unless those virtual Machines don’t ever need to talk to anything outside what’s connected to those switches. This is very unlikely. That means we need to start thinking and talking about integrating the 10Gbps switches in our network infrastructure. So we’re entering the network engineers their turf again and we’ll need to address some of their concerns. But this is not bad news as they’ll help us prevent some bad scenarios.

Optimizing the use of your 10Gbps switches

As not everyone runs clusters big enough, or enough smaller clusters, to warrant an isolated network approach for just cluster networking. As a result you might want to put some of the remaining 10Gbps ports to work for virtual machine traffic. We’ve already pointed out that your virtual machines will not only want to talk amongst themselves (it’s a cluster and private/internal networks tend to defeat the purpose of a cluster, it just doesn’t make any sense as than they are limited within a node) but need to talk to other servers on the network, both physical and virtual ones. So you have to hook up your 10Gbps switches from the previous example to the rest of the network. Now there are some scenarios where you can keep the virtual machine networks isolated as well within a cluster. In your POC lab for example where you are running a small 100% virtualized test domain on a cluster in a separate management domain but these are not the predominant use case.

But you don’t only have to have to integrate with the rest of your network, you may very well want to! You’ve seen 10Gbps in action for CSV and Live Migration and you got a taste for 10Gbps now, you’re hooked and dream of moving each and every VM network to 10Gbps as well. And while your add it your management network and such as well. This is nothing different from the time you first got hold of 1Gbps networking kit in a 100 Mbps world. Speed is addictive, once you’re hooked you crave for more Smile

How to achieve this? You could do this by replacing the existing 1Gbps switches. That takes money, no question about it. But think ahead, 10Gbps will be common place in a couple of years time (read prices will drop even more). The servers with 10Gbps LOM cards are here or will be here very soon with any major vendor. For Dell this means that the LOM NICs will be like mezzanine cards and you decide whether to plug in 10Gbps SPF+ or Ethernet jacks. When you opt to replace some current 1Gbps switches with 10gbps ones you don’t have to throw them away. What we did at one location is recuperate the 1Gbps switches for out of band remote access (ILO/DRAC cards) that in today’s servers also run at 1Gbps speeds. Their older 100Mbps switches where taken out of service. No emotional attachment here. You could also use them to give some departments or branch offices 1gbps to the desktop if they don’t have that yet.

When you have ports left over on the now isolated 10Gbps switches and you don’t have any additional hosts arriving in the near future requiring CSV & LM networking you might as well use those free ports. If you still need extra ports you can always add more 10Gbps switches. But whatever the case, this means up linking those cluster network 10Gbps switches to the rest of the network. We already mentioned in a previous post the network people might have some concerns that need to be addressed and rightly so.

Protect the Network against Loops & Storms

The last thing you want to do is bring down your entire production network with a loop and a resulting broadcast storm. You also don’t want the otherwise rather useful spanning tree protocol, locking out part of your network ruining your sweet cluster setup or have traffic specifically intended for your 10Gbps network routed over a 1Gbps network instead.

So let us discuss some of the ways in which we can prevent all these bad things from happening. Now mind you, I’m far from an expert network engineer so to all CCIE holders stumbling on to this blog post, please forgive me my prosaic network insights. Also keep in mind that this is not a networking or switch configuration course. That would lead us astray a bit too far and it is very dependent on your exact network layout, needs, brand and model of switches etc.

As discussed in blog post Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4) you need a LAG between your switches as the traffic for the VLANs serving heartbeat, CSV, Live Migration, virtual machines but now also perhaps the host management and optional backup network must flow between the switches. As long as you have only two switches that have a LAG between them or that are stacked you have not much risk of creating a loop on the network. Unless you uplink two ports directly with a network cable. Yes, that happens, I once witnessed a loop/broadcast storm caused by someone who was “tidying up” spare CAT5E cables buy plugging all ends up into free port switches. Don’t ask. Lesson learned: disable every switch port not in use.

Now once you uplink those two or more 10Gbps switches to your other switches in a redundant way you have a loop. That’s where the Spanning Tree protocol comes in. Without going into detail this prevents loops by blocking the redundant paths. If the operational path becomes unavailable a new path is established to keep network traffic flowing. There are some variations in STP. One of them is Rapid Spanning Tree Protocol (RSTP) that does the same job as STP but a lot faster. Think a couple of seconds to establish a path versus 30 seconds or so. That’s a nice improvement over the early days. Another one that is very handy is the Multiple Spanning Tree Protocol (MSTP). The sweet thing about the latter is that you have blocking per VLANs and in the case of Hyper-V or storage networks this can come in quite handy.

Think about it. Apart from preventing loops, which are very, very bad, you also like to make sure that the network traffic doesn’t travel along unnecessary long paths or over links that are not suited for its needs. Imagine the Live Migration traffic between two nodes on different 10Gbps switches travelling over the 1Gbps uplinks to the 1Gbps switches because the STP blocked the 10Gbps LAG to prevent a loop. You might be saturating the 1Gbps infrastructure and that’s not good.

I said MSTP could be very handy, let’s address this. You only need the uplink to the rest of the network for the host management and virtual machine traffic. The heartbeat, CSV and Live Migration traffic also stops flowing when the LAG between the two 10Gbps switches is blocked by the RSTP. This is because RSTP works on the LAG level for all VLANs travelling across that LAG and doesn’t discriminate between VLAN. MSTP is smarter and only blocks the required VLANS. In this case the host management and virtual machines VLANS as these are the only ones travelling across the link to the rest of the network.

We’ll illustrate this with some picture based on our previous scenarios. In this example we have the cluster network going to the 10Gbps switches over non teamed NICs. The same goes for the virtual machine traffic but the NICs are teamed, and the host management NICS. Let’s first show the normal situation.

 clip_image002

Now look at a situation where RSTP blocks the purple LAG. Please do note that if the other network switches are not 10Gbps that traffic for the virtual machines would be travelling over more hops and at 1Gbps. This should be avoided, but if it does happens, MSTP would prevent an even worse scenario. Now if you would define the VLANS for cluster network traffic on those (orange) uplink LAGs you can use RSTP with a high cost but in the event that RSTP blocks the purple LAG you’d be sending all heartbeat, CSV and Live Migration traffic over those main switches. That could saturate them. It’s your choice.

clip_image004

In the picture below MSTP saves the day providing loop free network connectivity even if spanning tree for some reasons needs to block the LAG between the two 10Gbps switches. MSTP would save your cluster network traffic connectivity as those VLAN are not defined on the orange LAG uplinks and MSTP prevents loops by blocking VLAN IDs in LAGs not by blocking entire LAGs

clip_image006

To conclude I’ll also mention a more “star like” approach in up linking switches. This has a couple of benefits especially when you use stackable switches to link up to. They provide the best bandwidth available for upstream connections and they provide good redundancy because you can uplink the lag to separate switches in the stack. There is no possibility for a loop this way and you have great performance on top. What’s not to like?

clip_image008

Well we’ve shown that each network setup has optimal, preferred network traffic paths. We can enforce these by proper LAG & STP configuration. Other, less optimal, paths can become active to provide resiliency of our network. Such a situation must be addressed as soon as possible and should be considered running on “emergency backup”. You can avoid such events except for the most extreme situations by configuring the RSTP/MSTP costs for the LAG correctly and by using multiple inter switch links in every LAG. This does not only provide for extra bandwidth but also protects against cable or port failure.

Conclusion

And there you have it, over a couple of blog posts I’ve taken you on a journey through considerations about not only using 10Gbps in your Hyper-V cluster environments, but also about cluster networks considerations on a whole. Some notes from the field so to speak. As I told you this was not a deployment or best practices guide. The major aim was to think out loud, share thoughts and ideas. There are many ways to get the job done and it all depends on your needs an existing environment. If you don’t have a network engineer on hand and you can’t do this yourself; you might be ready by now to get one of those business ready configurations for your Hyper-V clustering. Things can get pretty complex quite fast. And we haven’t even touched on storage design, management etc.. The purpose of these blog was to think about how Hyper-V clustering networks function and behave and to investigate what is possible. When you’re new to all this but need to make the jump into virtualization with both feet (and you really do) a lot help is available. Most hardware vendors have fast tracks, reference architectures that have a list of components to order to build a Hyper-V cluster and more often than not they or a partner will come set it all up for you. This reduces both risk and time to production. I hope that if you don’t have a green field scenario but want to start taking advantage of 10Gbps networking; this has given you some food for thought.

I’ll try to share some real life experiences,what improvements we actually see, with 10Gbps speeds in a future blog post.