Virtualization with Hyper-V & The NUMA Tax Is Not Just About Dynamic Memory


First of all, to join in this little discussion you need to know what NUMA (Non-Uniform Memory Access) is and what it does. You can read up on it on the Intel (or AMD) web sites, for example http://software.intel.com/en-us/blogs/2009/03/11/learning-experience-of-numa-and-intels-next-generation-xeon-processor-i/ and http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa/. Also have a look at the following SQLskills blog post, http://www.sqlskills.com/blogs/jonathan/post/Understanding-Non-Uniform-Memory-AccessArchitectures-(NUMA).aspx, which has some great pictures to help visualize the concepts.

What Is It And Why Do We Care?

We all know that a CPU today contains multiple cores: 2, 4, 6, 8, 12, 16 and so on. So we tend to talk about a processor as the physical CPU that fits in a socket, and about cores as logical CPUs. When hyper-threading is enabled you double the number of logical processors seen and used. It is said that Hyper-V handles hyper-threading well, so you can leave it on, the logic being that it will never hurt performance and can help to improve it. I suggest you test it :-) as there was a performance bug with it once. A processor today contains its own memory controller, and access to memory from that processor is very fast. The NUMA node concept is older than multi-core processor technology, but today you can state that a NUMA node translates to one processor/socket and that all cores contained in that processor belong to the same NUMA node. Sometimes a processor contains two NUMA nodes, like the AMD 12-core processors. In the future, with the ever-increasing number of cores, we’ll perhaps see even more NUMA nodes per processor. You can state that all Intel processors since Nehalem with QuickPath Interconnect and AMD processors with HyperTransport are NUMA processors. But to be sure, check with your vendors before buying. Assumptions, right?
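To make the counting above concrete, here is a small Python sketch (my own illustration, not anything from Microsoft or the CPU vendors) that derives the number of logical processors and NUMA nodes from a host description. The `HostTopology` class and the two example hosts are hypothetical; the AMD example assumes a 12-core part without hyper-threading that is split into two NUMA nodes, as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostTopology:
    """Hypothetical description of a physical host, for illustration only."""

    sockets: int
    cores_per_socket: int
    hyperthreading: bool
    numa_nodes_per_socket: int = 1  # e.g. 2 for the AMD 12-core processors

    @property
    def logical_processors(self) -> int:
        # Hyper-threading doubles the number of logical processors Windows sees.
        threads_per_core = 2 if self.hyperthreading else 1
        return self.sockets * self.cores_per_socket * threads_per_core

    @property
    def numa_nodes(self) -> int:
        # Usually one NUMA node per socket, sometimes more than one per socket.
        return self.sockets * self.numa_nodes_per_socket


# Assumed example hosts, not specific product configurations.
intel_host = HostTopology(sockets=2, cores_per_socket=6, hyperthreading=True)
amd_host = HostTopology(sockets=2, cores_per_socket=12, hyperthreading=False,
                        numa_nodes_per_socket=2)

print(intel_host.logical_processors, intel_host.numa_nodes)  # 24 2
print(amd_host.logical_processors, amd_host.numa_nodes)      # 24 4
```

Both hosts expose 24 logical processors, but the AMD host splits its memory over twice as many NUMA nodes, which is exactly where remote memory access starts to matter.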

Beyond NUMA nodes there is also something called processor groups, which help Windows use more than 64 logical processors (its former limit). Logical processors are grouped into groups of up to 64, and Windows handles 4 such groups, so in total Windows today can support 4 × 64 = 256 logical processors. Because memory access within a NUMA node is a lot faster than between NUMA nodes, you can see where a potential performance hit is waiting to happen. I tried to create a picture of this concept below. Now you know why I don’t make my living as a graphic artist :-)
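A minimal sketch of the processor group arithmetic, assuming the simple model described above: at most 64 logical processors per group and at most 4 groups. The function name is hypothetical. Real Windows also tries to keep each NUMA node inside a single group, which this sketch ignores; it only counts the minimum number of groups needed.

```python
import math

LPS_PER_GROUP = 64  # a processor group holds at most 64 logical processors
MAX_GROUPS = 4      # number of groups cited in the post for Windows Server 2008 R2


def processor_groups_needed(logical_processors: int) -> int:
    """Simplified: minimum number of 64-LP groups for this many logical processors."""
    if logical_processors < 1:
        raise ValueError("a host needs at least one logical processor")
    groups = math.ceil(logical_processors / LPS_PER_GROUP)
    if groups > MAX_GROUPS:
        raise ValueError(
            f"{logical_processors} logical processors exceed the "
            f"{MAX_GROUPS * LPS_PER_GROUP} supported"
        )
    return groups


print(MAX_GROUPS * LPS_PER_GROUP)   # 256
print(processor_groups_needed(64))  # 1
print(processor_groups_needed(80))  # 2
```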


To make it very clear: NUMA is great and helps us in a lot of ways. But under certain conditions and with certain applications it can cause us to take a (serious) performance hit. And if anything is certain to ruin a system administrator’s day, it is a brand new server with a bunch of CPUs and loads of RAM that isn’t running any better (or is running worse?) than the one you’re replacing. Current hypervisors like Hyper-V are NUMA aware, and so are the better server applications like SQL Server. That means that under the hood they do their best to optimize CPU & memory usage for performance. They actually do a very good job and, depending on your environment, you might never, ever know of any issue or even of the existence of NUMA.

But even with a NUMA-aware hypervisor and NUMA-aware applications you run the risk of having to go to remote memory. The introduction of Dynamic Memory in Windows Server 2008 R2 SP1 even increases this likelihood, as there is a lot of memory reassignment going on. Dynamic Memory actually educated a lot of Hyper-V people on what NUMA is and what to look out for. Until Dynamic Memory came on the scene, along with Microsoft’s evangelizing of it, it was "only" the people virtualizing SQL Server, Exchange & other big, hungry applications who were very aware of NUMA, with its benefits and potential drawbacks. If you’re lucky the application is NUMA aware, but not all of them are, even among the big names.

A Peek Into The Future

What is interesting for this discussion is that leaked screenshots of Hyper-V 3.0 or vNext … show NUMA configuration options for both memory and CPU at the virtual machine level! See Numa Settings in Hyper-V 3.0 for a picture. So the days when you had to script WMI calls (see http://blogs.msdn.com/b/tvoellm/archive/2008/09/28/looking-for-that-last-once-of-performance_3f00_-then-try-affinitizing-your-vm-to-a-numa-node-.aspx) to assign a VM to a NUMA node might soon be over (speculation alert). It seems like a natural progression from the ability in W2K8R2 SP1 Hyper-V to disable NUMA spanning when you need to avoid NUMA issues at the Hyper-V host level. Hyper-V today is already pretty NUMA aware: it will try to get all memory for a virtual machine from a single NUMA node, and only when that can’t be done will it span across NUMA nodes. As stated, Hyper-V with Windows Server 2008 R2 SP1 can prevent this from happening, as we can now disable NUMA spanning for a Hyper-V host. The downside is that a virtual machine can’t get more memory than its NUMA node can provide, even if memory is available elsewhere on the host.
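To illustrate the "single node first, span only when needed" behaviour, here is a toy placement function in Python. It is a hypothetical model, not Hyper-V’s actual allocator: `place_vm_memory`, its arguments and the free-memory figures are all assumptions made for the example. It also shows the trade-off of disabling NUMA spanning: the VM fails to get its memory even though the host as a whole has enough free.

```python
from typing import Dict, List, Optional


def place_vm_memory(free_mb: List[int], request_mb: int,
                    allow_spanning: bool = True) -> Optional[Dict[int, int]]:
    """Hypothetical node-local-first placement (not Hyper-V's real algorithm).

    Returns {numa_node_index: MB taken from that node}, or None when the
    VM's memory cannot be placed.
    """
    if not free_mb or request_mb <= 0:
        raise ValueError("need at least one NUMA node and a positive request")

    # Nodes ordered by free memory, largest first.
    by_free = sorted(range(len(free_mb)), key=lambda n: free_mb[n], reverse=True)

    # 1. Prefer a single node so all memory access stays local.
    if free_mb[by_free[0]] >= request_mb:
        return {by_free[0]: request_mb}

    # 2. With NUMA spanning disabled, refuse even if the host as a whole
    #    has enough free memory; also refuse if the host simply lacks memory.
    if not allow_spanning or sum(free_mb) < request_mb:
        return None

    # 3. Otherwise spread the request across nodes; part of it is now remote.
    plan: Dict[int, int] = {}
    remaining = request_mb
    for node in by_free:
        take = min(free_mb[node], remaining)
        if take > 0:
            plan[node] = take
            remaining -= take
        if remaining == 0:
            break
    return plan


free = [6144, 5120]  # assumed free MB on a two-node host
print(place_vm_memory(free, 4096))                        # {0: 4096}
print(place_vm_memory(free, 8192))                        # {0: 6144, 1: 2048}
print(place_vm_memory(free, 8192, allow_spanning=False))  # None
```

In the last call the host has 11264 MB free in total, yet the 8192 MB request is refused because no single node can hold it.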

NUMA Spanning

A workable approach to reduce possible NUMA overhead is to limit the number of CPUs (sockets) to 2, as this gives each CPU the largest share of memory, in this case 50%. With 4 CPUs each one only controls 25%, etc. So with more CPUs (and NUMA nodes) the risk of NUMA spanning grows very fast. For memory-intensive applications scaling out is the way to go. Actually, you could state that we scale up the NUMA node per socket (lots of cores with as much directly accessible memory as possible) rather than scaling up the server. If you can, keep your virtual machines tied to a single CPU on a dual-socket server to try to prevent any remote memory access and thus a performance hit. But that won’t always work. If you ever wondered when an 8/12/16-core CPU comes in handy, well, voilà … here is a perfect case: packing as many cores as possible onto a CPU is very handy when you want to limit the number of sockets to prevent NUMA issues but still need plenty of CPU cycles. This should work as long as you can address large amounts of RAM per socket at high speeds and the CPU isn’t internally cut up into too many NUMA nodes. That would amount to scaling out NUMA nodes within the same CPU, and we don’t want that, or we’re back to a performance penalty.
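The 50%/25% figures follow from dividing host memory evenly over the NUMA nodes. A quick sketch, assuming evenly populated memory and ignoring whatever the parent partition reserves; the 96 GB host and the 32 GB VM are made-up numbers for illustration, and the function names are hypothetical.

```python
def local_share(numa_nodes: int) -> float:
    """Fraction of host RAM local to one NUMA node, assuming evenly populated memory."""
    if numa_nodes < 1:
        raise ValueError("need at least one NUMA node")
    return 1 / numa_nodes


def must_span(vm_gb: float, host_gb: float, numa_nodes: int) -> bool:
    """True if the VM's memory cannot fit in a single node's local memory."""
    return vm_gb > host_gb * local_share(numa_nodes)


HOST_GB = 96  # assumed host size
VM_GB = 32    # assumed memory-hungry VM, e.g. a SQL Server guest

for nodes in (2, 4, 8):
    print(f"{nodes} nodes: {local_share(nodes):.1%} local "
          f"({HOST_GB * local_share(nodes):g} GB), spans: {must_span(VM_GB, HOST_GB, nodes)}")
# 2 nodes: 50.0% local (48 GB), spans: False
# 4 nodes: 25.0% local (24 GB), spans: True
# 8 nodes: 12.5% local (12 GB), spans: True
```

With 2 nodes the 32 GB VM fits in one node’s 48 GB; with 4 or 8 nodes it has to span, and part of its memory becomes remote.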

Stacking The Deck

One way of stacking the deck in your favor is to keep the heavy apps on their own Hyper-V cluster. Then you can tweak it all you want to optimize for SQL Server, Exchange, etc. When you throw these virtual machines into your regular clusters or, for crying out loud, onto a VDI cluster, you’re going to wreak havoc on performance. Just like mixing server virtualization & VDI is a bad idea (don’t do it), throwing vCPU-hungry, memory-hogging servers onto those clusters just kills the performance and capacity of a perfectly good cluster. I have gotten into arguments over this, as some think one giant cluster for every need is better. Well, no: you’ll end up micromanaging the placement of VMs with very different needs on that cluster, effectively “cutting” it up into smaller “cluster parts”. Now, are separate clusters for different needs always the better approach? No, it depends. If you only have some small SQL Server needs, you can get away with one nice cluster. It depends, I know, the eternal consultant’s answer, but I have to say it. I don’t want to get angry mails from managers because someone set up a 6-node cluster for a couple of SQL Server Express databases ;-) There are also concepts called testing, proof of concept, etc. It’s called evidence-based planning. Try it; it has some benefits that become very apparent when you’re going to virtualize beefy SQL Server, SharePoint and Exchange servers.

How do you even know it is happening, apart from empirical testing? Aha, excellent question! Take a look at the "Hyper-V VM Vid Numa Node" counter set and read this blog entry on the subject: http://blogs.msdn.com/b/tvoellm/archive/2008/09/29/hyper-v-performance-counters-part-five-of-many-hyper-vm-vm-vid-numa-node.aspx. And keep an eye on the event log for the events described at http://technet.microsoft.com/hi-in/library/dd582929(en-us,WS.10).aspx (for some reason there is no comparable entry for W2K8R2 on TechNet).

Conclusions

To conclude, people, all of the above is why I’m interested in some of the latest generation of servers. Their hardware architecture allows a processor to address twice the "normal" amount of memory when you put only two CPUs on a quad-socket motherboard. The Dell PowerEdge R810 and M910 have this; it’s called the FlexMemory Bridge, and it makes more memory available without a performance hit. These servers also allow for more memory per socket at higher speeds. Normally, if you make a lot of memory directly addressable by one CPU, you see a speed drop: a Dell R710 with 48 GB of RAM runs at 1033 MHz, but put 96 GB in there and you fall back to 800 MHz. So yes, bring on those new quad-socket motherboards with just 2 sockets used, a bunch of fast, directly accessible memory in a neat 2U server package with lots of space for NICs & FC HBAs if needed. Virtualization heaven :-) That’s what I want, so I can give my VMs running SQL Server 2008 R2 & "Denali" (when can I call it SQL Server 2012?) a bigger amount of directly accessible memory from their NUMA node. This can be especially helpful if you need to run NUMA-unaware applications like SAP or such.

Testing is the way to find out how well a NUMA-aware hypervisor and a NUMA-aware application work out, together, the best approach to optimize the NUMA experience. I’m sure we’ll learn more about this as more information becomes available and as technology evolves. For now we optimize for performance with NUMA where we can, when we can, with what we have :-)

For Exchange 2010 (we now even have virtualization support for DAG mailbox servers) scaling out is easier, as it has neatly separated roles and Microsoft controls just about everything down to the mail client. With SQL Server applications this is often less clear. There is a varied selection of commercial and homegrown applications out there, and a lot of them can’t even scale out, only up. So your mileage may vary. But for resource- and memory-heavy applications under your control, scaling out is, for now, the way to go.

Building A New Lab For 2011 And Beyond


Well, with all the (Hyper-V) clustering, virtualization, System Center Suite, Exchange 2010 & Lync, SQL Server and iSCSI demands on my lab network, I really need to refresh my hardware. It sounds a bit like a paradox, but such is life for the people building all this stuff: they still need some hardware, pretty beefy machines actually, to set it all up, test it, break it, fix it and keep learning. I’ve outgrown my 4-year-old lab machines, which can’t take more than 4 GB of RAM. Now that I have finished all my infrastructure projects for 2010, I have time to focus on improving my old setup. Or at least I hope so; things are very busy. Thanks to the W2K8R2 SP1 beta I could use Dynamic Memory, which helped me keep churning away with these and various Exchange setups, but now, with Lync coming into the picture, I want and need an upgrade. A couple of SQL Servers in various high availability setups help eat up any remaining resources. Add to that the fact that I want to do some private cloud testing, and there it is. I need hosts with at least an Intel quad-core (i7) and at least 16 GB of DDR3 memory. They should have room for extra NICs. And I always try to get some speedy disks where it matters.

Windows Server 2008 R2 added support for Second Level Address Translation (SLAT), which Intel calls Extended Page Tables (EPT) and AMD calls Nested Page Tables (NPT) or Rapid Virtualization Indexing (RVI), so we can now make use of better graphics cards. Until now none of my processors had SLAT support. With the Intel i7 (Nehalem) processor I’m good to go. All the machines in my lab are Intel, so I’m sticking with Intel, as Hyper-V migrations don’t work between CPU brands.

So here’s a logical overview of my setup. This is what I already have in place with my current hardware, but now drawn with my coveted hardware refresh :-) Oh, yes, the dual 1 Gbps switches for iSCSI are new in this setup. I’m adding one so I can play with MPIO in the lab.


For disks I use 300 GB (16 MB cache, 10,000 rpm) and 600 GB (32 MB cache, 10,000 rpm) Raptors in combination with external eSATA 1 TB/2 TB Western Digital Black disks for storing VHDs, images, backups, etc. I have to buy some extra now. The faster disks are expensive, but a lab environment needs some performance, as waiting around for servers & virtual machines becomes a major annoyance when you need to get work done. The 10,000 rpm disks are great for iSCSI storage, for which I use the iSCSI Target from Windows Storage Server 2008 R2 via my TechNet subscription.

All this kit should keep me up and running from 2011 until the end of 2014. Is this expensive? Yes and no. I can reuse my 1 Gbps Intel NICs and most of my hard disks. I already have my network switches, monitors and KVM switches. So all in all it’s the new motherboards, CPUs and memory that will eat most of the budget. It’s quite a sum to put out, but here’s a note to all IT pros out there: you need to invest in yourself every now and then.

I’ve blogged about this before in http://workinghardinit.wordpress.com/2010/02/04/having-a-lab-using-it/. Self-improvement and learning are a continuous process that never ends. Sure, it has some peaks in financial cost when you need equipment. Remember, you don’t need to buy it all at once. Talk to your employer about this if you’re not self-employed. Look at how much a 5-day advanced course or a conference costs. You can use a lab to learn and experiment for many years to come, so the potential ROI is very good. In the end, what my employers and customers get out of this is knowledge, insight, skills and results. Think about it; it helps to put the investment in perspective. Sure, I invest more than just the hardware: my time, which is very valuable to me. You can’t make more time; everyone has the same 24 hours in a day. It really helps if you like this stuff and have fun while learning new technologies or setting up a proof of concept. In a way, what people put into their job and knowledge is an indicator of their professionalism. You do not become an expert by working 9 to 5 and only learning when a course is provided. It’s not going to happen. Even a genius who puts in the effort stands out amongst his or her peers. The same goes for you, but be smart about it. You can work yourself to death and not accomplish anything. So working smart & hard is the way to go.

New Spatial & High Availability Features in SQL Server Code-Named “Denali”


The SQL Server team is hard at work on SQL Server vNext, code name “Denali”. They have a whitepaper out on their web site, “New Features in SQL Server Code-Named “Denali” Community Technology Preview 1”, which you can download here.

As I do a lot of infrastructure work for people who really dig all this spatial and GIS-related “stuff”, I always keep an eye out for related information that can make their lives easier and enhance the use of the technology stack they own. Another of the new features coming in “Denali” is Availability Groups. More information will be available later this year, but for now I’ll leave you with the knowledge that it will provide Multi-Database Failover, Multiple Secondaries, Active Secondaries and Fast Client Connection Redirection, can run on Windows Server Core & supports Multisite (Geo) Clustering, as shown in the Microsoft (TechEd Europe, Justin Erickson) illustration below.

Availability Groups can provide redundancy for databases on both standalone instances and failover cluster instances using Direct Attached Storage (DAS), Network Attached Storage (NAS) and Storage Area Networks (SAN), which is useful both for physical servers in a high availability cluster and for virtualization. The latter is significant, as they will support it with Hyper-V Live Migration, whereas Exchange 2010 Database Availability Groups do not. I confirmed this with a Microsoft PM at TechEd Europe 2010. Download the CTP here and play all you want. Please pay attention to the fact that in CTP 1 a lot of stuff isn’t quite ready for showtime. Take a look at the TechEd Europe 2010 session on the high availability features here; you can also download the video and the PowerPoint presentation via that link. At first I thought MS might go the same way with SQL Server as they did with Exchange (less choice in high availability, but easier and covering all needs), but I don’t think they can. SQL Server applications are beyond Redmond’s control, whereas they do control Outlook & OWA. So I think the SQL Server team needs to provide backward compatibility and functionality way more than the Exchange team has. Brent Ozar (Twitter: @BrentO) wrote a blog post on “Denali”/Hadron, which you can read here: http://www.brentozar.com/archive/2010/11/sql-server-denali-database-mirroring-rocks/. What he says about clustering is true. I used to cluster Windows 2000/2003 and suffered some kind of mental trauma. That was completely cured with Windows 2008 (R2), and I’m now clustering Hyper-V, Exchange 2010, file servers, etc. with a big smile on my face. I just love it!