Options For A Highly Available Load Balanced RD Gateway Server Farm on Hyper-V


When you need to make the RD Gateway service highly available you have some options. On the RD Gateway side you have capability of configuring a farm with multiple RD Gateway servers.image

When in comes to the actual load balancing of the connections there are some changes in respect load balancing from Windows Server 2008 R2 that you need to de aware of! With Windows 2008 R2 you could do:

  1. Load balancing appliances (KEMP Loadmaster for example, F5, A10, …) or Application Delivery Controllers, which can be hardware, OEM servers, virtual and even cloud based (see Load Balancing In An Ever More Demanding Virtualized & Cloudy World). KEMP has Hyper-V appliances, many others don’t. These support layer 4, layer 7, geo load balancing etc. Each has it’s use cases with benefits and drawback but you have many options for the many situations you might encounter.
  2. Software load balancing. With this they mean Windows NLB. It works but it’s rather limited in regards to intelligence for failure detection & failover. It’s in no way an “Application Delivery Controller” as load balancer are positioned nowadays.
  3. DNS Round Robin load balancing. That sort of works but has the usual drawbacks for problem detection and failover.  Don’t get me wrong for some use cases it’s fine, but for many it isn’t.

I prefer the first but all 3 will do the basic job of load balancing the end-user connections based on the traffic. I have done 2 when it was good enough or the only option but I have never liked 3, bar where it’s all what’s needed, because it just doesn’t fit many of the uses cases I dealt with. It’s just too limited for many apps.

In regards to RD Gateway in Windows Server 2012 (R2), you can no longer use  DNS Round Robin for load balancing with the new HTTP transport. The reason is that it uses two HTTP channels (one for input and one for output) and DNS round robin cannot guarantee that both these connections will be routed trough the same RD Gateways server which is a requirement for it to work. Basically RRDNS will only work for legacy RPC-HTTP. RPC could reroute a channel to make sure all flows over the same node at the cost of performance & scalability. But that won’t work with HTTP which provides scalability & performance. Another thing to note is that while you can work without UDP you don’t want to. The UDP protocol is used  to deliver graphics with a better user experience  over even low quality networks for graphics or high and experiences with RemoteFX. TCP (HTTP) is can be used without it (at the cost of a lesser experience) and is also used to maintain the sessions and actions. Do note that you CANNOT use UDP alone as these connections are established only after the main HTTP connection exists between the remote desktop client and the remote desktop server. See Don’t Forget To Leverage The Benefits of RD Gateway On Hyper-V & RDP 8/8.1 for more information

So you will need a least Windows Network Load Balancing (WNLB) because that supports IP affinity to make sure all channels stick to the same node. UDP & HTTP can be on different nodes by the way. Also please not that when using network virtualization WNLB isn’t a good choice. It’s time to move on.

So the (or at least my) preferred method is via a real “hardware” load balancer.  These support a bunch of persistence options like IP affinity, cookie-based affinity, … just look at the screenshot below (KEMP Loadmaster)

image

But they also support layer 7 functionality for better health checking and failover.  So what’s not to like?

So we need to:

  1. Build a RD Gateway Farm with at least two servers
  2. Load balance HTTP/HTTPS for the RD Gateway farm
  3. Load balance UDP for the RD Gateway farm.

We’ll do this 100% virtualized on Hyper-V and we’ll also make make the load balancer it self highly available. Remember, removing single points of failure are like bottle necks. The moment you take one away you just hit the next one Smile.

Kemp has a great deployment guide for RDS on how to do this but I should ass that you could leverage SUB Virtual Services (SUBVS) to deal with the other workloads such as RD Web Access if they’re on the same server. They don’t mention this in the white paper but it’s an option when using HTTP/HTTPS as service type for both configurations. #1 & #2 are the SUB Virtual Services where I used this in a lab.

image

But for RD Gateway you can also leverage the Remote Terminal Service type and in this case you won’t leverage SUBVS as the service type is different between RD Gateway (Remote Terminal) and RD Web Access (HTTP/HTTPS). This is actually used by their RDS template you can download form their support site.

image

Hope this helps some of you out there!

Quick Demo Video Of Site Failover With KEMP Loadmaster Global Balancing


Here’s a quick video that demonstrates how you can achieve site failover with via the KEMP Loadmaster Global Balancing feature. As long as you know what this can do for you and realize that it about site failover and high availability and not continuous availability without a second of service interruption you can deliver nice results with this technology across city campuses or between cities.

In our scenario we normally connect to the primary data center (weighted round robin) and fail over to the DRC when the primary site fails for some reason.

It’s very busy at the moment but I hope to address this topic a bit more in detail in the future. All of this runs virtualized on Hyper-V and performs just fine.

SMB Direct With RoCE in a Mixed Switches Environment


I’ve been setting up a number of Hyper-V clusters with  Mellanox ConnectX3 Pro dual port 10Gbps Ethernet cards. These Mellanox cards provide a nice amount of queues (128) for DVMQ and also give us RDMA/SMB Direct capabilities for CSV & live migration traffic.

Mixed Switches Environments

Now RoCE and DCB is a learning curve for all of us and not for the faint of heart. DCB configuration is non trivial, certainly not across multiple hops and different switches. Some say it’s to be avoided or can’t be done.

You can only get away with a single pair of (uniform) switches in smaller deployments. On top of that I’m seeing more and more different types of switches being used to optimize value, so it’s not just a lab exercise to do this. Combine this with the fact that DCB is an unavoidable technology in networking, unless it get’s replaced with something better and easier, and you might as well try and learn. So I did.

Well right now I’m successfully seeing RoCE traffic going across cluster nodes spread over different racks in different rows at excellent speeds. The core switches are DELL Force10 S4810 and the rack switches are PowerConnect 8132Fs. By borrowing an approach from spine/leave designs this setup delivers bandwidth where they need it a a price point they can afford. They don’t need more expensive switches for the rack or the core as these do support DCB and give the port count needed at the best price point.  This isn’t supposed to be the top in non blocking network design. Nope but what’s available & affordable today in you hands is better than perfection tomorrow. On top of that this is a functional learning experience for all involved.

We see some pause frames being sent once in a while and this doesn’t impact speed that very much. It does guarantee lossless traffic which is what we need for RoCE. When we live migrate 300GB worth of memory across the nodes in the different racks we get great results. It varies a bit depending on the load the switches & switch ports are under but that’s to be expected.

Now tests have shown us that we can live migrate just as fast with non RDMA 10Gbps as we can with RDMA leveraging “only” Multichannel. So why even bother? The name of the game low latency and preserving CPU cycles for SQL Server or storage traffic over SMB3. Why? We can just buy more CPUs/Cores. Great, easy & fast right? But then with SQL licensing comes into play and it becomes very expensive. Also storage scenarios under heavy load are not where you want to drop packets.

Will this matter in your environment? Great question! It depends on your environment. Sometimes RDMA is needed/warranted, sometimes it isn’t. But the Mellanox cards are price competitive and why not test and learn right? That’s time well spent and prepares you for the future.

But what if it goes wrong … ah well if the nodes fail to connect over RDAM you still have Multichannel and if the DCB stuff turns out not to be what you need or can handle, turn it of and you’ll be good.

RoCE stuff to test: Routing

Some claim it can’t be done reliably. But hey they said that for non uniform switch environments too Winking smile. So will it all fall apart and will we need to standardize on iWarp in the future?  Maybe, but isn’t DCB the technology used for lossless, high performance environments (FCoE but also iSCSI) so why would not iWarp not need it. Sure it works without it quite well. So does iSCSI right, up to a point? I see these comments a lot more form virtualization admins that have a hard time doing DCB (I’m one so I do sympathize) than I see it from hard core network engineers. As I have RoCE cards and they have become routable now with the latest firmware and drivers I’d love to try and see if I can make RoCE v2 or Routable RoCE work over different types of switches but unless some one is going to sponsor the hardware I can’t even start doing that. Anyway, lossless is the name of the game whether it’s iWarp or RoCE. Who know what we’ll be doing in 5 years? 100Gbps iWarp & iSCSI both covered by DCB vNext while FC, FCoE, Infiniband & RoCE have fallen into oblivion? We’ll see.

Hyper-V Guest Protected Network Testing Tip


I’ve been pinged a few times over the years with people saying that the new protected network feature does not work for them. This setting is set per vNIC of the virtual machine.

image

The issue lies in how & what people test, bar any number of other reasons why a live migration might not start or complete.  What people tend to do is disable a NIC to which the vSwitch is connected. But a Protected Network is about media sense loss detection of network disconnects and this requires the NIC to be actually there and enabled. Remember, we’re talking about the NIC on the host connected to the virtual switch. A physical link failure here, meaning that the virtual switch the protected virtual network adapter no longer has network connectivity, will lead to all the VMs with  the protected network enabled do be live migrated to another node in the cluster that still has a connected virtual switch for the same network.  The latter is to avoid  senseless virtual machine migrations to other nodes that might also have lost connectivity due to a failed physical switch.

So the point is that testing by disabling the NIC in the OS will not do. You need to unplug the cables to the virtual switch or disable the port on the switch or even shutdown the switch (a bit drastic).

Do note that it can take a little time for the live migration to kick in,  it varies a bit, but it beats having to wait for the issue to be resolved. You’ll see event id 1255 logged when the VMs lose network connectivity:image

In this day and age with NIC teaming to redundant switches & the fact that you might be using converged networking these tests aren’t as simple as you might think. Also don’t pull out all if the cables used for clustering if you want the cluster to be able to help you out here with a live migration. Because when the other cluster nodes can’t talk to the node your testing in any way it will be kicked out of the cluster, the VMs will go down, be moved to another node and started. This might seem obvious but if you a are using a teamed 10Gbps solution in a converged setup this might cause exactly that.

Another thing to note is that if you have a virtual switch with a dedicated backup network exposed to hosts & VMs that can tolerate down time you might want to disable protected networks on that vNIC as you don’t want to live migrate the VMs of when that network has an issue. It all depends on your needs & tastes.

Last but not least please behave, and don’t do anything silly in production when testing this. Be careful in your testing.

Hot add/remove of network adapters and enabling device naming in Windows Server Hyper-V


One of the cool new features in Window Server vNext Hyper-V (in Technical Preview at the moment of writing) is that you gain the ability to hot add and remove NICs.  That might sound not to important to the non initiated in the fine art of virtualization & clouds. But it is. You see anything you can do to a VM configuration wise that does not require downtime is good. That’s what helps shift the needle of high availability to that holey grail of continuous availability.

On top of that the names of the network adapters are now exposed to the guest. Which is also great. It’s become lot easier to automate the VM network configuration.

Hot adding NICs can be done via the GUI and PoSh.

image

But naming the network adapter seems a PowerShell only game for now (nothing hard, no sweat). This can be done during creation of the network adapter. Here I add a NIC to VM RAGNAR connected to the ISCSI-GUEST switch & named ISCSI.

Add-VMNetworkAdapter –VMName RAGNAR –SwitchName ISCSI-GUEST –Name ISCSI

Now I want this name to be reflected into the VM’s NCI configuration properties. This is done by enabling device naming. You can do this via the GUI or PoSh.

Set-VMNetworkAdapter –VMName RAGNAR –Name ISCSI –Devicenaming On

That’s it.

image

So now let’s play with our existing network adapter “Network Adapter” which connects our Hyper-V guests to the LAN via the HYPER-V-GUESTS switch? Can you rename it?  Yes, you can. In PoSh run this:

Rename-VMNetworkAdapter –VMName RAGNAR –Name “Network Adapter” –NewName “LAN”

And that’s it. If you refresh the setting of your VM or reopen it you’ll see the name change.

image

The one thing that I see in the Tech Preview is that I need to reboot the VM to see the Name change reflected inside the VM in the NIC configuration under advance properties, called “Hyper-V Network Adapter Name”. Existing one show their old name and new one are empty until then.

image

 

Two important characteristics to note about enabling device naming

You notice that one can edit this field in NIC configuration of the VM but it doesn’t move up the stack into the settings of the VM. Security wise this seems logical to me and it’s not intended to work. It’s a GUI limitation that the field cannot be disabled for editing but no one can try and  be “funny” by renaming the ethernet adapter in the VMs settings via the guest Winking smile

Do note that this is not exactly the same a Consistent Device Naming in Windows 2012 or later. It’s not reflected in the name of the NIC in the GUI, these are still Ethernet, Ethernet 2, … Enable device naming is mainly meant to enable identifying the adapter assigned to the VM inside the VM, mainly for automation. You can name the NIC in the Guest whatever works best for you and you’ll never lose the correlations between the Network adapter in your VM settings and the Hyper-V Network Adapter name in the NIC configuration properties. In that respect is a bit more solid/permanent even if some one found it funny to rename all vNICs to random names you’re still OK with this feature.

That’s it off, you go! Download the Technical Preview bits from MSDN, start exploring and learning. Knowledge is seldom a bad thing Winking smile

Configuring timestamps in logs on DELL Force10 switches


When you get your Force10 switches up and running and are about to configure them you might notice that, when looking at the logs, the default timestamp is the time passed since the switch booted. During configuration looking at the logs can very handy in seeing what’s going on as a result of your changes. When you’re purposely testing it’s not too hard to see what events you need to look at. When you’re working on stuff or trouble shooting after the fact things get tedious to match up. So one thing I like to do is set the time stamp to reflect the date and time.

This is done by setting timestamps for the logs to datetime in configuration mode. By default it uses uptime. This logs the events in time passed since the switch started in weeks, days and hours.

service timestamps [log | debug] [datetime [localtime] [msec] [show-timezone] | uptime]

I use: service timestamps log datetime localtime msec show-timezone

F10>en
Password:
F10#conf
F10(conf)#service timestamps log datetime localtime msec show-timezone
F10(conf)#exit

Don’t worry if you see $ sign appear left or right of your line like this:

F10(conf)##$ timestamps log datetime localtime msec show-timezone

it’s just that the line is to long and your prompt is scrolling Winking smile.

This gives me the detailed information I want to see. Opting to display the time zone and helps me correlate the events to other events and times on different equipment that might not have the time zone set (you don’t always control this and perhaps it can’t be configured on some devices).

image

As you can see the logging is now very detailed (purple). The logs on this switch were last cleared before I added these timestamps instead op the uptime to the logs. This is evident form the entry for last logging  buffer cleared: 3w6d12h (green).

Voila, that’s how we get to see the times in your logs which is a bit handier if you need to correlate them to other events.

Migrate A Windows 2003 RADIUS–IAS Server to Windows Server 2012 R2


Some days you walk into environments were legacy services that have been left running for 10 years as:

  1. They do what they need to do
  2. No one dares touch it
  3. Have been forgotten, yet they provide a much used service

Recently I had the honor of migrating IAS that was still running on Windows Server 2003 R2 x86, which was still there for reason 1. Fair enough but with W2K3 going it’s high time to replace it. The good news was it had already been virtualized (P2V) and is running on Hyper-V.

Since Windows 2008 the RADIUS service is provided by Network Policy Server (NPS) role. Note that they do not use SQL for logging.

Now in W2K3 there is no export/import functionality for the configuration in IAS. So are we stuck? Well no, a tool has been provided!

Install a brand new virtual machine with W2K12R2 and update it. Navigate to C:\Windows\SysWOW64 folder and grab a copy of IasMigReader.exe.

image

Place IasMigReader.exe in the C:\Windows\System32 path on the source W2K3 IAS server as that’s configured in the %path% environment variable and it will be available anywhere from the command prompt.

  • Open a elevated command prompt
  • Run IasMigReader.exe

image

  • Copy the resulting ias.txt file from the  C:\Windows\System32\IAS\folder. Please keep this file secure it contains password. TIP: As a side effect you can migrate your RADIUS even if no one remembers the shared secrets and you now have them again Winking smile

image

Note: The good news is that in W2K12 (R2) the problem with IasMigReader.exe generating a bad parameter in ias.txt is fixed ((The EAP method is configured incorrectly during the migration process from a 32-bit or 64-bit version of Windows Server 2003 to Windows Server 2008 R2). So no need to mess around in there.

  • Copy the ias.tx file to a folder on your target NPS server & run the following command from an elevated prompt:

netsh nps import <path>\ias.txt

image

  • Open the NPS MMC and check if this went well, normally you’ll have all your settings there.

image

When Network Policy Server (NPS) is a member of an Active Directory® Domain Services (AD DS) domain, NPS performs authentication by comparing user credentials that it receives from network access servers with the credentials that are stored for the user account in AD DS. In addition, NPS authorizes connection requests by using network policy and by checking user account dial-in properties in AD DS.

For NPS to have permission to access user account credentials and dial-in properties in AD DS, the server running NPS must be registered in AD DS.

Membership in Domain Admins , or equivalent, is the minimum required to complete this procedure.

  • All that’s left to do now is pointing the WAPs (or switches & other RADIUS Clients) to the new radius servers. On decent WAPs this is easy as either one of them acts as a controller or you have a dedicated controller device in place.
  • TIP: Most decent WAPS & switches will allow for 2 Radius servers to be configured. So if you want you can repeat this to create a second NPS server with the option of load balancing. This provides redundancy & load balancing very easily. Only in larger environments multiple NPS proxies pointing to a number of NPS servers make sense.Here’s a DELL PowerConnect W-AP105 (Aruba) example of this.

image