Post MVP Summit – Back to reality


Coming home after the MVP Global Summit is a moment of reflection, or rather, the trip home is. The Summit is a time of intense interaction with peers who are a very varied bunch of experienced technologists. Next to their hands-on Microsoft stack expertise they also bring their experience with other technologies and companies. This gives us the opportunity to talk to each other and exchange knowledge and views. Pour in the feedback and the discussions with the Microsoft Program Managers and their management, and this goes on from sunrise to sunset. It pays to come early and stay an extra day, as that opens up time for more meetings and discussions in and around Redmond.

image

The end result is a truckload of information and impressions we need to parse. That can take some time. And we need to filter our conclusions for our management. The content of the MVP Summit and all the talks around it is strictly under NDA. The insights and ideas we harvest from it we can leverage, but we cannot expose the information itself.

On Microsoft’s side they get a reality check: open and honest feedback, our opinions and our ideas. They learn about our successes and challenges in the real world. If that were not helpful to them they wouldn’t want us showing up on their campus disrupting their work week.

To me it’s also a reality check. What am I doing? How am I doing it, and why? Even more importantly, where am I doing these things and is what I use the best choice? It shows me my own strengths and weaknesses. That’s valuable as well.

Well, the good news is that, judging by some requests and the opinions of my peers, I’m a valued expert and architect. I do have some weaknesses but I’m on track to address those. The balancing act here is that we have to avoid wasting time on dying opportunities that are still needed but are heading downhill fast. Not so much because the technology is obsolete or no longer needed, but mostly due to politics and a poor understanding of how to deliver IT cost-effectively and efficiently. The amount of self-inflicted wounds and pain can be shockingly high. The trick is to avoid those projects as that’s wasted time, time that should be spent on moving forward. Sometimes it looks like the nineteen nineties all over again out there.

One thing is very clear. Those that seek a single solution, a one-size-fits-all approach, just for the sake of simplicity or perceived economies of scale, will fail. A bipolar approach without a place for the vast amount of “stuff” in between, let alone a realistic and sound technical plan to integrate it all, is also going to fail. Ask any plumber Winking smile. Learn how to think independently and don’t grow too dependent on industry analysts. Do what’s right for your needs.


NVM Express over Fabrics


Any technologist who’s read about NVM Express (NVMe), let alone used it, is pretty enthusiastic about its capabilities, and if it weren’t for availability and financial restrictions we’d all have at least a couple in our home systems and labs. It seems to succeed very well in making sure the host can keep up with the performance (low latencies, high throughput) delivered by SSD drives, better than our current interfaces do.

This means that many are very happy with future visions of how PCIe will dislodge SAS/SATA as the preferred SSD interface. This might seem feasible for local storage right now, but how do we deal with this in an actual storage array, and what if we want to size this to a larger scale? There are no “PCIe JBODs”. So what does one do? Well, how did we do it in the past with FC? We created a fabric. Below we see several local & remote NVMe architectures, even hybrid ones with traditional SAS.

image

That’s exactly what NVM Express Inc. is doing: creating the specs for a fabric. This holds the promise of achieving superior results due to the elimination of SCSI translation, which reduces latency significantly by delivering NVMe end to end. Not only that, but we also see the following efforts in the NVM Express Specification 1.2 to give it enterprise-grade capabilities beyond pure performance.

  • Enhanced status reporting
  • Expanded capabilities including live firmware updates

There have been some early demos of NVMe over Fabrics, mainly focusing on the “remote” performance. While local NVMe SSDs have the edge on absolute IOPS, the difference with NVMe over a fabric is not significant. The added latency of going over the fabric is measured at less than 10 µs, so that’s good news. The fabric leverages RDMA (yes, yet another reason why the time I’ve spent with this technology has been a useful investment). This can be InfiniBand, RoCE or iWARP. There’s also the new kid on the block, “Intel Omni Scale” (even if their early demo used iWARP). There’s also a Mellanox RoCE demo.

image

Now, with NVMe SSD disk speeds it seems the writing is on the wall: ever better fabric performance will be needed to support the tremendous throughput this evolution of storage can deliver. RDMA seems poised for success in this regard. Yes, strictly speaking the NVMe traffic does not require RDMA, but let’s just say I don’t see anyone building it without. I also think this means even iWARP fabrics will use DCB (PFC) to make sure we have a lossless network. The amount of traffic will be immense, so why not optimize for the best possible performance? I hold the opinion that this is already beneficial today for east-west traffic in larger environments, especially in converged networks. Unless the Intel® Omni-Path Architecture blows everyone else away, that is Smile. Too early to tell.

Now, does this dictate the total and absolute obsolescence of iSCSI and FC? No. There is no reason why an NVMe over Fabrics storage solution cannot offer storage to hosts via FC, iSCSI, SMB 3, NFS, FCoE, … They could potentially even offer iWARP, RoCE or InfiniBand to the hosts, so you won’t lose your prior investments or get locked into one. I have no crystal ball, so I cannot tell you if this will happen. What I do know is that when it comes to MPIO versus multichannel for load balancing and even failover and recovery, multichannel sometimes does a (far?) superior job in my honest opinion, especially with older hypervisors, even when the hypervisor uses separate sessions per virtual machine to achieve better load balancing over iSCSI or the like. Anyway, I digress. One thing I do know is that I’ll keep a keen eye on what Microsoft is doing in this space, especially in regards to Windows Server 2016. It’s time to up the level of scalability & support for newer technologies once again.

Microsoft Ignite Here I Come


Ignite is coming closer and I’m off to Chicago soon to attend. I’ll be focusing on a couple of things. One of them is vNext, which means Hyper-V and everything related to the network and storage stack. The other is Azure and anything related to the above-mentioned stack, as well as identity/security.

That should be sufficient to keep me busy, as next to that I’ll be having meetings with the Microsoft product groups and various vendors/partners on their offerings and plans.

The remaining time will be allocated to networking and talking shop with the international community. I’m looking forward to meeting up with so many buddies from across the globe and diving into our beloved subjects. If you read my blog, follow me on Twitter and you’re there, let me know. We can meet and greet!

https://i0.wp.com/blog.symphonyiri.com/wp-content/uploads/Untitled-6.jpg

So let’s ignite the future of technology and prepare for our own future as well. Remember, it’s you who needs to invest in yourself and your career. Employee, independent consultant or civil servant, it doesn’t matter: while helping others succeed, keep working on your own lifelong education and future.

But before I’m in Chicago I need to travel there, so we’ll hop onto one of those nice Boeings (I visited the factory, an amazing experience) for a long-haul flight across the big pond. See you there!

image

DCB ETS Demo with SMB Direct over RoCE (RDMA)


It’s time to demonstrate ETS in action! There is a quick video on ETS on Vimeo to show what it looks like.

I’m using Mellanox ConnectX-3 Ethernet cards in a 2-node DELL PowerEdge R720 Hyper-V cluster lab. We’ve configured the two ports for SMB Direct & set live migration to leverage them both over SMB Direct. For the purpose of this demo we’ll generate non-RDMA (TCP/IP) traffic over these two 10Gbps ports to simulate a problematic scenario where all bandwidth is already being used, and to see how Enhanced Transmission Selection (ETS) helps in this scenario. I have done this with DELL Force10, PowerConnect 8100, N4000 series switches or a mix of them. This particular demo leveraged PC8132Fs. I use what’s available to me in the lab at the time of writing.
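For reference, pointing live migration at SMB (and thus SMB Direct) on the hosts boils down to something like the sketch below. The NIC names and the migration subnet are illustrative lab values, so adapt them to your own environment.

#Let Hyper-V perform live migrations over SMB so SMB Multichannel/SMB Direct can kick in
Enable-VMMigration
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB
#Limit live migration to the subnet(s) of the RDMA capable NICs (illustrative subnet)
Add-VMMigrationNetwork 10.10.110.0/24
#Make sure RDMA is enabled on both ports used for this traffic
Enable-NetAdapterRdma -Name "RDMA-NIC1","RDMA-NIC2"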

To achieve this network load we leverage ntttcp.exe to generate the non-RDMA TCP/IP traffic. Using the Mellanox QoS counters we visualize this. In blue you see the sending traffic from node A, in red the receiving traffic on node B. Note that this traffic is tagged with priority 1. We tag SMB Direct traffic with priority 4.
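If you want to reproduce this load, a sketch of the ntttcp.exe commands is below. The IP address, thread count and runtime are illustrative, and the exact flags can differ between NTttcp versions, so check them against the release you downloaded.

#On node B: receive on the IP of its RDMA capable NIC (illustrative address)
.\ntttcp.exe -r -m 16,*,10.10.110.2 -t 120
#On node A: send with 16 threads spread over all cores so a single core doesn't become the bottleneck
.\ntttcp.exe -s -m 16,*,10.10.110.2 -l 128K -t 120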

image

You can see that both Mellanox cards are running at full bandwidth, 2 * 10Gbps from node A to node B, and it’s all non-RDMA traffic. Also note that I’m hitting all 16 physical cores (hyper-threading is enabled). By doing so I avoid being bottlenecked by a single core, as in contrast to RDMA traffic there’s no huge CPU offload going on here.

image

As these are the cards I have assigned to live migration (depending on the setup also CSV or SOFS traffic) over SMB Direct, you’ll see that the competition for bandwidth will be fierce if we don’t have a mechanism to steer this to a desired outcome. That’s exactly what we leverage DCB with PFC and ETS for.

So let’s kick off a live migration of 4 virtual machines with 10GB of memory each. That should take about 20 seconds on 2 * 10Gbps cards. We first live migrate them from node B to node A. That’s in the reverse direction of where we are sending the TCP/IP traffic. You see 10Gbps being used all over, and this is expected.

image

Remember that the network is full duplex. That means a port can send at 10Gbps and receive at 10Gbps at the same time (TCP/IP from node A to node B, RDMA from node B to node A). Actually, if the backplane of the switch is powerful enough you can do so on all ports. So this is normal. Node A is sending TCP/IP traffic to node B at line speed and node B is sending SMB Direct traffic to node A (the live migration) at line speed.

But what if we live migrate over SMB Direct in the same direction as the TCP/IP traffic is going, from node A to node B? Well have a look. To me this looks awesome.

image

ETS kicks in immediately. We configured the minimum bandwidth for SMB Direct traffic to be 90%. Anything left after that (10%) is given to the other traffic, in this demo the TCP/IP traffic we generated. As the priority 4 tagged RoCE traffic is also configured to be lossless with PFC, you don’t have to worry about dropped packets under contention. Now think about this and how you can steer your traffic behavior at times when the resources need to be divided amongst competing workloads.

I hope you now have a better idea of why QoS is useful, how it works and that it indeed does work. While I have taken the opportunity to demonstrate this with SMB Direct over RoCE, I’d like to stress that QoS is not just about RoCE, where it’s “mandatory” because RoCE requires at least PFC. It’s very much a needed tool that’s beneficial in any converged scenario, and the optional ETS might be a very good idea on top of that, depending on your environment.

Again, to get you a better idea, here’s a short, quick video on ETS on Vimeo.

DCB PFC Demo with SMB Direct over RoCE (RDMA)


In this blog post we’ll demo Priority Flow Control. We’re using the demo configuration as described in SMB Direct over RoCE Demo – Hosts & Switches Configuration Example.

There is also a quick video to illustrate all this on Vimeo. It’s not training-course grade, I know, but my time to put into these is limited.

I’m using Mellanox ConnectX-3 Ethernet cards in a 2-node DELL PowerEdge R720 Hyper-V cluster lab. We’ve configured the two ports for SMB Direct & set live migration to leverage them both over SMB Direct. For that purpose we tagged SMB Direct traffic with priority 4 and all other traffic with priority 1. We only made priority 4 lossless, as that’s required for RoCE; the other traffic will deal with not being lossless by virtue of being TCP/IP.

Priority Flow Control is about making traffic lossless. Well, some traffic. While we’d love to live by Queen’s lyrics “I want it all, I want it all and I want it now”, we are limited. If not by our budgets, then most certainly by the laws of physics. To make sure we all understand what PFC does, here’s a quick reminder: it tells the sending party to stop sending packets, i.e. to pause a moment (in our case for the SMB Direct traffic), so we can handle the traffic without dropping packets. RoCE is, for all practical purposes, InfiniBand over Ethernet and is not TCP/IP, so you don’t have the benefit of a protocol that deals with dropped packets, retransmission, … meaning the fabric has to be lossless*. So no, it DOES NOT tell non-priority traffic to slow down or stop. If you need to tell other traffic to take a hike, you’re in ETS country 🙂

* If any switch vendor tells you not to bother with DCB and to just build (read: buy their switches = $$$$$) a lossless fabric (does that exist?) and rely on the brute-force quality of their products for a lossless experience … that could be an interesting experiment Smile.

Note: To even be able to start SMB Direct, SMB Multichannel must be enabled, as this is the mechanism used to identify RDMA capabilities, after which an RDMA connection is attempted. If this fails you’ll fall back to SMB Multichannel over TCP/IP. So you will still have network connectivity.
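A quick way to check that SMB sees your NICs as RDMA capable, and that the SMB Multichannel connections can actually use RDMA, is sketched below with the standard SMB cmdlets. Run the second one while SMB traffic such as a live migration is flowing.

#Does the SMB client consider these interfaces RDMA capable? Check the RDMA Capable column.
Get-SmbClientNetworkInterface
#Are the current SMB Multichannel connections RDMA capable on both ends?
Get-SmbMultichannelConnection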

You want RDMA to work and be lossless. To visualize this we can turn to the switch, where we leverage the counter statistics to see PFC frames being sent and received. A lab example from a DELL PowerConnect 8100/N4000 series switch below.

image

To verify that RDMA is working as it should, we also leverage the Mellanox adapter diagnostic counters and the native Windows RDMA Activity counters. First of all, make sure RDMA is working properly. Basically you want the error counters to be zero and to stay that way.

Mellanox wise these must remain at zero (or not climb after you got it right):

  • Responder CQE Errors
  • Responder Duplicate Request Received
  • Responder Out-Of-Order Sequence Received
  • … there’s lots of them …

image

Windows RDMA Activity wise these should be zero (or not climb after you got it right):

  • RDMA completion Queue Errors
  • RDMA connection Errors
  • RDMA Failed connection attempts

image
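For a quick scripted look at the same native counters, something along the lines of the sketch below works. The “RDMA Activity” counter set name is the one Performance Monitor shows on our Windows Server 2012 R2 hosts; verify it on your own systems first with the -ListSet call.

#Confirm the name of the RDMA counter set exposed on your hosts
Get-Counter -ListSet "*RDMA*" | Select-Object CounterSetName, Paths
#Then sample all RDMA Activity counters for the RDMA capable NICs; the error counters should stay at zero
Get-Counter -Counter "\RDMA Activity(*)\*" -SampleInterval 5 -MaxSamples 12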

The event logs are also your friend, as issues will create log entries to look out for.

PowerShell is your friend (adapt severity levels according to your needs!)

Get-WinEvent -ListLog "*SMB*" | Get-WinEvent -ErrorAction SilentlyContinue | ? { $_.Level -lt 4 -and $_.Message -like "*RDMA*" } | FL LogName, Id, TimeCreated, Level, Message

Entries like this are clear enough: it ain’t working!

The network connection failed.
Error: The I/O request was canceled.
Connection type: Rdma
Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks port 445 or 5445 can also cause this issue.
 
RDMA interfaces are available but the client failed to connect to the server over RDMA transport.
Guidance:
Both client and server have RDMA (SMB Direct) adaptors but there was a problem with the connection and the client had to fall back to using TCP/IP SMB (non-RDMA).

 

To view the PFC action in Windows we rely on the Mellanox Adapter QoS Counters.

image
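If you prefer PowerShell over Performance Monitor for this, a hedged sketch is below. The counter set name comes from the Mellanox WinOF driver as referenced above, but the exact counter names vary per driver version, so list them first.

#Discover the exact counter paths the Mellanox driver exposes for QoS/PFC
(Get-Counter -ListSet "Mellanox Adapter QoS Counters").Paths
#Then sample the pause frame related counters, for example (illustrative wildcard path):
#Get-Counter -Counter "\Mellanox Adapter QoS Counters(*)\*pause*" -Continuous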

Below you’ll see the number of  pause frames being sent & received on each port. Click on the image to enlarge.

image

An important note when trying to make sense of it all: pause frames are sent and received hop to hop. So if you see a pause frame being sent on a server NIC port, you should see it being received on the switch port it connects to, not on the Windows host you are live migrating from. The 4 pause frames sent in the screenshot above are received by the switch port, as you can see from the PFC stats for that port.

image

People, if you don’t see errors in the error counters and the event viewer, that’s good. If you see the PFC pause frame counters move up a bit, that’s (unless excessive) also good and normal; that’s PFC doing its job, making sure the traffic is lossless. If they are zero and stay zero forever, it’s not that you bought a lossless fabric that doesn’t need DCB; it’s more likely that your DCB/PFC is not working Winking smile and you do not have a lossless fabric at all. The counters are cumulative over time, so they don’t reset to zero bar resetting the NIC or a reboot.

image

When testing, feel free to generate lots of traffic all over the place on the involved ports & switches. This helps with seeing all this in action and verifying that RDMA/PFC works as it should. I like to use ntttcp.exe to generate traffic; the most recent version will let you really put a load on 10Gbps and higher NICs. Hammer that network as hard as you can Winking smile.

Again a simple video to illustrate this on Vimeo.

SMB Direct over RoCE Demo – Hosts & Switches Configuration Example


As mentioned in Where SMB Direct, RoCE, RDMA & DCB fit into the stack, this post’s only function is to give you an overview of the configuration used in the demo blogs/videos. First we’ll configure one Windows Server 2012 R2 host. I hope it’s clear this needs to be done on ALL hosts involved. The NICs we’re configuring are the 2 RDMA-capable 10GbE NICs we’ll use for CSV traffic, live migration and our simulated backup traffic. These are Mellanox ConnectX-3 RoCE cards we hook up to a DCB-capable switch. The commands needed are below and the explanation is in the comments. Do note that the choice of the 2 policies, priorities and minimum bandwidths is specific to this demo; what’s needed will depend on your environment.

#Install DCB on the hosts
Install-WindowsFeature Data-Center-Bridging
#Mellanox/Windows RoCE drivers don't support DCBx (yet?), disable it.
Set-NetQosDcbxSetting -Willing $False
#Make sure RDMA is enabled on the NICs (it should be by default)
Enable-NetAdapterRdma -Name RDMA-NIC1
Enable-NetAdapterRdma -Name RDMA-NIC2
#Start with a clean slate
Remove-NetQosTrafficClass -confirm:$False
Remove-NetQosPolicy -confirm:$False

#Tag the RDMA NIC with the VLAN chosen for PFC network
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC1" -RegistryKeyword "VlanID" -RegistryValue 110
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC2" -RegistryKeyword "VlanID" -RegistryValue 120

#SMB Direct traffic to port 445 is tagged with priority 4
New-NetQosPolicy "SMBDIRECT" -netDirectPortMatchCondition 445 -PriorityValue8021Action 4
#Anything else goes into the "default" bucket with priority tag 1 🙂
New-NetQosPolicy "DEFAULT" -default  -PriorityValue8021Action 1

#Enable PFC (lossless) on the priority of the SMB Direct traffic.
Enable-NetQosFlowControl -Priority 4
#Disable PFC on the other traffic (TCP/IP, we don't need that to be lossless)
Disable-NetQosFlowControl 0,1,2,3,5,6,7

#Enable QoS on the RDMA interface
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC1"
Enable-NetAdapterQos -InterfaceAlias "RDMA-NIC2"

#Set the minimum bandwidth for SMB Direct traffic to 90% (ETS, optional)
#No need to do this for the other priorities: everything not configured
#explicitly goes into the default class with the remaining bandwidth.
New-NetQoSTrafficClass "SMBDirect" -Priority 4 -Bandwidth 90 -Algorithm ETS
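Once applied, you can verify the host-side configuration with the standard DCB/QoS cmdlets; a quick sketch:

#Verify the QoS policies, the PFC settings and the ETS traffic class we just created
Get-NetQosPolicy
Get-NetQosFlowControl
Get-NetQosTrafficClass
#Confirm QoS and RDMA are active on the adapters
Get-NetAdapterQos
Get-NetAdapterRdma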

We’ll also show you in general how to set up the switch. Don’t sweat the exact syntax and way of getting it done; it differs between switch vendors and models (we used DELL Force10 S4810 and PowerConnect 8100 / N4000 series switches), but while it’s all very alike it’s also very specific. The important thing is that you see how what you do on the switches maps to what you did on the hosts.

!Disable 802.3x flow control (global pause)- doesn't mix with DCB/PFC
workinghardinit#configure
workinghardinit(conf)#interface range tengigabitethernet 0/0 -47 
workinghardinit(conf-if-range-te-0/0-47)#no flowcontrol rx on tx on
workinghardinit(conf-if-range-te-0/0-47)# exit
workinghardinit(conf)# interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48-52)#no flowcontrol rx on tx off
workinghardinit(conf-if-range-fo-0/48-52)#exit

!Enable DCB & Configure VLANs
workinghardinit(conf)#service-class dynamic dot1p
workinghardinit(conf)#dcb enable
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config
workinghardinit#reload

!We use a different VLAN per subnet
workinghardinit#configure
workinghardinit(conf)#interface vlan 110
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit(conf)#interface vlan 120
workinghardinit (conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
workinghardinit (conf-if-vl-vlan-id*)#tagged port-channel 3
workinghardinit (conf-if-vl-vlan-id*)#exit


!Create & configure DCB Map Policy
workinghardinit(conf)#dcb-map SMBDIRECT
workinghardinit(conf-dcbmap-profile-name*)#priority-group 0 bandwidth 90 pfc on 
workinghardinit(conf-dcbmap-profile-name*)#priority-group 1 bandwidth 10 pfc off 
workinghardinit(conf-dcbmap-profile-name*)#priority-pgid 1 1 1 1 0 1 1 1
workinghardinit(conf-dcb-profile-name*)#exit 

!Apply DCB map to the switch ports & uplinks
workinghardinit(conf)#interface range ten 0/0-47
workinghardinit(conf-if-range-te-0/0-47)# dcb-map SMBDIRECT 
workinghardinit(conf-if-range-te-0/0-47)#exit
workinghardinit(conf)#interface range fortyGigE 0/48 , fortyGigE 0/52
workinghardinit(conf-if-range-fo-0/48,fo-0/52)# dcb-map SMBDIRECT
workinghardinit(conf-if-range-fo-0/48,fo-0/52)#exit
workinghardinit(conf)#exit
workinghardinit#copy running-config startup-config 

With the hosts and the switches configured we’re ready for the demos in the next two blog posts. We’ll show Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS) in action with some tips on how to test this yourselves.

Where SMB Direct, RoCE, RDMA & DCB fit into the stack


I’m assuming most of you are at least familiar with the concepts of converged networking, SMB Multichannel and SMB Direct. This is not going to be a lesson on those subjects. We’re just setting the stage here for our simple demo configuration and its relation to real-world scenarios. This is to remind you of the why and where of what we do and demo in our next blog posts on SMB Direct over RoCE with two DCB features: Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS).

Generalized and simplified a modern virtualized data center network looks a lot like this:

image

It’s more or less converged, meaning all kinds of traffic move over the same infrastructure, which is great for standardization and your budget. Unless you get into performance issues. That’s where QoS can help. As we’re doing SMB Direct over RoCE we’ll use DCB to handle QoS. Mind you, QoS is an aid; it will not help if you try to push too much over too little bandwidth. Let’s zoom in a bit on the Hyper-V & storage side of things. In general, the RDMA-capable variant of a modern SOFS / Hyper-V environment network looks as below in a bit more detail:

image

The RDMA capable traffic is SMB Direct over RoCE in this use case. This is used for Live Migration, CSV Traffic & storage traffic to the SOFS Server.

DCB cannot distinguish between these SMB traffic use cases. It’s all RDMA traffic over port 445 and the DCB configuration will not tell them apart. That’s why on top of DCB we leverage SMB Bandwidth Limit (see https://blog.workinghardinit.work/2013/09/03/preventing-live-migration-over-smb-starving-csv-traffic-in-windows-server-2012-r2-with-set-smbbandwidthlimit/). This prevents the live migration traffic from pushing aside the storage traffic. This is a Windows-configured feature and does not rely on DCB or other forms of QoS.
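For completeness, enabling and setting such a limit for live migration looks roughly like the sketch below. The 1250MB/s value is purely illustrative (roughly one 10Gbps port); pick what fits your design.

#The SMB bandwidth limit feature has to be installed first
Add-WindowsFeature FS-SMBBW
#Cap live migration over SMB so it cannot starve CSV/storage traffic
Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 1250MB
#Check what's configured
Get-SmbBandwidthLimit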

To make sure cluster traffic, backups, data copies, management traffic etc. don’t starve each other, we implement QoS leveraging DCB (the ETS part). As we need DCB with RoCE in real-world scenarios to make it lossless (the PFC part), and as you do not mix different QoS approaches on the same network stack, we stick with DCB for the other workloads on that same network stack.

image

Mind you, this does not prevent scenarios where management and backups are done over vNICs on the Hyper-V switch and where we leverage Hyper-V QoS, as that’s another network stack.
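As an illustration of that other stack, Hyper-V QoS on a converged vSwitch looks something like the sketch below. The switch name, team name, vNIC names and weights are all illustrative.

#Create a vSwitch that uses weight-based minimum bandwidth and add management OS vNICs
New-VMSwitch -Name "ConvergedSwitch" -NetAdapterName "LOM-Team" -MinimumBandwidthMode Weight -AllowManagementOS $false
Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "ConvergedSwitch"
Add-VMNetworkAdapter -ManagementOS -Name "Backup" -SwitchName "ConvergedSwitch"
#Give each vNIC a minimum bandwidth weight (relative shares, not hard caps)
Set-VMNetworkAdapter -ManagementOS -Name "Management" -MinimumBandwidthWeight 10
Set-VMNetworkAdapter -ManagementOS -Name "Backup" -MinimumBandwidthWeight 30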

In our lab demos we’ll keep things simple: we’ll do live migration over SMB Direct (RoCE) and we’ll simulate intense backup traffic over the same pair of NICs, to illustrate a RoCE configuration that guarantees minimal bandwidth to both and keeps the RDMA traffic lossless (PFC). To make it very clear: we’ll do a demo setup where we use two 10GbE NICs per host and allocate a minimum bandwidth of 90% to live migration, leaving the remaining 10% minimum bandwidth to all other traffic (which includes our intense backup traffic). Read more about the configuration in SMB Direct over RoCE Demo – Hosts & Switches Configuration Example.