About workinghardinit

IT without the sales brochure gloss

10Gbps Cheap & Without Risk In Even The Smallest Environments


Over the last 18 months cheaper, commodity, small port count, but high quality 10Gbps switches have become available. NetGear is a prime example. This means 10Gbps networking is within reach for even the smallest deployments.

Size is an often used measure for technological needs like storage, networking and compute but in many cases it’s way too blunt of a tool. A lot of smaller environments in specialized niches need more capable storage  and networking capacities than their size would lead you to believe. The “Enterprise level” cost associated with the earlier SPF+ based swithes was an obstacle especially since the minimum port count lies around 24 ports, so with switch redundancy this already means 2 *24 ports.  Then there’s the cost of vendor branded SPF+ modules. But that could be offset with Copper Twinax Direct Attach cabling (which have their sweet spots for use) or finding functional cheaper non branded SFP+ modules. But all that isn’t an issue anymore. Today 10GBase-T card & switches are readily available and ready for prime time. The issues with power consumption and heat have been dealt with.

While vendors like DELL have done some amazing work to bring affordable 10Gbps switches to the market it remained a obstacle for many small environments. Now with the cheaper copper based, low port count switches it’s become a lot easier to introduce 10Gbps while taking away the biggest operational pains.

  • You can start with a lower number of 10Gbps ports (8-12) instead of  a minimum of 24.
  • No need for expensive vendor branded SPF+ modules.
  • Copper cabling (CAT6A) is relatively cheap for use in a rack or between two racks and for this kind of environment using patch lead cables isn’t an issue
  • Power consumption and heat challenges of copper 10Gbps has been addressed.

8port10Gbps

So even for the smallest setups where people would love to get 10Gbps for live migrations, hypervisor host backups and/or the virtual network it can be done now. If you introduce these for just CSV, live migration, storage or backup networks you can even avoid having to integrate them into the data network. This makes it easier, non disruptive & the isolation helps puts minds at easy about potential impacts of extra traffic and misconfigurations. Still you take away the heavy loads that might be disrupting your 1Gbps network, making things well again without needing further investments.

So go ahead, take the step and enjoy the benefits that 10Gbps bring to your (virtual) environment. Even medium sized shops can use this as a show case while they prepare for a 10Gbps upgrade for the server room or data center in the years to come.

NTFS Permissions On A File Server From Hell Saved By SetACL.exe & SetACL Studio


Most IT people don’t have a warm and fuzzy feeling when NTFS permissions & “ACLing” are being discussed. While you can do great & very functional things with it, in reality when dealing with file servers over time “stuff” happens. Some of it technical, most of it is what I’ll call “real life”. When it comes to file servers, real life, especially in a business environment, has very little respect, let alone consideration for NFTS/ACL best practices. So we all end up dealing with the fall out of this phenomena. If you haven’t I could state you’re not a real sys admin but in reality I’m just envious of your avoidance skills Smile.

You don’t want to fight NTFS/ACLs, but if it can’t be avoided you need the best possible knowledge about how it works and the best possible tools to get the job done (in that order).

If you have not heard of SetACL or DelProf2, you might also not have heard of uberAgent for Splunk, let alone of their creator, community rock star Helge Klein. If you new to the business I’ll forgive you but if you been around for a while you have to get to know these tools. His admin tools, both the free or the paying ones, are rock solid and come in extremely handy in day to day work. When the shit hits the fans they are priceless.

Helge is an extremely knowledgeable, experienced, talented and creative IT Professional and developer. I’ve met him a couple of times (E2EVC, where he’s an appreciated speaker) and all I can say is that on top of all that, he’s a great guy, with heart for the community.

Having the free SetACL.exe available for scripting of NTFS permissions is a luxury I cannot do without anymore. On top of that for a very low price you can buy SetACL Studio. This must be the most efficient GUI tool for managing NFTS permissions / ACLs I have ever come across.

Not long ago I was faced with a MBR to GPT LUN migration on a very large file server. It’s the proverbial file server from hell. We’ve all been there too many times and even after 15 years plus we still cannot get people to listen and follow some best practices and above all the KISS principle. So you end up having to deal with the fall out of every political, organizational, process and technical mistake you can imagine when it comes to ACLs & NTFS permissions. So what did I reach for? SetACL.exe and SetACL Studio, these are my go to tools for this.

image

Check out the web page to read up on what this tool can do for you. It very easy to use, intuitive and fast. It can do ACL on file systems, registry, services, printers and even WMI. It helps you deal with granting ownership and rights without messing up the existing NTFS permissions in an easy way. It works on both local and remote systems. Last but not least it has an undo function, how cool is that?!  Yup and admin tool that let you change your mind. Quite unique.

As an MVP I can get a license for free form Helge Klein but I recommend any IT Pro or consultant to buy this tool as it makes a wonderful addition to anyone’s toolkit, saving countless of hours, perhaps even days. It pays itself back within the 15 minutes you use it.

Other useful tools in your toolkit are http://www.editpadlite.com/ as it can handle the large (550-800 MB) log files RoboCopy can produce and some PowerShell scripting skills to parse these files.

Windows 2012 R2 Data Deduplication Leverages Shadow Copies: “LastOptimizationResultMessage : A volume shadow copy could not be created or was unexpectedly deleted”.


When you’re investigation and planning large repositories for data (backups, archive, file servers, ISO/VHD stores, …) and you’d like to leverage Windows Data Deduplication you have too keep in mind that the maximum supported size for an NTFS volume is 64TB. They can be a lot bigger but that’s the maximum supported. Why, well they guarantee everything will perform & scale up to that size and all NTFS functionality will be available. Functionality on like volume shadow copies or snapshots. NFTS volumes can not be lager than 64TB or you cannot create a snapshot. And guess what data deduplication seems to depend on?

Here’s the output of Get-DedupeStatus for a > 150TB volume:

image

Note “LastOptimizationResultMessage      : A volume shadow copy could not be created or was unexpectedly deleted”.

Looking in the Deduplication even log we find more evidence of this.

image

Data Deduplication was unable to create or access the shadow copy for volumes mounted at "T:" ("0×80042306"). Possible causes include an improper Shadow Copy configuration, insufficient disk space, or extreme memory, I/O or CPU load of the system. To find out more information about the root cause for this error please consult the Application/System event log for other Deduplication service, VSS or VOLSNAP errors related with these volumes. Also, you might want to make sure that you can create shadow copies on these volumes by using the VSSADMIN command like this: VSSADMIN CREATE SHADOW /For=C:

Operation:

   Creating shadow copy set.

   Running the deduplication job.

Context:

   Volume name: T: (\\?\Volume{4930c926-a1bf-4253-b5c7-4beac6f689e3}\)

Now there are multiple possible issues that might cause this but if you’ve got a serious amount of data to backup, please check the size of your LUN, especially if it’s larger then 64TB or flirting with that size. It’s temping I know, especially when you only focus on dedup efficiencies. But, you’ll never get any dedupe results on a > 64TB volume. Now you don’t get any warning for this when you configure deduplication. So if you don’t know this you can easily run into this issue. So next to making sure you have enough free space, CPU cycles and memory, keep the partitions you want to dedupe a reasonable size. I’m sticking to +/- 50TB max.

I have blogged before on the maximum supported LUN size and the fact that VSS can’t handle anything bigger that 64TB here Windows Server 2012 64TB Volumes And The New Check Disk Approach. So while you can create volumes of many hundreds of TB you’ll need a hardware provider that supports bigger LUNs if you need snapshots and the software needing these snapshots must be able to leverage that hardware VSS provider. For backups and data protection this is a common scenario. In case you ask, I’ve done a quick crazy test where I tried to leverage a hardware VSS provider in combination with Windows Server data deduplication. A LUN of 50TB worked just fine but I saw no usage of any hardware VSS provider here. Even if you have a hardware VSS provider, it’s not being used for data deduplication (not that I could establish with a quick test anyway) and to the best of my knowledge I don’t think it’s possible, as these have not exactly been written with this use case in mind. Comments on this are welcome, as I had no more time do dig in deeper.

Virtualizing Intensive Workloads on Hyper-V, Can It Be Done?


Can it be done?

All I can say is that, yes, absolutely, you can virtualize resource intensive workloads. Done right you’ll gain all benefits associated with virtualization and you won’t lose your performance & scalability.

Now I have to stress done right. There are a couple of major causes of problems with virtualization. So let’s look at those and see how a few well placed torpedoes can sink your project fast & effective.

Common Sense

One of them is the lack of common sense. If you currently have 10 SQL Servers with 12 15K RPM SAS Disks in RAID 1 and RAID 10 for the OS, TempDB, Logs & Data files, 64 GB of Memory, dual Quad Core sockets and teamed 1Gbps for resilience and throughput and you want to virtualize them you should expect to deliver the same resources to the virtualized servers. It’s technology people. Hoping that a hypervisor will magically create resources out of thin air is setting yourself up for failure. You cannot imagine how often people use cheap controllers, less disk or slower disks, less bandwidth or CPU cycles and then dump their workload on it. Dynamic memory, NUMA awareness, Storage QoS, etc. cannot rescue a undersized, ill conceived solution. I realize you have read that most physical servers are sitting there idle and let their resources go to waste. If you don’t measure this you can get bitten. You can get ripped to pieces when you’re dealing with virtualizing intensive workloads on Hyper-V based on assumptions.

image

Consider the entire stack

The second torpedo is not understanding the technology stack. The integration part of things or the holistic approach in management consulting speak. The times one could think as a storage admin, network admin, server admin, virtualization admin, SQL DBA, Exchange Engineer is long gone. Really, long gone. You need to think about the entire stack. Know your bottle necks, SPOF, weaknesses, capabilities and how these interact. If you’re still on premise for 100% that means you have to be a datacenter admin, not forgetting you might have multiple of those. And you’d better communicate a bit through DevOps to make sure the developers know that all those resources are not magically super redundant, are not continuously available without any limitation and that these do not have infinite scalability.

image

 

Drivers, firmware & bugs can sink your project

Hardware, VAR & ISV support is also a frequent cause of problems. They’ll al tell you that everything is supported. You can learn very fast and very painfully that this is too often not the case or serious bugs are wreaking havoc on your beautiful design. So I live by one of my mantras: “Trust but verify”. However sad it may be, you cannot in good faith trust OEM, VAR and ISVs. I’m not saying they are willfully doing this, but their experience, knowledge isn’t perfect & complete either. You have to do your due diligence. There are too many large scale examples of this right now with Emulex NIC issues around DVMQ. This is a prime example of how you slow acknowledgement of a real issue can ruin your virtualization project for intensive workloads and has been doing so for 9 months and might very well take a year to resolve. Due diligence could have saved you here. A VAR should protect its customers from that, but in reality they often find out when it’s too late. Another example is bugs in storage vendors implementation of ODX causing corruption or extremely slow support for a new version of Windows effectively blocking the use of it in production when you need it for the performance & scalability. I have long learned that losing customers and as such revenue is the only real language vendors understand. So do not be afraid to make hard decisions when you need to.

image

Knowledge & Due Diligence

Know your hypervisor and core technologies well. Don’t think it’s the same a hardware based deployments, don’t think all options and features work everywhere for everything, don’t think all hypervisor work the same. They do not. Know about Exchange and the rules/limits around virtualization. The same goes for SQL Server and any resource intensive workload you virtualize. Don’t think that the same rules apply to all workloads. There is no substitute for knowledge, experience and hands on testing, the verification part of trust but verify, remember? It goes for you as well!

image

It can be done

Yes, we can Winking smile! If you want to see some high level examples to simulate your appetite just browse my blog. Here are some pointers to get you started.

Unmap

 

 

Live migration at the speed of light

Remember , don’t just say “Damn those torpedoes, full speed ahead” but figure out why, where, when and how you’ll get the job done.

I’m Not Your (FREE) Personal Assistant


Volunteering in the community

As active community member and MVP I spend a lot of time and effort sharing information and experiences with the community. I also assist colleagues & peers across the globe when they have questions or issues I might be able to help them with. It’s part of sharing and caring. Just like my fellow community members & MVPs I blog, record video’s, web & screen casts, present at conferences & user groups. I hang out for the Ask The Experts moments of opportunity at both local and international. When possible I also attend the ChalkTalks nights like the one that local user group WinTalks organizes where people can bring their questions or problems to discuss.

The impossibility of answering the questions

I share a lot of information, ideas, opinions and experiences. Asking me directly, repeatedly, to give you quick & fast solutions for your current issues, problems and consulting challenges is not the way to go however. For one the complexity of the issues and the situation as exists is often ignored in these question. So it’s impossible to answer them in that fashion.

Also, as is the case with most of us, I’m a very, very busy man. A tremendous amount of knowledge many of my peers and I share is freely available to the community and we absolutely love doing that. If you ask a question on a blog post or contact me I will try and answer if it’s not too much work & is relevant to the blog post. It benefits everyone to see the question and the answer. But for real support you have forums and vendors service desks that are a lot better suited and have dedicated staff or thousands of volunteer eyes. For consulting engagements to solve the complex issues you’re running into you’ll just have to hire the expertise or make me an offer way too good to decline. When hiring expertise, you do get what you pay for if you do it smart. I’m not to blame and will not pay the bill for your previous bad hires, pseudo experts, marketing based decisions that got people into a pickle.

Keeping it real

We all have jobs with lots of work that we need to do to pay the bills. So we cannot be a free support desk, ad interim engineer, consultant or strategic advisor. This means e-mails and DMs with consulting questions or easily searchable questions are ignored unless the problem is personally interesting to me as a learning experience or it’s indeed “the opportunity of a life time”. The latter is highly unlikely.

You need to realize that you need to design your solutions to whatever level of complexity you can handle or afford. Many make this mistake. I understand all the issues around acquiring, building, maintaining, retaining & hiring expertise. Really I do, I do not live under a rock in the wilderness. It’s hard to find expertise and it’s hard to market expertise. So basically we end up with “best practices” & partially mediocrity. For good reason, that’s where you have to be and stay if you’re not willing/capable to pay for expertise. For a lot of commodity solutions that’s how it should be.

If you need better support & consultants than you currently have you should really consider hiring some of my fellow MVPs via their companies but don’t be surprised to be paying anything from € 200/hour and up for proven highly skilled experts for short very specialized assignments. Don’t balk at this, Ever hired MCS? Or a plumber? Right, these people are true consultants, not what passes for them nowadays but what is actually contracting or body shopping. Nothing wrong with temporary augmentation of your labor force, but is not high expertise consulting. Microsoft PFE/MCS aren’t expensive for the value they provide and the time and effort they put in. Next time you need to pay a plumber after a DIY project has gone wrong you’ll realize this.

You don’t have to engage experts. But if you do, you’ll need to bring a big wallet. You need to understand that your unwillingness to pay does not dictated rates, let alone value. Banks, doctors, shops, government … they only accept money and they laugh at me when I tell them I’d like to pay with some ones else’s gratitude.

Some of the people in my network know I have helped many in the past and know that I do this as a service to the community and learning experience. That benefits everyone out there, just like I benefit from them. That’s my choice, in my personal free time. I can assure you that neither those people or I  take this sort of help for granted, let alone demand it.

I can’t fix you being stupid, lazy, cheap or any combination of the above.

  • You’ll have to do your own searching of the internet via Bing or Google for you.
  • You’ll have to read the articles, blog & documentation.
  • You’ll have to analyze your own issues and come up with an plan of action.
  • You need to realize that developing yourself and skillsets is a time consuming, sustained effort. I understand you have other priorities, but that doesn’t mean I have to pick up the slack and put my own aside.
  • You’ll need to face reality. If your business needs something, they’ll need to make sure they are profitable enough to afford it.

Setting Up A Uplink (Trunk/General) With A Dell PowerConnect 2808 or 28XX


Introduction

I was deploying a bunch of PowerConnect 2808 switches that needed to provide connectivity to multiple VLANs  (Training, Guest, …)  in a class rooms. I should have figured it out before I got there with my “assumption” based quick configuration loaded on the switches if I had just refreshed my insights in how the PowerConnect family of switches work.

image

So before we go on, here are the basics on switch port (or LAG) modes in the PowerConnect family. Please realize that switch behavior (especially for trunk mode in this context) has changed over time with more recent switches/firmware. But the current state of affairs is as follows (depending on what model & firmware you have behavior differs a bit).You can put your port or LAG in the following 3 (main) modes:

Access: The port belongs to a single untagged VLAN. When a port is in Access mode, the packet types which are accepted on the port cannot be designated. Ingress filtering cannot be enabled/disabled on an access port. So only untagged received traffic is allowed and all transmitted traffic is untagged. The setting of the port determines the VLAN of traffic. Tagged received traffic is dropped. Basically, this is what you set your ports for client devices to (printer, PC, laptop, NAS).

Trunk: In older versions this means that ALL transmitted traffic is tagged.  That’s easy. Tagged received traffic is dropped if doesn’t belong to one of the defined VLAN on the trunk. In more recent switches/firmware untagged received traffic is dropped but for one VLAN, that can be untagged and still be received. Which is nice for the default VLAN and makes for a better compatibility with other switches.

General: You determine what the rules are. You can configure it to transmit tagged or untagged traffic per VLAN. Untagged received traffic is accepted and the PVID determines the VLAN it is tagged with.  Tagged received traffic is dropped if doesn’t belong to one of the defined VLANs.

Also see this DELL link PowerConnect Common Questions Between Access, General and Trunk mode

The PowerConnect 28XX Series

These  are good switches for their price point & use cases. Just make sure you buy them for the right use case. There is only one thing I find unforgiving in this day and age: the lack of SSH/HTTPS support for management.

Go ahead fire up a 2808 and take a look at the web interface and see what you can configure. In contrast with the PC54XX/55XX etc. Series you cannot set the port mode it seems. So how can this switch accommodate trunks/general/access modes at all. Well it’s implied in the configuration of ports that seem to be set in general mode by default and you cannot change that. The good news is that with the right setting a port in general mode behaves like a port in access or trunk mode. How? Well we follow the rules above.

So we assume here that a port is in general mode (can’t be changed). But we want trunk mode, so how do we get the same behavior? Let’s look at some examples in speudo CLI. (It’s web GUI only device).

Example 1: Classic Trunk = only defined tagged traffic is accepted. All untagged traffic is dropped

switchport mode trunk
switchport trunk allowed vlan add 9, 20

So we can have the same behavior is general mode using

switchport mode general
switchport general allowed vlan add 9, 20 tagged
switchport general pvid 4095   

The PVID  of 4095 is the industry standard discard VLAN, it assign this VLAN to all untagged traffic which is dropped. Ergo this is the same as the trunk config above!

Example 2: Modern Trunk = only defined tagged traffic and one untagged VLAN is accepted

switchport mode trunk
switchport trunk allowed vlan add 9, 20
switchport trunk allowed vlan add 1 untagged

So we can have the same behavior is general mode using

switchport mode general
switchport general allowed vlan add 9, 20 tagged
switchport general pvid 1  

This example is what we needed in the classroom. And is basically what you set with the GUI. So far so good. But we ran into an issue with connectivity to the access ports in VLAN 9 and VLAN 20. Let’s look at that in the next Example

Example 3: Access port mode = only one untagged VLAN is accepted

switchport mode access
switchport access vlan 9

Switchport mode general
switchport general allowed vlan add 9 untagged
switchport general pvid 9

If you’re accustomed to the higher end PC switches you define the port in access mode and add the VLAN of you choice untagged. That’s it. Here the mode is general and can’t be changed meaning we need to set the PVID to 9 so all untagged traffic is indeed tagged with VLAN 9 on the port.

Setting Up an uplink between a PowerConnect 5548 and a 2808

Here’s the normal deal with higher range series of PowerConnect switches: you normally use the port mode to define the behavior and in our case we could go with a trunk or general mode. We use trunk, leave the native VLAN for the one untagged VLAN and add 9 and 20 as tagged VLANs.

The “trunk” port of LAG is left on the default PVID

image

So an “access” port for VLAN 9 is is achieved by setting the PVID to 9

image

And an “access” port for VLAN 20 is achieved by setting the PVID to 20

image

While the VLAN  membership settings are what you’d expect them to be like on the higher end PowerConnect models:

VLAN 1 (native)

image

VLAN 9 (Corp)

image

VLAN 20 (Guest)

image

If it’s the first time configuring a PC2808 you might  totally ignore the fact that needed to do some extra work to make traffic flow. So to recap what you need to do  As described above there is no selection of access/general/trunk … on a PowerConnect 2808. The port or the lag is “implicitly” set to general and the extra settings of the PVID and adding tagged/untagged VLANs will make it behave as general, trunk or access.

  • The trick is to set any other VLAN than the default 1 to tagged on the port or LAG you’ll use as uplink. So far things are quite “standard PowerConnect”.
  • You set the VLAN membership of your “access” ports to untagged to the VLAN you want them to belong to.
  • After that in on the “access” ports you set the PVID to the VLAN you want the port to belong to. If you do not do this the port still behaves as if it’s a VLAN 1 port. It will not get a DHCP address for that VLAN but for for the the one on VLAN 1 if there  is one, or, if you use a static IP address for the subnet of a VLAN on that port you won’t have connectivity as it’s not set to the right VLAN.

The reason we used the PowerConnect 2808 series here is that we needed silent ones (passive cooling) and they need multiple ones in the training rooms to avoid to many cables running around the place. That was the 2 minutes at the desk of the project managers quick fix to a changed requirement. The real solution of cause would have been to get 24+ outlets to the room in the correct places and add 24+ ports to the normal switch count in the hardware analysis for the building solution. But after the facts you have to roll with the flow.

The Hyper-V Amigos Showcast Episode 4: TechEd North America 2014


In episode 4 the original Hyper-V amigos (also 4) get together for a chat. Yes, learn about the history of the name and about the what happened at TechEd North America 2014. How Aidan won speaker idol. How I got to be on stage.

image

Hans is a bit tired but extremely happy due to a certain soccer game outcome Smile. The orange shirt is not by accident. We discuss the keynote, the content, Azure announcements … we jump into one of our favorite topics storage and storage spaces and speculate a bit about vNext timing.

Enjoy!

Live Migration Speed Check List – Take It Easy To Speed It Up


When configuring live migrations it’s easy to go scrounge on all the features and capabilities we have in Windows Server 2012 R2.

There is no one stopping you configuring 50 simultaneous live migrations. When you have only one, two or even four 1Gbps NICs at your disposal,  you might stick to 1 or 2 VMs per available 1Gbps. But why limit yourself if you have one or multiple 10Gbps pipes or bigger ready to roll? Well let’s discuss a little what happens when you do a live migration on a Hyper-V cluster with CSV storage. Initiating a live migrations kicks of a slew of activities.

  1. First it is establish form where (aka the source host) to where we are migrating (aka the target host).
  2. Permissions are checked, are we allowed to do this?
  3. Do we have enough memory on the target to do this? If so allocate that memory.
  4. Set up a skeleton VM on the target host that is a perfect copy of the source VM’s  specifications and configure dependencies on the target host.
  5. Let’s see if we can get a network connection set up and running. If that works, we’re cool and can now transfer the memory.
  6. A bitmap is created to track the changes to the memory pages of the source VM’s pages. Each memory page is copied from the source host to the target host VM during which the memory page is marked clean.
  7. As long as the source VM is running memory is changing, which continues to be tracked in the bitmap and as such that page is mapped as dirty over there. In an iterative process this dirty memory is copied over again and so on. This continues until the remaining dirty memory is minimal. This will take longer if the VM is very memory intensive.
  8. The tiniest amount of not yet copied dirty memory is that part of a VMs state that is copied during “black out”. For this to happen the VM on the source host is paused, the remaining state is copied.
  9. A final check is done to confirm all is well and then the virtual machine is resumed on the target host.
  10. Any remains of the VM on the source host are cleaned up.

That’s actually a lot of work and as you can see copying the state is just part of the process. The more bandwidth & the lower the latency we throw at this part of the process becomes less of the total time spent during live migration.

If you can’t fill of just fill the bandwidth of your 10/40/46Gbps pipe or pipes & you operate at line speed, what’s left as overhead? Everything that’s not actual the copy of VM state. The trick is to keep the host busy so you minimize idle time of the network copies. I.e we want to fill up that bandwidth just right but  not go overboard otherwise  the work to manage a large number of multiple live migrations might actually slow you down. Compare it to juggling with balls. You might be very good and fast at it but when you have to many balls to attend to you’ll get into trouble because you have to spread you attention to wide, i.e. you’re doing more context switching that is optimal.

So tweaking the number of simultaneous live migrations to your environment is the last step in making sure a node is drained as fast as possible. Slowing things down can actually speed things up.  So when you get your 10Gbps or better pipes in production it pays of to test a bit and find the best settings for your environment.

Let’s recap all of the live migration optimization tips I have given over the years and add a final word of advice.  Those who have been reading my blog for a while know I enjoy testing to find what works best and I do tweak settings to get best performance and results. However you have to learn and accept that it makes no sense in real life to hunt for 1% or 2% reduction in live migration speeds. You’ll get one off  hiccups that slow you down more than that.

So what you need to do is tweak the things that matter the most and will get you 99% results?

  • Get the biggest pipe you need & can afford. Bigger pipes are always better than lots of aggregated smaller pipes when it come to low latency & high throughput.
  • Choose the best performance settings Hyper-V offers you. You can choose from TCP/IP,Compression, SMB. Ben Armstrong has a blog post on this Faster Live Migration–Which Option Should You Choose? I’d like to add that you can use NIC teaming for live migration as well and prior to Windows Server 2012 R2 that was the only way to aggregate bandwidth. Now you have more options. I prefer SMB but when I don’t have 10Gbps at my disposal I have found that compression really makes a difference. In my home  lab where I have only 1Gbps, the horror, it stopped me from going crazy Smile (being addicted to 10Gbps).

image

  • Optimize the power settings for your server BIOS if you want an extra speed & smoothness with 10Gbps (less so with 1Gbps). Look here An Early Look At Live Migration Over TCP/IP & Multichannel In Windows Server 2012 R2 Preview, the network traffic is a lot more stable, i.e. a flat line!  In Windows 2008 R2 this was a real need for 10Gbps or you’d be stuck at 16% max.
  • Enable Jumbo Frames for another 15-20%. Thanks to Multi Channel I can visualize this now. See also this blog post Live Migration Can Benefit From Jumbo Frames. The pictures say it all!
  • Figure out the best number of simultaneous live migrations in your environments. Well you just read this blog, so now you know.  Start at 4 and experiment upwards. Tune it back down if the speed deteriorates. The “best” number depends on your environment.

If you do these 5 things you’ll have really gotten the best performance out of your infrastructure that’s possible for live migration. Bar compression, which is not magic either but reducing the GB you need to transport at the cost of CPU cycles, you just cannot push more than 1.25GB/s trough a single 10Gbps pipe and so on. You might keep looking to grab another 1% or 2% improvement left and right  but might I suggest you have more pressing issues to attend to that, when fixed are a lot more rewarding? Knocking 1 or 2 seconds of a 100 second host evacuation is not going to matter, it’s a glitch. Stop, don’t over engineer it, don’t IBM it, just move on. If you don’t get top performance after tweaking these 5 settings you should look at all the moving parts involved between the host as the issue is there (drivers, firmware, cables, switch configurations, …) as you have a mistake or problem somewhere along the way.

DELL Has Great Windows Server 2012 R2 Feature Support – Consistent Device Naming–Which They Help Develop


The issue

Plug ‘n Play enumeration of devices has been very useful for loading device drivers automatically but isn’t deterministic. As devices are enumerated in the order they are received it will be different from server to server but also within the system. Meaning that enumeration and order of the NIC ports in the operating system may vary and “Local Area Connection 2” doesn’t always map to port 2 on the  on board NIC. It’s random. This means that scripting is “rather hard” and even finding out what NIC matches what port is a game of unplugging cables.

Consistent Device Naming is the solution

A mechanism that has to be supported by the BIOS was devised to deal with this and enable consistent naming of the NIC port numbering on the chassis and in the operating system.

But it’s even better. This doesn’t just work with on board NICs. It also works with add on cards as you can see. In the name column it identifies the slot in which the card sits and numbers the ports consistently.

In the DELL 12th Generation PowerEdge Servers this feature is enabled by default. It is not in HP servers for some reason, you need to turn in it on manually.

I first heard about this feature even before Windows Server 2012 Beta was released but as it turns out Dell has been involved with the development of this feature. It was Dell BIOS team members that developed the solution to consistently name network ports and had it standardized via PCI SIG.  They also collaborated with Microsoft to ensure that Windows Server 2012 would support all this.

Here’s a screen shot of a DELL R720 (12th Generation PowerEdge Server) of ours. As you can see the Consistent Device Naming doesn’t only work for the on broad NIC card. It also does a fine job with add on cards of which we have quite a few in this server.image

It clearly shows the support for Consistent Device Naming for the add on cards present in this server. This is a test server of ours (until we have to take it into production) and it has a quad 1Gbps Intel card, a dual Intel X520 DA card and a dual port Mellanox 10Gbps RoCE card. We use it to test out our assumptions & ideas. We still need a Chelsio iWarp card for more testing mind you Winking smile

A closer look

This solution is illustrated the in the “Device Name column” in the screen shot below. It’s clear that the PnP enumerated name (the friendly name via the driver INF file) and the enumerated number value are very different from the number in Name column ( NIC1, NIC2, NIC2, NIC4) even if in this case where by change the order is correct. If the operating system is reinstalled, or drivers changed and the devices re-enumerated, these numbers may change as they did with previous operating systems.

image

The “Name” column is where the Consistent Device Naming magic comes to live. As you can see you are able to easily identify port names as they are numbered consistently, regardless of the “Device Name” column numbering and in accordance with the numbering on the chassis or add on card. This column name will NEVER differ between identical servers of after reinstalling a server because it is not dependent on PnP. Pretty cool isn’t it! Also note that we can rename the Name column and if we choose we can keep the original name in that one to preserve the mapping to the physical hardware location.

In the example below thing map perfectly between the Name column and the Device Name column but that’s pure luck.image

On of the other add on cards demonstrates this perfectly.image

Fixing Two Small DELL Compellent Hardware Hiccups


Here’s two little tips to solve some small hardware issues you might run into with a Compellent SAN. But first, you’re never on your own with CoPilot support. They are just one phone call away so I suggest if you see these to minor issues you give them a call. I speak from experience that CoPilot rocks. They are really good and go the extra mile. Best storage support I have ever experienced.

Notes

  • Always notify CoPilot as they will see the alerts come in and will contact you for sure Smile. Afterwards they’ll almost certainly will do a quick health check for you. But even better during the entire process they keep an eye on things to make sure you SAN is doing just fine. And if you feel you’d like them to tackle this, they will send out an engineer I’m sure.
  • Note that we’re talking about the SC40 controllers & disk bays here. The newer genuine DELL hardware is better than the super micro ones.

The audible alert without any issues what so ever

We kept getting an audible alert after we had long solved any issues on one of the SANs. The system had been checked a couple of times and everything was in perfect working order. Except for that audible alarm that just didn’t want to quit. A low priority issue I know but every time we walk into the data center we were going “oh oh” for a false alert. That’s not the kind of conditioning you want. Alerts are only to be made when needed and than they do need to be acted upon!

Working on this with CoPilot support we got rid of it by reseating the upper I/O module. You can do this on the fly – without pulling SAS-cables out or so, they are redundant, as long as you do it one by one and the cabling is done right (they can verify that remotely for you if needed).

image

But we got lucky after the first one. After the “Swap Clear” was requested  every warning condition was cleared and we got rid of the audible alert beep!  Copilot was on the line with us and made sure all paths are up and running so no bad things could happen. That’s what you have a copilot for.

Front panel display dimming out on a Compellent Disk Bay

We have multiple Compellent SANs and on one of those we had a disk bay with a info panel that didn’t light up anymore. A silly issue but an annoying one as this one also show you the disk bay ID.

image

Do we really replace the disk bay to solve this one? As that light had come on and of a couple of time it could just be a bad contact so my colleague decided to take a look. First  he removed the protective cover and then, using some short & curved screw drivers, he took of the body part. The red arrow indicates the little latch that holds the small ribbon cable in place.

image

That was standing right open. After locking that down the info appeared again on the panel. The covers was screwed on again and voila. Solved.