A Reality Check On Disaster Recovery & Business Continuity


Introduction

Another blog post in “The Dilbert Life Series®” for those who are not taking everything personally. Every time business types start talking about business continuity, for some reason, call it experience or cynicism, my bullshit & assumption sensors go into high alert mode. They tend to spend a certain (sometimes considerable) amount of money on connectivity, storage, CPUs at a remote site, 2000 pages of documentation and think that covers just about anything they’ll need. They’ll then ask you when the automatic or 5 minute failover to the secondary site will be up and running. That’s when the time has come to subdue all those inflated expectations and reduce the expectation gap between business and IT as much as possible. It should never have come to that in the first place.

But in this matter business people & analysts alike often read (or are fed) some marchitecture docs with a bunch of sales brochures which make it all sound very easy and quickly accomplished. They sometimes think that the good old IT department is saying “no” again just because they are negative people who aren’t team players and lack the necessary “can do” attitude in a world where their technology castle is falling down. Well, sorry to burst the bubble, but that’s not it. The world isn’t quite that black and white. You see, the techies have to make it work and they’re the ones who have to deal with reality. Combine the above with a weak and rather incompetent IT manager bending over to the business (i.e. promising them heaven on earth) to stay in their good graces and it becomes a certainty they’re going to get a rude awakening. Not that the realities are all that bad. Far from it, but the expectations can be so high and unrealistic that disappointment is unavoidable.

The typical flow of things

The business is under pressure from peers, top management, government & regulators to pay attention to disaster recovery. This inevitably leads to an interest in business continuity. Why? Well, we’re in a 24/7 economy and your consumer right to buy a new coffee table online at 03:00 AM on a Sunday night is worth some effort. So if we can do it for furniture, we should certainly have it for more critical services. The business will hear about possible (technology) solutions and would like to see them implemented. Why wouldn’t they? It all sounds effective and logical. So why aren’t we all running off and doing it? Is it because IT is a bunch of lazy geeks playing FPS games online rather than working for their mythically high salaries? How hard can it be? It’s all over the press that IT is a commodity, easy, fast, dynamic and consumer driven, so “we” the consumers want our business continuity now! But hey, it costs money, time and a considerable, sustained effort, and we have to deal with less than optimal legacy applications (90% of what you’re running right now).

Realities & 24/7 standby personnel

The acronyms & buzzwords the business comes up with after attending some tech briefing by vendors Y & Z (those are a bit like infomercials, but without the limited value those might have) can be quite entertaining. You could say these people at least pay attention to the consumerized business types. Well, actually they don’t, but they do smell money, and lots of it. Technically they are not lying. In a perfect world things might work like that … sort of, sometimes, and maybe even when you need it. But will it really work well and reliably? Sure, that’s not the vendor’s fault. He can’t help that the cool “jump off a cliff” boots he sold you got you killed. Yes, they are designed to jump off a cliff, but anything above 1 meter without other precautions and technologies might cause bodily harm or even death. But gravity and its effects, in combination with the complexity of your business, are beyond the scope of their product solutions and entirely your responsibility. Will you be able to cover all those aspects?

Also, don’t forget the people factor. Do you have the right people & skill sets at your disposal 24/7 for that time when disaster strikes? Remember, that could be on a hot summer weekend night when they’re enjoying a few glasses of wine at a BBQ party, and not at 10:15 AM on a Tuesday morning.

So what terminology flies around?

They hear about asynchronous or even synchronous replication of storage and applications. Sure, it can work within a data center, depending on how well it is designed and set up. It can even work between data centers, especially for applications like Exchange 2010. But let’s face it, the technical limitations and the lack of support for this in many legacy applications will hinder this considerably.

They hear of things like stretched clusters and synchronous storage replication. Sure, they’ll sell you all kinds of licensed features to make this work at the storage level, with a lot of small print. Sometimes even at the cost of losing the functionality that made the storage interesting in the first place. At the network level, anything below layer 3 probably suffers from too much optimism. Sure, stretched subnets seem nice, but … how reliable are these solutions in real life?

Consider the latency and the less reliable connectivity. You can and will lose the link once in a while. With active-active or active-passive data centers that depend on each other, both become single points of failure. And then there are all the scenarios where only one part of the entire technology stack that makes everything work fails. What if the application clustering survives but not the network, the storage or the database? You’re toast anyway. Even worse, what if you get into a split brain scenario and have two sides writing data? Recover from that one, my friend; there’s no merge process for that, only data recovery. What about live migration or vMotion (state, storage, shared nothing) across data centers to avoid an impending disaster? That’s a pipe dream at the moment, people. How long can you afford for this to take, even if your link is 99.999% reliable? Chances are that in a crisis things need to happen fast to avoid disaster, and guess what, even in the same data center, during normal routine operations, we’re leveraging <1ms latency 10Gbps pipes for this. Are we going to get solutions that are affordable and robust? Yes, and I think the hypervisor vendors will help push the entire industry forward, judging by what is happening in that space, but we’re not in Valhalla yet.
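To put a number on the latency argument, here is a back-of-the-envelope sketch (the round-trip times are illustrative assumptions, not measurements): every synchronously replicated write has to wait for at least one round trip to the partner site before it can be acknowledged.

```python
# Illustrative sketch: round-trip time caps the synchronous write rate
# of a single outstanding I/O stream, since each write has to wait for
# the remote acknowledgement before it completes.
def max_sync_write_iops(rtt_ms: float) -> float:
    """Upper bound on acknowledged writes per second for one stream."""
    return 1000.0 / rtt_ms

print(max_sync_write_iops(0.5))  # <1 ms in-datacenter link
print(max_sync_write_iops(5.0))  # an assumed 5 ms inter-site link
```

A tenfold jump in latency means a tenfold drop in what a single stream can push through, before you even factor in link loss or retransmits.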

Our client server application has high availability capabilities

There are those “robust and highly available application architectures” (ahem) that only hold true if nothing ever goes wrong or happens to the rest of the universe. “Disasters” such as the server hosting the license dongle being rebooted for patching. Or, heaven forbid, your TCP/IP connection dropping some packets due to high volume traffic. No, we can’t do QoS at the individual application level, and even if we could it wouldn’t help. If your line of business software can’t handle a WAN link without serious performance impact or errors due to a dropped packet, it was probably written and tested on <1ms latency networks against a database with only one active connection. It wasn’t designed, it was merely written. It’s not because software runs on an OS that can be made highly available and uses a database that can be clustered that the application has any high availability, let alone business continuity, capabilities. Why would that application be happy switching over to another link? A link that is possibly further away, running on fewer resources and quite possibly against less capable storage? For your apps to work acceptably in such scenarios you would already have to redesign them.

You must also realize that a lot of acquired and home-written software has IP addresses in configuration files instead of DNS names. Some even have IP addresses in code. Some abuse local hosts files to deal with hardcoded DNS names … There are tons of very bad practices out there running in production. And you want business continuity for that? Not just disaster recovery, to be clear, but business continuity, preferably without dropping a beat. Done any real software and infrastructure engineering in your lifetime, have you? Keeping a business running often looks like a MacGyver episode: lots of creativity, ingenuity, super glue, wire, duct tape and a Swiss army knife or multi tool. This is still true today; it doesn’t sound cool to admit to it, but it needs to be said.
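If you want to gauge how deep that hole is in your own shop, a first sweep can be as simple as this sketch (the file extensions are assumptions for illustration; adjust them to whatever your applications actually use):

```python
import re
from pathlib import Path

# Naive IPv4 pattern - deliberately loose, good enough for an inventory sweep.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_hardcoded_ips(root, patterns=("*.config", "*.ini", "*.xml")):
    """Return (file, ip) pairs for every IPv4-looking string in config files."""
    hits = []
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            for ip in IP_RE.findall(path.read_text(errors="ignore")):
                hits.append((str(path), ip))
    return hits
```

Expect false positives (version numbers match too); the point is to show just how much of the estate depends on addresses that won’t follow a failover.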

We can make this work with the right methodologies and strict processes

Next time you think that, go to the top floor and jump off, adhering to the flight methodologies and strict processes that rule aerodynamics. After the loud thud of you hitting the deck, you’ll be nothing more than a pool of human waste. You cannot fly. On top of unrealistic scenarios, things change so fast that documentation and procedures are very often out of date as soon as they are written.

Next time some “consultants” drop in selling you products & processes with fancy acronyms, proclaiming that rigorous adherence to these will save the day, consider the following. They make a bold assumption given the fact that they don’t know even 10% of the apps and processes in your company. Even bolder because they ignore the fact that what they discover in interviews often barely scratches the surface. People can only tell you what they actually know or dare tell you. On top of that, any discovery they do with tools is rather incomplete. If the job consists of merely pushing processes and methodologies around without reality checks, you could be in for a big surprise. You need a holistic approach here, otherwise it’s make believe. It’s a bit like paratrooper training for night drops over enemy strongholds, to attack those and bring ‘em down. Only the training is done in a heated classroom, during meetings and on a computer. They never put on all their gear, let alone jump out of an aircraft in the dead of night, regroup, hump all that gear to the rally points and engage the enemy in a training exercise. Well people, you’ll never be able to pull off business continuity in real life either if you don’t design and test properly and keep doing that. It’s fantasy land. Even in the best of circumstances no plan survives its first contact with the enemy, and basically you would be doing the equivalent of a trooper firing his rifle for the very first time at night during a real engagement. That’s assuming you didn’t break your neck during the drop, didn’t get lost and managed to load the darn thing in the first place.

You’re a pain in the proverbial ass to work with

Am I being too negative? No, I’m being realistic. I know reality is a very unwelcome guest in fantasy land, as it tends to disturb the feel good factor. Those pesky details are not just silly technological “manual labor” issues, people. They’ll kill your shiny plans and waste tremendous amounts of money and time.

We can have mission critical applications protected and provide both disaster recovery and business continuity. For that, the entire solution stack needs to be designed for it. While possible, this makes things expensive and is often only a dream for custom written and a lot of off-the-shelf software. If you need business continuity, the applications need to be designed and written for it. If not, all the money and creativity in the world cannot guarantee you anything. At best you get ugly and very expensive hacks bolted onto cheap, not highly available software that poses as “mission critical”.

Conclusion

Seriously people, business continuity can be a very costly and complex subject. You’ll need to think this through. When making assumptions, realize that you cannot go forward without confirming them. We operate by the mantra “assumptions are the mother of all fuckups”, which is nothing more than the age old “trust but verify” in action. There are many things you can do for disaster recovery and business continuity. Do them with insight, know what you are getting into, and maybe forget about doing it without one second of interruption for your entire business.

Let’s say disaster strikes and the primary data center is destroyed. If you can restart and get running again with only a limited amount of work and productivity lost, you’re doing very well. Being down for only a couple of hours, days or even a week will make you one of the top performers. Really! Try to get there first before thinking about continuous availability via disaster avoidance and automatic, autonomous failovers.

One approach to achieve this is what I call “Pandora’s Box”. If a company wants business continuity for its entire stack of operations, you’ll have to leave that box closed and replicate it entirely to another site. When you’re hit with a major, long lasting disaster you eat the downtime and the loss of a certain delta, and fire up the entire box in another location. That way you avoid trying to micromanage its contents. You’ll fail at that anyway. For short term disasters you have to eat the downtime. Deciding when to fail over is a hard decision. Also don’t forget about the process in reverse order. That’s another part of the ball game.

It’s sad to see that more money is spent on consultants & advisers daydreaming than on realistic planning and mitigation. If you want to know why this is allowed to happen there’s always my series on the do’s and don’ts when engaging consultants, Part I and Part II. FYI, the last guru I saw brought into a shop was “convinced” he could open Pandora’s Box and remain in control. He has left the building by now and it wasn’t a pretty sight, but that’s another story.

Belgian TechDays 2013 Sessions Are Online


Just a short heads up to let you all know that the sessions of TechDays 2013 in Belgium are available on the TechNet site. The slide decks can be found at http://www.slideshare.net/technetbelux

In case you want to see my two sessions you can follow these links:

Now there are plenty more good sessions, so I encourage you to browse and have a look. Kurt Roggen’s session on PowerShell is a great one to start with.

Windows Server 2012 NIC Teaming Mode “Independent” Offers Great Value


There, I said it. In switching, just like in real life, being independent often beats the alternatives. In switching, the alternative would be stacking. Windows Server 2012 NIC teaming in switch independent, active-active mode makes this possible. And if you do want or need stacking for link aggregation (i.e. more bandwidth) you might go the extra mile and opt for vPC (Virtual Port Channel a la CISCO) or VLT (Virtual Link Trunking a la Force10 – DELL).

What, have you gone nuts? Nope. Windows Server 2012 NIC teaming gives us great redundancy with even cheaper 10Gbps switches.

What I hate about stacking is that during a firmware upgrade the whole stack goes down, so there’s no redundancy there. On the cheaper switches it also often costs a lot of 10Gbps ports (no dedicated stacking ports). The only way to work around this is to design your infrastructure so you can evacuate the nodes in that rack, so that upgrading the stack doesn’t affect the services. That’s nice if you can do it, but it’s also rather labor intensive. If you can’t evacuate a rack (which has effectively become your “unit of upgrade”) and you can’t afford a vPC or VLT kind of redundant switch configuration, you might be better off running your 10Gbps switches independently and leveraging Windows Server 2012 NIC teaming in switch independent mode, active-active. The only reason not to do so would be the need for bandwidth aggregation in all possible scenarios, which only LACP/Static Teaming can provide, but in that case I really prefer vPC or VLT.

Independent 10Gbps Switches

Benefits:

  • Cheaper 10Gbps switches
  • No potential loss of 10Gbps ports for stacking
  • Switch redundancy in all scenarios if cluster networking is set up correctly
  • Switch configuration is very simple

Drawbacks:

  • You won’t get > 10Gbps aggregated bandwidth in every possible NIC teaming scenario

Stacked 10Gbps Switches

Benefits:

  • Stacking is available with cheaper 10Gbps switches (often at the cost of 10Gbps ports)
  • Switch redundancy (but not during firmware upgrades)
  • Get 20Gbps aggregated bandwidth in any scenario

Drawbacks:

  • Potential loss of 10Gbps ports
  • Firmware upgrades bring down the stack
  • Potentially more “complex” switch configuration

vPC or VLT 10Gbps Switches

Benefits:

  • 100% Switch redundancy
  • Get > 10Gbps aggregated bandwidth in any possible NIC team scenario

Drawbacks:

  • More expensive switches
  • More “complex” switch configuration

So all in all, if you come to the conclusion that 10Gbps is a big enough pipe to serve your needs and aggregating those pipes via teaming is not needed, you might be better off with cheaper 10Gbps switches and Windows Server 2012 NIC teaming in switch independent mode in an active-active configuration. You optimize the 10Gbps port count as well. It’s cheap, it reduces complexity and it doesn’t stop you from leveraging SMB Multichannel/RDMA.
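As a sketch of how little host-side configuration this takes (the team and adapter names below are placeholders, not from our setup), creating such a switch independent, active-active team is a one-liner:

```powershell
# Switch independent, active-active team over two 10Gbps ports.
# "NIC1"/"NIC2" are placeholder adapter names - adjust to your hardware.
New-NetLbfoTeam -Name "Team10G" -TeamMembers "NIC1","NIC2" `
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort
```

No switch-side configuration is needed at all, which is exactly the simplicity benefit listed above.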

So right now I’m either in favor of switch independent 10Gbps networking or I go all out for a vPC (Virtual Port Channel a la CISCO) or VLT (Virtual Link Trunking a la Force10 – DELL) setup and forgo stacking altogether. As said, if you’re willing and able to evacuate all the nodes on a stack/rack you can work around the drawback, but that’s not always possible and, while it sounds like a great idea, I’m not convinced. The colors in the racks indicate the same clusters.


When the shit hits the fan … you need as little to worry about as possible. And yes, I know firmware upgrades are supposed to be easy and planned events. But then there is reality and sometimes it bites, especially when you cannot evacuate the workload until you’ve resolved a networking issue with a firmware upgrade. Choose your poison wisely.

Saying Goodbye To Old Hardware Responsibly


Last year we renewed our SAN storage and our backup systems. They had been serving us for 5 years and were truly end of life, as both technologies were functionally obsolete in the current era of virtualization and private clouds. The timing was fortunate, as we would have been limited in our Windows Server 2012, Hyper-V & disaster recovery plans if we had to keep them going for another couple of years.

Now any time you dispose of old hardware it’s a good idea to wipe the data securely to a decent standard such as DoD 5220.22-M. This holds true whether it’s a laptop, a printer or a storage system.

We did the following:

  • Un-initialize the SAN/VLS
  • Reinitialize the SAN/VLS
  • Un-initialize the SAN/VLS
  • Swap a lot of disks around between SAN/VLS and disk bays in a random fashion
  • Un-initialize the SAN/VLS
  • Create new (mirrored) LUNs, as large as possible.
  • Mount them to a host or hosts
  • Run the DoD-grade disk wiping software against them.
  • That process is completely automatic and goes faster than we were led to believe, so it was not really such a pain to do in the end. Just let it run for a week 24/7 and you’ll wipe a whole lot of data. There is no need to sit and watch progress counters.
  • Un-initialize the SAN/VLS
  • Have it removed by a certified company that assures proper disposal
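The overwrite step at the heart of that list can be sketched in a few lines (a toy illustration against a file; the real DoD-grade tooling verifies each pass and runs against the raw LUNs, not files):

```python
import os

def wipe_file(path, passes=(b"\x00", b"\xff", None), chunk=1024 * 1024):
    """Overwrite a file in place: zeros, ones, then random data (DoD-style)."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for fill in passes:  # None means a random-data pass
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(os.urandom(n) if fill is None else fill * n)
                remaining -= n
            f.flush()
            os.fsync(f.fileno())
```

Run it over every LUN-sized file you mounted and the old data is gone; the multiple passes are exactly why it can safely run unattended for a week.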

We would have loved to take it to a shooting range and blast the hell out of those things but alas, that’s not very practical nor feasible. It would have been very therapeutic for the IT Ops guys who’ve been babysitting the ever faster failing VLS hardware over the last years.

Here are some pictures of the decommissioned systems. Below are the two old VLS backup systems, broken down and removed from the data center, awaiting disposal. It’s cheap commodity hardware with a reliability problem once it’s over 3 years old and way too expensive for what it is. Especially for up and out scaling later in the lifetime cycle, it’s just madness. Not to mention that those things gave us more issues than the physical tape library (those still have a valid and viable role to play when used for the correct purposes). Anyway, I consider this to have been my biggest technology choice mistake ever. If you want to read more about that go to Why I’m No Fan Of Virtual Tape Libraries.

To see what replaced this with great success go to Disk to Disk Backup Solution with Windows Server 2012 & Commodity DELL Hardware – Part II

The old EVA 8000 SANs are awaiting removal in the junk yard area of the data center. They served us well and we’ve been early & loyal customers. But the platform was as dead as a dodo long before HP wanted to admit it. It took them quite a while to get the 3Par ready for the same market segment and I expect that cost them some sales. They’re ready today; they were not 24-12 months ago.


So they’ve been replaced with Compellent SANs. You can read some info on this in previous blog posts: Multi Site SAN Storage & Windows Server 2012 Hyper-V Efforts Under Way and Migration LUNs to your Compellent SAN.

In the next years the storage wars will rage and the landscape will change a lot, but we’re out of the storm for now. We’ll leverage what we’ve got. One tip for all storage vendors: start listening to your SME customers a lot more than you do now and get the features they need into their hands. There are only so many big enterprises, so until we’re all 100% cloudified, don’t ignore us, as together we buy a lot of stuff too. Many SMEs are interested in more optimal & richer support for their Windows environments; if you can deliver that, you’ll see your sales rise. Keep the commodity components, keep the building blocks and form factors, but don’t use a cookie cutter to determine our needs or “sell” us needs we don’t have. Time to market & open communication are important here. We really do keep an eye on technologies, so it’s bad to come late to the party.

Hyper-V Cluster Node Pause & Drain fails – Live Migrations fail with “The requested operation cannot be completed because a resource has locked status”


One night I was doing some maintenance on a Hyper-V cluster and I wanted to pause and drain one of the nodes that was up next for some tender loving care. But I was greeted by this message:


[Window Title]
Resource Status

[Main Instruction]
The requested operation cannot be completed because a resource has locked status.

[Content]
The requested operation cannot be completed because a resource has locked status.

[OK]

Strange, the cluster is up and running, none of the other nodes had issues and operations-wise all VMs were happy as can be. So what’s up? Not too much in the error logs, except for one entry related to a backup. Aha … We fire up diskpart and see some extra LUNs mounted and, using “vssadmin list writers“, we find:


Writer name: ‘Microsoft Hyper-V VSS Writer’
   Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
   Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
   State: [5] Waiting for completion
   Last error: Unexpected error

Bingo! Hello old “friend”, I know you! The Microsoft Hyper-V VSS Writer goes into an error state during the making of hardware snapshots of the LUNs due to almost or completely full partitions inside the virtual machines. Take a look at this blog post on what causes this and how to fix it. As a result we can’t do live migrations anymore or pause/drain the node on which the hardware snapshots are being taken.
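To spot this state without eyeballing the console, you can feed the captured “vssadmin list writers” output through a small parser like this sketch (the function name is mine; it just automates reading the output shown above):

```python
import re

def unstable_vss_writers(vssadmin_output):
    """Return (writer name, state) pairs for writers not in '[1] Stable'."""
    writers = []
    # Each writer block in the output starts with "Writer name: '...'".
    for block in re.split(r"Writer name:", vssadmin_output)[1:]:
        name = re.search(r"'([^']+)'", block)
        state = re.search(r"State:\s*(\[\d+\][^\r\n]*)", block)
        if name and state and "[1] Stable" not in state.group(1):
            writers.append((name.group(1), state.group(1).strip()))
    return writers
```

Anything this returns, like the Hyper-V VSS Writer stuck in “[5] Waiting for completion” above, is a writer that needs attention before your snapshots and live migrations will behave again.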

And yes, after fixing the disk space issue on the VM (an SDT who’d pumped the VM disks 99.999% full) the Hyper-V VSS writer gets out of the error state and the hardware provider can do its thing. After the snapshots had completed everything was fine and I could continue with my maintenance.

PowerShell: Monitoring DrainStatus of a Hyper-V Host & The Time Limited Value of Information In Beta & RC Era Blogs


I was writing some small PowerShell scripts to pause and resume Hyper-V cluster hosts and I wanted to monitor the progress of draining the virtual machines off the node when pausing it. I found a nice blog post about Draining Nodes for Planned Maintenance with Windows Server 2012 discussing this subject and providing us with the properties to do just that.

It seems we have two common properties at our disposal: NodeDrainStatus and NodeDrainTarget.


So I set to work but I just didn’t manage to get those properties to be read. It was like they didn’t exist. So I pinged Jeff Wouters, who happens to use PowerShell for just about anything, and asked him if it was me being stupid and missing the obvious. Well, it turned out to be missing the obvious for sure, as those properties do not exist. Jeff told me to double check using:

Get-ClusterNode MyNode -cluster MyCluster | Select-Object -Property *

Guess what, it’s not NodeDrainStatus and NodeDrainTarget but DrainStatus and DrainTarget.


What put me off here was the following example in the same blog post:

Get-ClusterResourceType "Virtual Machine" | Get-ClusterParameter NodeDrainMoveTypeThreshold

That should have been a dead giveaway, as we’ve been using MoveTypeThreshold a lot in recent months and there is no NodeDrain in that value either. But it just didn’t register. By the way, you don’t need to create the property either, it exists. I guess this code was valid with some version (Beta?) but not anymore. You can just get and set the property like this:

Get-ClusterResourceType "Virtual Machine" -Cluster MyCluster | Get-ClusterParameter MoveTypeThreshold

Get-ClusterResourceType "Virtual Machine" -Cluster MyCluster | Set-ClusterParameter MoveTypeThreshold 2000

So lessons learned: trust but verify. Don’t forget that a lot of things in IT have a time limited value. Make sure to look at the date of what you’re reading and which pre-RTM version of the product the information is relevant to.

To conclude, here’s the PowerShell snippet I used to monitor the draining process.


Suspend-ClusterNode -Name crusader -Cluster warrior -Drain

Do
{
    Write-Host (Get-ClusterNode -Name "crusader" -Cluster warrior).DrainStatus -ForegroundColor Magenta
    Start-Sleep -Seconds 1
}
until ((Get-ClusterNode -Name "crusader" -Cluster warrior).DrainStatus -ne "InProgress")

If ((Get-ClusterNode -Name "crusader" -Cluster warrior).DrainStatus -eq "Completed")
{
    Write-Host (Get-ClusterNode -Name "crusader" -Cluster warrior).DrainStatus -ForegroundColor Green
}

This outputs the DrainStatus value (“InProgress” repeatedly while draining, then “Completed”) to the console.