Fixing Two Small DELL Compellent Hardware Hiccups


Here are two little tips to solve some small hardware issues you might run into with a Compellent SAN. But first, you’re never on your own with CoPilot support. They are just one phone call away, so if you see these two minor issues I suggest you give them a call. I speak from experience: CoPilot rocks. They are really good and go the extra mile. Best storage support I have ever experienced.

Notes

  • Always notify CoPilot, as they will see the alerts come in and will contact you for sure. Afterwards they’ll almost certainly do a quick health check for you. Even better, during the entire process they keep an eye on things to make sure your SAN is doing just fine. And if you’d rather have them tackle this, I’m sure they will send out an engineer.
  • Note that we’re talking about the SC40 controllers & disk bays here. The newer genuine DELL hardware is better than the Super Micro based units.

The audible alert without any issues whatsoever

We kept getting an audible alert on one of the SANs long after we had solved any issues. The system had been checked a couple of times and everything was in perfect working order. Except for that audible alarm that just didn’t want to quit. A low priority issue, I know, but every time we walked into the data center we were going “oh oh” over a false alert. That’s not the kind of conditioning you want. Alerts are only to be raised when needed, and then they do need to be acted upon!

Working on this with CoPilot support we got rid of it by reseating the upper I/O module. You can do this on the fly – without pulling any SAS cables out – as they are redundant, as long as you do it one by one and the cabling is done right (they can verify that remotely for you if needed).

image

We got lucky after the first one. After the “Swap Clear” was requested, every warning condition was cleared and we got rid of the audible alert beep! CoPilot was on the line with us and made sure all paths were up and running so no bad things could happen. That’s what you have a copilot for.

Front panel display dimming out on a Compellent Disk Bay

We have multiple Compellent SANs, and on one of those we had a disk bay with an info panel that didn’t light up anymore. A silly issue but an annoying one, as that panel also shows you the disk bay ID.

image

Do we really have to replace the disk bay to solve this one? As that light had come on and off a couple of times, it could just be a bad contact, so my colleague decided to take a look. First he removed the protective cover and then, using some short & curved screwdrivers, he took off the body part. The red arrow indicates the little latch that holds the small ribbon cable in place.

image

That latch was standing wide open. After locking it down the info appeared again on the panel. The covers were screwed back on and voilà. Solved.

TechNet Top Support Solutions From Microsoft Support Blog


As this year comes to an end I’d like to draw your attention to Microsoft’s new Top Support Solutions blog on TechNet. It was created as part of their continuous efforts to keep the various technical communities informed about the most relevant answers to the top questions or issues experienced with their products. They identify these top issues by analyzing the questions in their forums and their other support channels.

image

So if you need to find answers for yourself or your customers, go take a look at the "Top Solutions Content" blog. Chances are you’ll find valuable information about the Microsoft top support solutions for several of their popular products in Server and Tools. It might save you and your clients or manager a lot of time, effort and money. It’s also a great resource to make your colleagues, community, user group or clients aware of.

DELL Server DRAC Card Soft Reset With Racadm


Sometimes a DRAC goes BOINK

Sometimes a DRAC (Dell Remote Access Card) can give you issues. Sometimes it’s some lingering process or another hiccup that causes this. You can try a reboot, but that doesn’t always fix the issue. You can go into the BIOS and cancel any running System Services. A “confused” DRAC card can also be fixed by shutting down the server and cutting power for 5 to 10 minutes. That’s good to know as a last resort, but often not very feasible, bar a maintenance window when you’re on premises.

You can also try to do a local or a remote reset of the DRAC card via OpenManage (OMSA) or racadm. See RACADM Command Line Interface for DRAC for more information on how and when to use this tool. Racadm can be used for a lot of remote configuration and administration, and one of its functions is a “soft reset”, basically a power cycle, aka reboot, of the DRAC card itself. Don’t worry, your server stays up.

Local: racadm racreset soft

Remote: racadm -r <ip address> -u <username> -p <password> racreset soft
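
If you have a bunch of hosts to get through, you can wrap the remote variant in a quick loop. A minimal PowerShell sketch, assuming racadm.exe is installed and on the path; the DRAC IP addresses and credentials below are placeholders, not real values:

# Soft reset the DRAC on a list of hosts, giving each a few minutes to come back.
$dracs = '10.10.1.101', '10.10.1.102'
foreach ($ip in $dracs) {
    & racadm -r $ip -u root -p 'YourDracPassword' racreset soft
    Start-Sleep -Seconds 180  # the DRAC needs a couple of minutes to reboot
}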

Real life example

I was doing routine maintenance on 4 Hyper-V clusters and as part of that DUPs (Dell Update Packages) were being deployed to upgrade some firmware. This can be automated nicely via Cluster Aware Updating, and the logging option will help you pinpoint the issue. See http://workinghardinit.wordpress.com/2013/01/09/logging-cluster-aware-updating-hotfix-plug-in-installations-to-a-file-share/ for more information on this.

That’s how we found out that the DRAC upgrade was not succeeding on two nodes.

On one node it was due to the DUP not being able to access the Virtual USB Device:

Software application name: iDRAC6
   Package version: 1.95
   Installed version: 1.92

Executing update…

Device does not impact TPM measurements.

Device: iDRAC6, Application: iDRAC6
  Failed to access Virtual USB Device

==================> Update Result <==================

Update was not applied

================================================

Exit code = 1 (Failure)

and the other was because there was some other lingering DRAC process.

 iDRAC is currently unable to process this request because of another task.
  Please attempt one or more of the following steps to cancel the pending iDRAC task:
  1) Wait 30 minutes and retry your request.
  2) Reboot the system; Press F10; select ‘Exit and Reboot’ from Unified Server Configurator, and retry your request.
  3) Reboot the system; Press Ctrl-E; select ‘System Services’. Then change ‘Cancel System Services’ to YES, which will close the pending task;
      Then press Enter at the warning message. Press ESC twice and select ‘Save Changes and Exit’ and retry your request.

==================> Update Result <==================

Update was not applied

================================================
Exit code = 1 (Failure)

They give some nice suggestions, but the racreset is another nice one to have in your toolkit. It’s fast and effective.

Run racadm racreset soft

image

Wait a couple of minutes and then rerun the DUP or the items in SUU that failed. With some luck they will succeed now.

image

A Reality Check On Disaster Recovery & Business Continuity


Introduction

Another blog post in “The Dilbert Life Series®” for those who are not taking everything personally. Every time business types start talking about business continuity, for some reason, call it experience or cynicism, my bullshit & assumption sensors go into high alert mode. They tend to spend a certain (sometimes considerable) amount of money on connectivity, storage, CPUs at a remote site and 2000 pages of documentation, and think that covers just about anything they’ll need. They’ll then ask you when the automatic or 5 minute failover to the secondary site will be up and running. That’s when the time has come to subdue all those inflated expectations and reduce the expectation gap between business and IT as much as possible. It should never have come to that in the first place. But in this matter business people & analysts alike often read (or are fed) some marchitecture docs with a bunch of sales brochures which make it all sound very easy and quickly accomplished. They sometimes think that the good old IT department is saying “no” again just because they are negative people who aren’t team players and lack the necessary “can do attitude” in a world where their technology castle is falling down. Well, sorry to burst the bubble, but that’s not it. The world isn’t quite that black and white. You see, the techies have to make it work and they’re the ones who have to deal with the real world. Combine the above with a weak and rather incompetent IT manager bending over to the business (i.e. promising them heaven on earth) to stay in their good graces, and it becomes a certainty they’re going to get a rude awakening. Not that the realities are all that bad. Far from it, but the expectations can be so high and unrealistic that disappointment is unavoidable.

The typical flow of things

The business is under pressure from peers, top management, government & regulators to pay attention to disaster recovery. This inevitably leads to an interest in business continuity. Why? Well, we’re in a 24/7 economy, and your consumer right to buy a new coffee table online at 03:00 AM on a Sunday night is worth some effort. So if we can do it for furniture we should certainly have it for more critical services. The business will hear about possible (technology) solutions and would like to see them implemented. Why wouldn’t they? It all sounds effective and logical. So why aren’t we all running off and doing it? Is it because IT is a bunch of lazy geeks playing FPS games online rather than working for their mythically high salaries? How hard can it be? It’s all over the press that IT is a commodity, easy, fast, dynamic and consumer driven, so “we” the consumers want our business continuity now! But hey, it costs money, time, a considerable and sustained effort, and we have to deal with the less than optimal legacy applications (90% of what you’re running right now).

Realities & 24/7 standby personnel

The acronyms & buzzwords the business comes up with after attending some tech briefing by vendors Y & Z (those are a bit like infomercials but without the limited value those might have) can be quite entertaining. You could say these people at least pay attention to the consumerized business types. Well actually they don’t, but they do smell money and lots of it. Technically they are not lying. In a perfect world things might work like that … sort of, sometimes, and maybe even when you need it. But will it really work well and reliably? Sure, that’s not the vendor’s fault. He can’t help that the cool “jump off a cliff” boots he sold you got you killed. Yes, they are designed to jump off a cliff, but anything above 1 meter without other precautions and technologies might cause bodily harm or even death. Gravity and its effects in combination with the complexity of your business are beyond the scope of their product solutions and are entirely your responsibility. Will you be able to cover all those aspects?

Also don’t forget the people factor. Do you have the right people & skill sets at your disposal 24/7 for that time when disaster strikes? Remember that could be on a hot summer night in a weekend when they are enjoying a few glasses of wine at a BBQ party and not at 10:15 AM on a Tuesday morning.

So what terminology flies around?

They hear about asynchronous or even synchronous replication of storage or applications. Sure, it can work within a data center, depending on how well it is designed and set up. It can even work between data centers, especially for applications like Exchange 2010. But let’s face it, the technical limitations and the lack of support for this in many of the legacy applications will hinder this considerably.

They hear of things like stretched clusters and synchronous storage replication. Sure, they’ll sell you all kinds of licensed features to make this work at the storage level, with a lot of small print. Sometimes even at the cost of losing the functionality that makes the storage interesting in the first place. At the network level, anything below layer 3 probably suffers from too much optimism. Sure, stretched subnets seem nice but … how reliable are these solutions in real life?

Consider the latency and less reliable connectivity. You can and will lose the link once in a while. With active-active or active-passive data centers that depend on each other, both become single points of failure. And then there are all the scenarios where only one part of the entire technology stack that makes everything work fails. What if the application clustering survives but not the network, the storage or the database? You’re toast anyway. Even worse, what if you get into a split brain scenario and have two sides writing data? Recover from that one, my friend; there’s no merge process for that, only data recovery. What about live migration or live motion (state, storage, shared nothing) across data centers to avoid an impending disaster? That’s a pipe dream at the moment, people. How long can you afford for this to take, even if your link is 99.999% reliable? Chances are that in a crisis things need to happen fast to avoid disaster, and guess what, even in the same data center, during normal routine operations, we’re leveraging <1ms latency 10Gbps pipes for this. Are we going to get solutions that are affordable and robust? Yes, and I think the hypervisor vendors will help push the entire industry forward when I see what is happening in that space, but we’re not in Valhalla yet.

Our client server application has high availability capabilities

There are those “robust and highly available application architectures” (ahem) that only hold true if nothing ever goes wrong or happens to the rest of the universe. “Disasters” such as the server hosting the license dongle being rebooted for patching. Or, heaven forbid, your TCP/IP connection dropping some packets due to high volume traffic. No, we can’t do QoS at the individual application level, and even if we could it wouldn’t help. If your line of business software can’t handle a WAN link without serious performance impact or errors due to a dropped packet, it was probably written and tested on <1ms latency networks against a database with only one active connection. It wasn’t designed, it was merely written. It’s not because software runs on an OS that can be made highly available and uses a database that can be clustered that this application has any high availability, let alone business continuity capabilities. Why would that application be happy switching over to another link? A link that is possibly further away, running on fewer resources and quite possibly against less capable storage? For your apps to work acceptably in such scenarios you would already have to redesign them.

You must also realize that a lot of acquired and home-written software has IP addresses in configuration files instead of DNS names. Some even have IP addresses in code. Some abuse local host files to deal with hard coded DNS names … There are tons of very bad practices out there running in production. And you want business continuity for that? Not just disaster recovery, to be clear, but business continuity, preferably without dropping one beat. Done any real software and infrastructure engineering in your lifetime, have you? Keeping a business running often looks like a MacGyver episode. Lots of creativity, ingenuity, super glue, wire, duct tape and a Swiss army knife or multi tool. This is still true today. It doesn’t sound cool to admit to it, but it needs to be said.

We can make this work with the right methodologies and strict processes

Next time you think that, go to the top floor and jump off, adhering to the flight methodologies and strict processes that rule aerodynamics. After the loud thud of you hitting the deck, you’ll be nothing more than a pool of human waste. You cannot fly. On top of unrealistic scenarios, things change so fast that documentation and procedures are very often out of date as soon as they are written.

Next time some “consultants” drop in selling you products & processes with fancy acronyms, proclaiming that rigorous adherence to these will save the day, consider the following. They make a bold assumption, given the fact that they don’t know even 10% of the apps and processes in your company. Even bolder because they ignore the fact that what they discover in interviews often barely scratches the surface. People can only tell you what they actually know or dare tell you. On top of that, any discovery they do with tools is rather incomplete. If the job consists of merely pushing processes and methodologies around without reality checks, you could be in for a big surprise. You need the holistic approach here, otherwise it’s make-believe. It’s a bit like paratrooper training for night drops over enemy strongholds, to attack those and bring them down. Only the training is done in a heated classroom during meetings and on a computer. They never put on all their gear, let alone jump out of an aircraft in the dead of night, regroup, hump all that gear to the rally points and engage the enemy in a training exercise. Well people, you’ll never be able to pull off business continuity in real life either if you don’t design and test properly and keep doing that. It’s fantasy land. Even in the best of circumstances no plan survives its first contact with the enemy, and basically you would be doing the equivalent of a trooper firing his rifle for the very first time at night during a real engagement. That’s assuming you didn’t break your neck during the drop, didn’t get lost, and managed to load the darn thing in the first place.

You’re a pain in the proverbial ass to work with

Am I being too negative? No, I’m being realistic. I know reality is a very unwelcome guest in fantasy land, as it tends to disturb the feel good factor. Those pesky details are not just silly technological “manual labor” issues, people. They’ll kill your shiny plans and waste tremendous amounts of money and time.

We can have mission critical applications protected and provide both disaster recovery and business continuity. For that, the entire solution stack needs to be designed for it. While possible, this makes things expensive, and it often remains a dream for custom-written and a lot of off-the-shelf software. If you need business continuity, the applications need to be designed and written for it. If not, all the money and creativity in the world cannot guarantee you anything. In fact, at best you end up with ugly and very expensive hacks bolted onto cheap, not highly available software that poses as “mission critical”.

Conclusion

Seriously people, business continuity can be a very costly and complex subject. You’ll need to think this through. When making assumptions, realize that you cannot go forward without confirming them. We operate by the mantra “assumptions are the mother of all fuckups”, which is nothing more than the age-old “trust but verify” in action. There are many things you can do for disaster recovery and business continuity. Do them with insight, know what you are getting into, and maybe forget about doing it without one second of interruption for your entire business.

Let’s say disaster strikes and the primary data center is destroyed. If you can restart and get running again with only a limited amount of work and productivity lost, you’re doing very well. Being down for only a couple of hours or days or even a week will make you one of the top performers. Really! Try to get there first before thinking about continuous availability via disaster avoidance and automatic, autonomous failovers.

One approach to achieve this is what I call “Pandora’s Box”. If a company wants business continuity for its entire stack of operations, you’ll have to leave that box closed and replicate it entirely to another site. When you’re hit with a major, long lasting disaster, you eat the downtime and the loss of a certain delta, and fire up the entire box in another location. That way you avoid trying to micromanage its contents. You’ll fail at that anyway. For short term disasters you have to eat the downtime. Deciding when to fail over is a hard decision. Also, don’t forget about the process in reverse order. That’s another part of the ball game.

It’s sad to see that more money is spent on consultants & advisers daydreaming than on realistic planning and mitigation. If you want to know why this is allowed to happen, there’s always my series on The do’s and don’ts when engaging consultants Part I and Part II. FYI, the last guru I saw brought into a shop was “convinced” he could open Pandora’s Box and remain in control. He has left the building by now and it wasn’t a pretty sight, but that’s another story.

Hyper-V Cluster Node Pause & Drain fails – Live Migrations fail with “The requested operation cannot be completed because a resource has locked status”


One night I was doing some maintenance on a Hyper-V cluster and I wanted to Pause and drain one of the nodes that was up next for some tender loving care. But I was greeted by some messages:

image

[Window Title]
Resource Status

[Main Instruction]
The requested operation cannot be completed because a resource has locked status.

[Content]
The requested operation cannot be completed because a resource has locked status.

[OK]

Strange. The cluster is up and running, none of the other nodes had issues and, operations-wise, all VMs were as happy as can be. So what’s up? Not too much in the error logs, except for one entry related to a backup. Aha… We fire up diskpart and see some extra LUNs mounted, and using “vssadmin list writers“ we find:

clip_image002

 

 

Writer name: ‘Microsoft Hyper-V VSS Writer’
…Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
…Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
…State: [5] Waiting for completion
…Last error: Unexpected error

Bingo! Hello old “friend”, I know you! The Microsoft Hyper-V VSS Writer goes into an error state during the making of hardware snapshots of the LUNs, due to almost or completely full partitions inside the virtual machines. Take a look at this blog post on what causes this and how to fix it. As a result we can’t do live migrations anymore, or pause/drain the node on which the hardware snapshots are being taken.
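
By the way, if you want to zoom in on just that writer without scrolling through the full vssadmin output, a quick filter from an elevated PowerShell prompt does the trick. A minimal sketch (vssadmin ships with Windows; the filtering is just a convenience):

# Show only the Microsoft Hyper-V VSS Writer block with its state and last error.
vssadmin list writers |
    Select-String -SimpleMatch "Microsoft Hyper-V VSS Writer" -Context 0,4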

And yes, after fixing the disk space issue in the VM (an SDT had pumped the VM disks 99.999% full), the Hyper-V VSS Writer gets out of the error state and the hardware provider can do its thing. After the snapshots had completed everything was fine and I could continue with my maintenance.

Remote File Browsing Issue In Windows Server 2012 Hyper-V Leaves Results Pane Empty Workaround


In Windows Server 2012 the Remote File Browsing functionality for Hyper-V acts up on some nodes, indicating a problem.

You can read what “Remote File Browsing” is on TechNet here. You use it to browse the file system on a remote Hyper-V server when creating a  new VM there for example.

Remote File Browsing is a shell namespace extension implemented by Hyper-V. It provides a way to browse the folders/files on a remote Hyper-V server without requiring that server to open an extra shell over the network.

The path "::{0907616E-F5E6-48D8-9D61-A91C3D28106D}\HYPER-V-TEST" tells the shell (Explorer or the common file dialog) that it is hosting/pointing to the RemoteFileBrowsing shell namespace extension on HYPER-V-TEST. The GUID is the Hyper-V RemoteFileBrowsing shell namespace extension GUID. However, due to a limitation in the common file browser, it is not translated into "Hyper-V Remote File Browsing".

Now in Windows Server 2012 we sometimes see the following when we use it:

image

It seems to work but the results pane remains empty. The cluster is healthy, the nodes are healthy, all nodes are identically configured. Some nodes have it, others don’t. We also can’t find any errors logged anywhere.

If you try to work around it using the UNC path, that will fail due to security issues later on, so don’t even go there.

Basically we were a bit baffled (we could not reproduce it in the lab either) until we saw some posts on the forums indicating we’re not the only ones seeing this.

http://social.technet.microsoft.com/Forums/en-US/winserverhyperv/thread/608d0c3b-0a7b-4ad9-9843-5e5051dcd526

http://social.technet.microsoft.com/Forums/en-US/winserverhyperv/thread/7a34f5e1-76bc-493a-8a7a-e9f420bf6a79#d7dd4db7-d7bd-419d-aa72-b12e43cd7a5d

If you know your cluster is perfectly healthy, forget all the security settings stuff and go straight to testing this “fix”, or rather workaround: toggle Audit Object Access on and off.

In our case I can confirm that these nodes had been under a group policy that audited registry entries during a period when we were troubleshooting network card settings change behavior. We had removed that policy by first reverting the settings to Not Configured and, after some days, by removing the GPO. But that didn’t work. Even with no audit policy configured we had to go to all nodes showing this behavior, open the local Group Policy, toggle Audit Object Access on for Success, apply this and revert it to No Auditing again.
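
If you’d rather script that toggle than click through the MMC steps below, the built-in auditpol tool can probably do the same thing. An untested sketch of the equivalent, run from an elevated prompt on each affected node:

# Toggle the legacy "Audit Object Access" category on for Success, then back off again.
auditpol /set /category:"Object Access" /success:enable
auditpol /set /category:"Object Access" /success:disable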

So fire up an MMC, add a snap-in

image

Select Group Policy Object

image

Accept the defaults

image

image

When done, navigate to Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> Audit Policy -> Audit Object Access

image

Now try to use Remote File Browsing again (close & reopen all wizard windows and start over anew) to see the results:

image

Success! All is well again.

Notes:

  • We only see this on systems remotely connecting to Windows Server 2012 Hyper-V nodes that are running Windows Server 2012 or Windows 8 themselves, not on Windows 2008 R2 or Windows 7 with the RSAT for W2K12 installed.
  • This is not related to Server Core alone, i.e. to missing GUI components or the like.

Troubleshooting Windows Server 2012 host based CommVault backups with the DELL Compellent hardware VSS provider of Hyper-V guests: ‘Microsoft Hyper-V VSS Writer’ State: [5] Waiting for completion


We have been running CommVault Simpana 9.0 R2 SP7 in combination with the DELL Compellent hardware VSS provider to do host based backups of the virtual machines on our Windows Server 2012 Hyper-V cluster hosts, with great success and speed.

We’ve run into two issues so far. One, which I blogged about in DELL Compellent Hardware VSS Provider & Commvault on Windows Server 2012 Hyper-V nodes – Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied, was due to some missing permissions for the domain account we configured the Compellent Replay Manager Service to run with. The solution for that issue can be found in that same blog post.

The other one was that sometimes, during the backup of a Hyper-V host, we got an error from CommVault that put the job in a “pending” status; it kept trying and failing. The error is:

Error Code: [91:9], Description: Volume Shadow Copy Service (VSS) error. VSS service or writers may be in a bad state. Please check vsbkp.log and Windows Event Viewer for VSS related messages. Or run vssadmin list writers from command prompt to check state of the VSS writers.

clip_image001

When we look at the Compellent controller we see the following things happen:

  • The snapshots get made.
  • They are mounted briefly and then dismounted.
  • They are deleted.

The result at the CommVault end is that the job goes into a pending state with the above error. When we look at the state of the Microsoft Hyper-V VSS Writer by running “vssadmin list writers” …

image

… from an elevated command prompt we see:

Writer name: ‘Microsoft Hyper-V VSS Writer’
…Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
…Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
…State: [5] Waiting for completion
…Last error: Retryable error

Note at this stage:

  1. Resuming the job doesn’t help (it actually keeps trying by itself, but no joy).
  2. Killing the job and restarting brings no joy either. On top of that, our friendly error “Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied.“ is back, but this time related to the error state of the ‘Microsoft Hyper-V VSS Writer’. The error has now changed a little and has become:

clip_image002

 

 

Writer name: ‘Microsoft Hyper-V VSS Writer’
…Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
…Writer Instance Id: {2fa6f9ba-b613-4740-9bf3-e01eb4320a01}
…State: [5] Waiting for completion
…Last error: Unexpected error

To get rid of this one we can restart the host or, less drastically, restart the Hyper-V Virtual Machine Management Service (VMMS.exe), which will do the trick as well. Before you do this, drain the node when you pause it, then resume it with the option of failing back the roles. Windows Server 2012 makes it a breeze to do this without service interruption.

image

clip_image003

image
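
The same sequence can be scripted. A minimal sketch, assuming the Failover Clustering PowerShell module on Windows Server 2012 and a hypothetical node name of NODE1 (run the service restart on the node itself):

# Drain the roles off the node, restart VMMS, then resume with immediate failback.
Suspend-ClusterNode -Name NODE1 -Drain        # pause the node & live migrate the roles away
Restart-Service vmms -Force                   # restart the Hyper-V Virtual Machine Management Service
Resume-ClusterNode -Name NODE1 -Failback Immediate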

The Cause: Almost or completely full partitions inside the virtual machines

Looking for solutions when CommVault is involved can be tedious, as their consultancy driven sales model isn’t focused on making information widely available. Troubleshooting VSS issues can also be considered a form of black art at times. Since this is Windows Server 2012 RTM and the date of writing is September 20th 2012, there are not yet any hotfixes related to host level backups of virtual machines and such. CommVault Simpana 9.0 R2 SP7 is also fully patched.

This, combined with the fact that we did not see anything like this during testing (and we did a fair amount), makes us look at the guests. That’s the big difference on a large production cluster: all those unique guests with their own history. We also know from the past years with VSS snapshots in Windows 2008(R2) that these tend to fail due to issues in the guests. Take a peek at Troubleshoot VSS issues that occur with Windows Server Backup (WBADMIN) in Windows Server 2008 and Windows Server 2008 R2 just for starters. As an example, we had already seen one guest (a dev/test server with 5 users logged in doing all kinds of reconfigurations and installs) go into saved state during a backup, so it could be due to something rotten in certain guests. There is very much to consider when doing these kinds of backups.

By comparing successful & failed backups it really looks as if it is related to certain virtual machines. A lot of issues are caused by the VSS service not running or not being able to make snapshots because of a lack of space, so perhaps this was the case here as well?

We poked around a bit. First, let’s see what we can find in the Hyper-V specific logs, like the Microsoft-Windows-Hyper-V-VMMS-Admin event log. Ah, lots of errors relating to a number of guests!

image

Log Name:      Microsoft-Windows-Hyper-V-VMMS-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          19/09/2012 22:14:37
Event ID:      10102
Task Category: None
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      undisclosed server
Description:
Failed to create the volume shadow copy inside of virtual machine ‘undisclosedserver’. (Virtual machine ID 84521EG0G-8B7A-54ED-2F24-392A1761ED11)
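
You can pull these out quickly with PowerShell instead of scrolling through the event viewer. A minimal sketch, filtering the VMMS admin log on event ID 10102:

# List the most recent "failed to create the volume shadow copy" events (ID 10102).
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-Hyper-V-VMMS-Admin'
    Id      = 10102
} -MaxEvents 50 | Format-Table TimeCreated, Message -AutoSize -Wrap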

Well people, that is called a clue. So we used Live Migration to isolate the suspect VMs on a single node, ran backups, saw them fail, did the same with a new and clean VM and it all worked. And indeed … looking at the guests involved when the CommVault backup fails, we see that the VSS service is running and healthy, but we do see all kinds of badness related to disk space:

  • Large SQL Server backup files put aside on the system partition or other disks.
  • Application & service pack installers left behind.
  • Log and tempdb volumes running out of space.
  • Application logs running out of control.

That latter one left 0MB of disk space on the system (a TFS Test Controller shitting itself), but we managed to clear just enough to get to just over 1GB of free space, which was enough to make the backup succeed.

clip_image001[8]

image

Servers, virtual or physical, should be locked down to prevent such abuse. I know, I know. Did I already tell you I do not reside in a perfect world? We cannot protect against dev and test server admins who act without much care on their servers. We’ll just keep hammering at it to raise their awareness, I guess. For end users and production servers we monitor things well enough to proactively avoid issues. With dev & test servers we don’t, or the response team would have a day’s work reacting to all the alerts that daily dev & test usage on those servers generates.

The fix

  • Clear at least 1GB, or a bit more, inside each partition in the guests running on the host that has a failing backup. I prefer to have at least a couple of GB free (10% to 15% => give yourself some headroom, people). See the sketch after this list for a quick way to check free space inside a guest.
  • Then you can resume the backup job manually, or let CommVault do that for you if it’s still in a pending state.
  • If you’ve killed the job, make sure you restore the Microsoft Hyper-V VSS Writer to a healthy state as described above. Thanks to Live Migration this can be achieved without any downtime.
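
As promised, here’s a quick way to eyeball free space inside a guest. A hypothetical sketch, assuming PowerShell remoting is enabled and GUEST01 is a placeholder for your VM’s name:

# Report free space for every local disk inside the guest.
Invoke-Command -ComputerName GUEST01 -ScriptBlock {
    Get-WmiObject Win32_LogicalDisk -Filter "DriveType=3" |
        Select-Object DeviceID,
            @{ n = 'FreeGB';  e = { [math]::Round($_.FreeSpace / 1GB, 1) } },
            @{ n = 'PctFree'; e = { [math]::Round(100 * $_.FreeSpace / $_.Size, 1) } }
}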

Conclusion

There is experimenting, testing, production testing, production, and finally real life environments where not everything is done as it should be. Yes, really, the world isn’t perfect. Managers sometimes think it’s click, click, Next, click and voilà, we’ve got a complex multisite system running. Well, it isn’t like that, and you need some time and skills to make it all work. Yes, even in today’s “cheap, fast, easy, run your business from your smartphone” ecosystem of the private, hybrid and public cloud, where all is bliss and world peace reigns.

The DELL Compellent hardware VSS provider & Replay Manager service handle all this without missing a beat, which is very comforting. Previous experiences with hardware VSS providers from other vendors make us think those would probably have blown up by now.

How To Deploy Windows Server 2012 on DELL UEFI Now – Notes From The Field


The most current UEFI OS deployment on an R810 is a bit finicky when you want to deploy Windows Server 2012 using the normal procedure & selecting “Other OS”, as it’s obvious that the entry for Windows Server 2012 is not in there yet. The problem is that the Windows installer doesn’t seem to create the best practice UEFI partitions. It just seems to create a 320MB System Reserved partition, and the rest goes to your OS installation as a primary partition. In a good (by the book UEFI) install you’d see a layout like this (from Sample: Configure UEFI/GPT-Based Hard Drive Partitions by Using Windows Setup):

image

image

The reason for this seems to be that the firmware is not yet 100% up to date for how Windows Server 2012 deals with UEFI installations. This I learned via my very helpful Twitter friend Florian Klaffenbach.

While an update for the system firmware is in the works and won’t be too long away, let me share how I dealt with this issue. It’s a bit more work but it gets the job done. At least for me, on an R810 with BIOS version 2.7.4.

I’m copying the step-by-step from the Microsoft Windows Server 2012 Early Adopter Guide – Dell here and adapting it to how I worked around the issue. It’s “magic”.

Installing Using Dell Unified Server Configurator

  1. Connect the keyboard, monitor, mouse, and any additional peripherals to your system.
  2. Turn on the system and the attached peripherals.
  3. Press <F10> in the POST to start the System Services. The Initializing UEFI. Please wait… and the Entering System Services…Starting Unified Server Configurator messages are displayed.
  4. In the Unified Server Configurator window, if you want to configure hardware, diagnostics, or change settings, click the appropriate option. If no changes are required, click OS Deployment. => You can opt to start with a cleanly built VDisk, which is best and should suffice. But it doesn’t. We’ll clean the disk later on in step 14 anyway.
  5. In the Operating System Deployment window, click Deploy OS. The Configure or Skip RAID window is displayed. If Redundant Array of Independent Disks (RAID) is configured, the window displays the existing RAID configuration details.
  6. Select Go directly to OS Deployment. If RAID is not yet configured, configure it at this time.
  7. Click Next. The Select Operating System window is displayed with a list of compatible operating systems.
  8. Choose Microsoft Windows Server 2012 and click Next. NOTE: If Microsoft Windows Server 2012 is not listed, choose any other operating system.
  9. Choose whether you want to deploy the operating system in UEFI or BIOS mode, and click Next. => I do not get this choice if UEFI is already on in the BIOS settings.
  10. In the Insert OS Media window, insert the Windows Server 2012 media and click Next.
  11. In the Reboot the System screen, follow the instructions on the screen and click Finish. If a Windows operating system is already installed on your system, the following message is displayed: Press any key to boot from the CD/DVD … Press any key to begin the installation. If you used a clean VDisk this is no issue.
  12. In the Windows Setup screen, select the appropriate option for Language, Time and Currency Format, and Keyboard or Input Method.
  13. Click Next to continue.
  14. STOP => Select to REPAIR your system and launch a command line. From there, start diskpart and run the following commands on the disk where you want to deploy Windows Server 2012:
    • select disk 0
    • clean
    • convert gpt

      In my case this is Disk 0. This is what the installer should be able to do automatically with a clean disk anyway, but it doesn’t happen.

      Now DO NOT navigate to the X:\ root and launch setup again. Just exit the repair console and shut down the server.

  15. Start the server
  16. Press <F10> in the POST to start the System Services. The Initializing UEFI. Please wait… and the Entering System Services…Starting Unified Server Configurator messages are displayed. => DO NOT TOUCH ANYTHING ANYMORE. It will take longer than expected, but you will boot into the Windows Server 2012 installation again.
  17. In the Windows Setup screen, select the appropriate option for Language, Time and Currency Format, and Keyboard or Input Method.
  18. Click Next to continue.
  19. On the next page, click Install Now.
  20. In the Operating System Install screen, select the operating system you want to install. Click Next. The License Terms window is displayed, click Next.
  21. In the Which Type of Installation Do You Want screen, click Custom: Install Windows only (advanced), if it is not selected already.
  22. In the Where do you want to install Windows screen, specify the partition on which you want to install the operating system. To create a partition and begin installation:
    1. Click New
    2. Specify the size of the partition in MB, and click Apply. A Windows might create additional partitions for system files message is displayed. => NOW the UEFI partitions on the GPT disk are created.
    3. Click OK. Select the newly created operating system partition and click Next.
      The Installing Windows screen is displayed and the installation process begins. After the operating system is installed the system reboots. You must set the administrator password before you can log in for the first time.
  23. In the Settings screen, enter the password, confirm the password, and click Finish.
    The operating system installation is complete.
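
Once you’re logged in, you can quickly verify that you got the by-the-book layout this time. A minimal sketch using the Windows Server 2012 Storage cmdlets, assuming the OS disk is disk 0:

# Confirm the disk is GPT and list the partitions (expect EFI System, MSR and the OS partition).
Get-Disk -Number 0 | Select-Object Number, PartitionStyle
Get-Partition -DiskNumber 0 |
    Select-Object PartitionNumber, Type, @{ n = 'SizeMB'; e = { [math]::Round($_.Size / 1MB) } }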

image

Now, while this worked for me on the Dell R810 with BIOS 2.7.4, I give no guarantees whatsoever. You’ll have to test it yourself or wait for the firmware update that is coming soon. Anyway, perhaps it helps some of you out there!

Fixing Hiccups in The SCVMM2008R2 GUI & Database


As you might very well know from experience, sometimes the System Center Virtual Machine Manager GUI and database get out of sync with what’s really going on in the cluster. I’ve blogged about this before in SCVMM 2008 R2 Phantom VM guests after Blue Screen and in System Center Virtual Machine Manager 2008 R2 Error 12711 & The cluster group could not be found (0x1395).

The Issue

Recently I had to troubleshoot the “Missing” status of some virtual machines on a Hyper-V cluster in SCVMM 2008 R2. Rebooting the hosts, rebooting the guests, restarting agents, … none of the usual tricks for this behavior seemed to work. The SCVMM 2008 R2 installation was also fully up to date with service packs & patches, so the issue didn’t originate there.

Repair was greyed out and of no use. We could have removed the host from SCVMM and added it again. That resets the database entries for that host and can help fix the issues, but it still isn’t guaranteed to work and you don’t learn what the root cause or solution is. We could also have deleted the VMs from the database, but we didn’t have duplicates. Sure, this doesn’t delete any files or the VM itself, so it should show up again afterwards, but why risk it not showing up again and having to go through fixing that. None of our usual tricks worked.

The Cause

The VMs were in a “Missing” state after an attempted live migration during a manual patching cycle where the host was restarted before the “start maintenance mode” action had completed. A couple of those VMs were also live migrated at the same time with the Failover Cluster GUI. A bit of confusion all around, so to speak, but luckily all VMs were fully operational and servicing applications & users, so no crisis there.

The Fix

DISCLAIMER

I’m not telling you to use this method to fix this issue, but you can at your own risk. As always, please make sure you have good and verified backups of anything that’s of value to you.

We had to investigate. The good news was that all VMs were up and running, there was no downtime and the cluster seemed perfectly happy.

But there we see the first clue. The virtual machines on the cluster are not running on the nodes SCVMM thinks they are running on, hence the “Missing” status.

First of all, let’s find out what host the VM is really running on in the cluster and what host SCVMM thinks the VM is running on. We run this little query against the VMM database, which gives us all hosts known to SCVMM:

SELECT [HostID],[ComputerName] FROM [VMM].[dbo].[tbl_ADHC_Host]

HostID                                                                        ComputerName

559D0C84-59C3-4A0A-8446-3A6C43ABF618          node1.test.lab

540C2477-00C3-4388-9F1B-31DBADAD1D8C        node2.test.lab

40B109A2-9E6B-47BC-8FB5-748688BFC0DF         node3.test.lab

C2DA03CE-011D-45E3-A389-200A3E3ED62E        node4.test.lab

6FA4ABBA-6599-4C7A-B632-80449DB3C54C         node5.test.lab

C0CF479F-F742-4851-B340-ED33C25E2013          node6.test.lab

D2639875-603F-4F49-B498-F7183444120A             node7.test.lab

CE119AAC-CF7E-4207-BE0B-03AAE0371165         node8.test.lab

AB07E1C2-B123-4AF5-922B-82F77C5885A2           node9.test.lab

(9 row(s) affected)

Voilà, and now the fun starts. The SCVMM GUI tells us “MissingVM” is missing on node4.

We check this in the database to confirm:

SELECT Name, ObjectState, HostId
FROM VMM.dbo.tbl_WLC_VObject
WHERE Name = 'MissingVM'
GO

Which is indeed node4:

Name       ObjectState  HostId

MissingVM  220          C2DA03CE-011D-45E3-A389-200A3E3ED62E

(1 row(s) affected)
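
If several VMs are in a Missing state, you can save yourself the manual cross-referencing with a join over both tables. A hedged PowerShell sketch using plain .NET; the server name SQLSERVER01 is a placeholder, and I’m assuming the default dbo schema and read access to the VMM database:

# List every VM with its ObjectState and the host SCVMM thinks it is running on.
$cn = New-Object System.Data.SqlClient.SqlConnection 'Server=SQLSERVER01;Database=VMM;Integrated Security=SSPI'
$cn.Open()
$cmd = $cn.CreateCommand()
$cmd.CommandText = 'SELECT v.Name, v.ObjectState, h.ComputerName FROM dbo.tbl_WLC_VObject v JOIN dbo.tbl_ADHC_Host h ON h.HostID = v.HostId'
$rd = $cmd.ExecuteReader()
while ($rd.Read()) { '{0}  {1}  {2}' -f $rd['Name'], $rd['ObjectState'], $rd['ComputerName'] }
$cn.Close()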


In SCVMM we see that the move of the VM failed, between node4 and node6.

image

Now let’s take a look at what the cluster thinks … yes, there it is, running happily on node6 and not on node4. There’s the mismatch causing the issue.

So we need to fix this. We can live migrate the VM with the Failover Cluster GUI to the node SCVMM thinks the VM still resides on and see if that fixes it. If it does, great! You do have to give SCVMM some time to detect everything and update its records.

But what to do if it doesn’t work out? We can take the HostId of the node where the VM is really running in the cluster (which we can see in the Failover Cluster GUI) from the query we ran above and then update the record:

UPDATE VMM.dbo.tbl_WLC_VObject
SET HostId  = 'C0CF479F-F742-4851-B340-ED33C25E2013'
WHERE Name = 'MissingVM'
GO

We then reset the ObjectState to 0 to get rid of the Missing status. It would happen automatically, but it takes a while.

UPDATE VMM.dbo.tbl_WLC_VObject
SET ObjectState = '0'
WHERE Name = 'MissingVM'
GO

After some patience & refreshing all is well again, and a test with live migrations proves that everything works again.

As I said before, people get creative in how they achieve things, and the inconsistencies & differences in functionality between Hyper-V Manager, Failover Cluster Manager and SCVMM 2008 R2 can lead to some confusing situations. I’m happy to see that in Windows 8 the actions you should perform using the Failover Cluster GUI or PowerShell are blocked in Hyper-V Manager. But SCVMM really needs a “reset” button that makes it check & validate whether what it thinks matches reality.

Integration Services Version Check Via Hyper-V Integration/Admin Event Log


I’ve written before (see "Key Value Pair Exchange WMI Component Property GuestIntrinsicExchangeItems & Assumptions") on the need for, and ways with PowerShell, to determine the version of the integration services or integration components running in your guests. These need to be in sync with the version running on the hosts. Meaning that all the hosts in a cluster should be running the same version, as well as the guests.

During an upgrade with a service pack this gets the necessary attention, and scripts (PowerShell) are written to check versions and create reports, so normally you end up with a pretty consistent cluster. Over time, virtual machines are imported, inherited from another cluster or created on a test/developer host and shipped to production. I know, I know, this isn’t something that should happen, but I don’t always have the luxury of working in a perfect world.

Enough said. This means you might end up with guests that are not running the most recent version of the integration services. Apart from checking manually in the guest (which is tedious; see my blog "Upgrading a Hyper-V R2 Cluster to Windows 2008 R2 SP1" on how to do this) or running the previously mentioned script, you can also check the Hyper-V event log.

Another way to spot virtual machines that might not have the most recent version of the integration services is via the Hyper-V logs. In Server Manager you drill down via “Diagnostics” to “Event Viewer” and then navigate your way through "Applications and Services Logs", "Microsoft", "Windows" until you hit “Hyper-V-Integration”.

image

Take a closer look and you’ll see the warning about 2 guests having an older version of the integration services installed.

image

As you can see, it records a warning for every virtual machine whose integration services are older than those of the host running Hyper-V. This makes it easy to grab a list of guests needing some attention. The downside is that you need to check all hosts; not too bad for a small cluster, but not very efficient on larger ones.
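
That per-host legwork is easy to script, though. A minimal sketch that collects the warnings from every node of a hypothetical cluster named HVCLUSTER; note the exact log name can differ per OS version, so check yours with Get-WinEvent -ListLog *Hyper-V-Integration*:

# Collect integration services warnings from the Hyper-V-Integration admin log on all nodes.
$nodes = Get-ClusterNode -Cluster HVCLUSTER | Select-Object -ExpandProperty Name
foreach ($node in $nodes) {
    Get-WinEvent -ComputerName $node -FilterHashtable @{
        LogName = 'Microsoft-Windows-Hyper-V-Integration-Admin'
        Level   = 3   # 3 = warning
    } -ErrorAction SilentlyContinue |
        Select-Object @{ n = 'Node'; e = { $node } }, TimeCreated, Message
}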

So just remember this as another way to spot virtual machines that might not have the most recent version of the integration services. It’s not a replacement for some cool PowerShell scripting or the BPA tools, but it is a handy, quick way to check the version for all the guests on a host when you’re in a hurry.

It would be nice if integration services version management became easier in the future. Meaning a built-in way to report on the versions in the guests and an easier way to deploy these automatically when they’re not part of a service pack (this is the case when the guest OS and the host OS differ, or when you can’t install the SP in the guest for some application compatibility reason). You can do this in bulk using SCVMM and, of course, scripting with PowerShell comes to the rescue here again, especially when dealing with hundreds of virtual machines in multiple large clusters. Orchestration via System Center Orchestrator can also be used. Integration with WSUS would be another nice option for those that don’t have Configuration Manager or Orchestrator, but as far as I know that’s not supported for now.