Fixing Two Small DELL Compellent Hardware Hiccups


Here’s two little tips to solve some small hardware issues you might run into with a Compellent SAN. But first, you’re never on your own with CoPilot support. They are just one phone call away so I suggest if you see these to minor issues you give them a call. I speak from experience that CoPilot rocks. They are really good and go the extra mile. Best storage support I have ever experienced.

Notes

  • Always notify CoPilot as they will see the alerts come in and will contact you for sure Smile. Afterwards they’ll almost certainly will do a quick health check for you. But even better during the entire process they keep an eye on things to make sure you SAN is doing just fine. And if you feel you’d like them to tackle this, they will send out an engineer I’m sure.
  • Note that we’re talking about the SC40 controllers & disk bays here. The newer genuine DELL hardware is better than the super micro ones.

The audible alert without any issues what so ever

We kept getting an audible alert after we had long solved any issues on one of the SANs. The system had been checked a couple of times and everything was in perfect working order. Except for that audible alarm that just didn’t want to quit. A low priority issue I know but every time we walk into the data center we were going “oh oh” for a false alert. That’s not the kind of conditioning you want. Alerts are only to be made when needed and than they do need to be acted upon!

Working on this with CoPilot support we got rid of it by reseating the upper I/O module. You can do this on the fly – without pulling SAS-cables out or so, they are redundant, as long as you do it one by one and the cabling is done right (they can verify that remotely for you if needed).

image

But we got lucky after the first one. After the “Swap Clear” was requested  every warning condition was cleared and we got rid of the audible alert beep!  Copilot was on the line with us and made sure all paths are up and running so no bad things could happen. That’s what you have a copilot for.

Front panel display dimming out on a Compellent Disk Bay

We have multiple Compellent SANs and on one of those we had a disk bay with a info panel that didn’t light up anymore. A silly issue but an annoying one as this one also show you the disk bay ID.

image

Do we really replace the disk bay to solve this one? As that light had come on and of a couple of time it could just be a bad contact so my colleague decided to take a look. First  he removed the protective cover and then, using some short & curved screw drivers, he took of the body part. The red arrow indicates the little latch that holds the small ribbon cable in place.

image

That was standing right open. After locking that down the info appeared again on the panel. The covers was screwed on again and voila. Solved.

TechNet Top Support Solutions From Microsoft Support Blog


As this year comes to an end I’d like to draw your attention to Microsoft’s new Top Support Solutions blog on TechNet. It was created this as part of their continuous efforts to keep the various  technical communities informed about the most relevant answers to the top questions or issues experienced with their products. They identify these top issues by analyzing the question in their forums and their other support channels.

image

So if you need to find answers for your self or your customers go take a look at the "Top Solutions Content" blog. Changes are you’ll find valuable information about the Microsoft top support solutions for several of their popular products in Server and Tools. It might save you and your clients or manager a lot of time, effort and money. It’s also a great resource to make your colleagues, community, user group or clients aware of.

DELL Server DRAC Card Soft Reset With Racadmin


Sometimes a DRAC goes BOINK

Sometimes a DRAC (Dell Remote Access Card) can give you issues. Sometimes it’s some lingering process or another hiccup that causes this. You can try a reboot but that doesn’t always fix the issue. You can go into the BIOS and cancel any running System Services. A “confused” DRAC card can also be fixed by shutting down the server and cutting power for 5 to 10 minutes. That’s good to know as a last resort but not very feasible a lot of times, bar a maintenance window when you’re on premise.

You can also try to do a local or a remote reset of the DRAC card via OpenManage  (OMSA), racadmin. See RACADM Command Line Interface for DRAC for more information on how and when to use this tool. The racadmin can be used for a lot of remote configuration and administration and one of those is a “soft reset” or basically a powercycle, aka reboot, of the drac card itself. Don’t worry your server stays up Smile.

Local: racadmin racreset soft

Remote: racadm -r <ip address> -u <username> -p <password> racreset soft

Real life example

I was doing routine maintenance on 4 Hyper-V clusters and as part of that DUPs (Dell update packages) were being deployed to upgrade some firmware. This can be automated nicely via Cluster Aware Updating and the logging option will help you pin point the issue. See http://workinghardinit.wordpress.com/2013/01/09/logging-cluster-aware-updating-hotfix-plug-in-installations-to-a-file-share/ for more information on this.

Just like we found that the DRAC upgrade was not succeeding on two nodes.

One it was due to the DUP not being able to access the Virtual USB Device

Software application name: iDRAC6
   Package version: 1.95
   Installed version: 1.92

Executing update…

Device does not impact TPM measurements.

Device: iDRAC6, Application: iDRAC6
  Failed to access Virtual USB Device

==================> Update Result <==================

Update was not applied

================================================

Exit code = 1 (Failure)

and the other was because there was some other lingering DRAC process.

 iDRAC is currently unable to process this request because of another task.
  Please attempt one or more of the following steps to cancel the pending iDRAC task:
  1) Wait 30 minutes and retry your request.
  2) Reboot the system; Press F10; select ‘Exit and Reboot’ from Unified Server Configurator, and retry your request.
  3) Reboot the system; Press Ctrl-E; select ‘System Services’. Then change ‘Cancel System Services’ to YES, which will close the pending task;
      Then press Enter at the warning message. Press ESC twice and select ‘Save Changes and Exit’ and retry your request.

==================> Update Result<==================

Update was not applied

================================================
Exit code = 1 (Failure)

They give some nice suggestions but the racreset is another nice one to have I your toolkit. It’s fast and effective.

Run racadmin racreset soft

image

Wait for a couple of minutes and then run the DUP or the items in SUU that failed. With some luck this will succeed now.

image

Great Hardware Support Equals Fast Windows 2012 R2 Implementation


I love it when a plan comes together

We adopted Windows 2012 when right after it went RTM in august 2012. Today we’re are already running Windows 2012 R2 and ready to step up the pace. If you are a VAR/ISV that does not have fast & good support for Windows Server 2012 R2 consider this your notice. You can’t lead from behind. Get your act together and take an example from Altaro. Small, sure, but good & fast. How do we get our act together so fast? Fast? Yes, but it does take time and effort.

As it turns out, we’re pretty well of with the DELL hardware stack. The generation 11 and 12 servers are supported by DELL and on the Windows Server Catalog for Windows Server 2012 R2.

image

For more information on Dell Server inbox driver support see: Windows Server 2012 R2 RTM Inbox Driver Support on Dell PowerEdge Servers. By the way I can testify that we’ve run Windows Sever 2012 R2 successfully on 9th Generation hardware (PowerEdge 1950/2950).

We’ve been running tests since Windows 2012 R2 Preview on R710/R720 and it has been a blast. We’ve kept them up to date with the latest firmware & drives via SUU. And for our Intel X520 and Mellanox ConnectX-3 we’ve had rapid support as well.

So what more could you want? Well support from your storage array vendor I would think. I’m happy to report that Storage Center 6.4 has been out since October 8th and it supports Windows Server 2012 R2. Dell Compellent Adds MLC SSD Tier – Bests 15K HDDs on Price and Performance. Mind you on a lazy Sunday afternoon 2 quick e-mails to CoPilot got me the answer that Storage Center 6.3.10 also supports Windows Server 2012 R2.  Sweet!

And that’s not just DELL, the Dell Compellent Storage Center 6.4 is fully Windows Server 2012 R2 logo certified! That’s what you want to see from you vendor. Fast & excellent support.

image

Here’s the entire DELL hardware line up with Windows Server 2012 R2 support. Happy upgrading & implementation! If you have Software Assurance you’re set to reap the benefits of that investment today!

To my all employers / clients, you see now, told you so. Now, I have a thing in common with Col. Hannibal Smith, I love it when a plan comes together Open-mouthed smile.

image

I know some of you think that all the testing, breaking, wrecking of Preview bits, RTM & GA versions we do looks like chaos. Especially when you visually add the test server & switch configurations. But that’s what it looks like to YOU. To the initiated this is well executed plan, dropping all assumptions, to establish what works & will hold up. The result is that we’re ready today and by extension, so are you.

First Windows Server 2012 Cluster/Hyper-V related Patches


With November 2012 Patch Tuesday having come and gone, the first hotfixes (it’s a cumulative update) related to Windows Server 2012 are available. These are relevant to both Hyper-V & Failover clustering (Scale Out File Server)  There is also an older hotfix that has been brought to our attention that related to certain versions Windows Server 2008/R2 domain controllers,which is also important for Windows Server 2012 Clustering. None of these are urgent/critical and only apply in specific circumstances but it’s good to keep up with these and protect your environment..

Windows 8 and Windows Server 2012 cumulative update: November 2012

http://support.microsoft.com/kb/2770917: A collection of small changes – for HA VMs (Hyper-V on Cluster) there are three minor CSV file system fixes in this Hotfix : Improves clustered server performance and reliability in Hyper-V and Scale-Out File Server scenarios. Improves SMB service and client reliability under certain stress conditions.

Error code when the kpasswd protocol fails after you perform an authoritative restore: “KDC_ERROR_S_PRINCIPAL_UNKNOWN”

http://support.microsoft.com/kb/976424: Install on every domain controller running Windows Server 2008 Service Pack 2  or Windows Server 2008 R2 in order to add a Windows Server 2012 failover cluster. This is included in Windows Server 2008 R2 Service Pack 1. So just see if you need this fix in your environment or not.

I’m happy to see Microsoft acting fast on these issues,, even if not critical, to serve & protect their customers deployments.

KB 2636573: Guest Crashes with Win2008R2 RTM/SP1 STOP 0xD1 in storvsc!StorChannelVmbusCallback During Live Migration


The BSOD

I helped hunt down this bug and tested the private fix. Some months ago, during the summer of 2011, I was putting some new Hyper-V clusters under stress tests. You know, letting it work very hard for a longer period of time to see if anything falls off or goes “boink". It all looked pretty robust and and after some tweaking also very fast. Just when you’re about to declare “we’re all set” here you see a BSOD on one of the guests that’s being live migrated happily announcing: “DRIVER_IRQL_NOT_LESS_OR_EQUALSTOP: 0x000000D1 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)”

image

Now that doesn’t make ME very happy however. So I investigate to see if there are any more VMs dropping dead during live migration but we don’t see any. Known issues like out of date versions of the integration tools or the like are not in play nor are any other possible suspects.

We throw the MEMORY.DMP file in the debugger and we come up with the following culprit:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)

The driver probably at fault is storvsc.sys

Probably caused by : storvsc.sys ( storvsc!StorChannelVmbusCallback+2b8 )

Hmmmmmmm, we start searching the internet but we don’t find much. We also throw it on to Twitter to see if the community comes up with something. Meanwhile we keep looking and find this little blog post by a Microsoft support engineer Rob Scheepens:

http://blogs.technet.com/b/dip/archive/2011/10/21/win2008r2-rtm-stop-0xd1-in-storvsc-storchannelvmbuscallback-0x2b5.aspx

We pinged Rob and opened a case with MS support. That evening Hans Vredevoort (www.hyper-v.nu), who saw my tweet, mails me with the details of a fellow MVP in the USA having this same issue. We get in contact an via both Microsoft & the Hyper-V community we start hunting the cause of this bug. The progress on this issue can be read at the Microsoft blog above. You’ll notice that the fix is in the works now.

Hunting down the STOP error

What did we establish:

  • It only happens occasionally with a live migration and it rather ad hoc, not every time, not after X amount of live migrations or X amount of up time.
  • It seems sometimes to happens only with guests running dynamically expanding VHDs attached to ISCSI controllers in Hyper-V.  But that’s not really the case as I remember one being with  fixed VHD attached to an SCSI controller. In our case the VMs we could reproduce the issue with in a reasonable time were all SQL Server test and development guests running SQL Server 2008 where the dynamically expanding disks are used as “poor man’s thin provisioning”.
  • I have not heard of this on Windows 2008 hosts, only R2, but I have not tested this.

So it’s reproducible but it takes intensive live migration activity. Meanwhile we received private instrumentation to install on both guests & hosts to collect “enriched” memory.dumps when a guest experiences a BSOD. With PowerShell we have continuous live migration running to reproduce the issue. The fact that can live migrate over 10Gbps does help Smile. Because you can get lucky but in reality needs many hundreds of live migration to reproduce it. On some machines many thousands. Not a joke but I total we did 8000 Live migrations to test the fix and we did about 12000 to reproduce the issue on several VMs with different configurations to send memory dumps to MSFT. So yes, you really need some PowerShell and having a 10Gbps Live Migration network also helps Winking smile.

All the collected MEMORY.DMP files form these live migration exercises were uploaded to Microsoft for analysis. That took a while, also because they had a boatload of live migrations to do and I don’t know if their test lab has 10 Gbps.

On Tuesday the 25th of October Microsoft contacts us with good news. They have root-caused the problem and a hotfix is in the works. You can download that here http://support.microsoft.com/kb/2636573

On Thursday 27th of October we get access to a private fix and after installing this one we’ve been running thousands of live migrations without  seeing the issue.

The public release of this hotfix is currently planned (HTP11-12) under KB2636573.

The details for the curious

Root Cause?

The root cause can be summarized as follows: “StorVSP was modifying guest memory while the VM’s virtual motherboard was being powered off.” Doing this storvsc access a NULL pointer in a memory buffer that is already freed up. The result of this is a BSOD or STOP error in the virtual machine.

Only SCSI attached VHDs

OK but why do we only see this with SCSI attach VHDs? Now the issue happens during power down of the virtual machines’ mother board because there is a disk enumeration during the shut down phase. And this enumeration only happens with SCSI disks.

Right! So the more VHDs we had attached to SCSI controllers in a virtual machine the higher the likelihood of this happening.

Why so much more likely with dynamically expanding VHDs

But still we saw this exponentially more with dynamically expanding disks. Why is that? Well it’s not that dynamically expanding disks trigger disk enumeration more than fixed disks.  However it seems that any disk expansion, which causes write delays, can lead to a timing issue that will cause the disk enumeration to hit the issue described above. So this significantly increases the risk that the STOP error will happen and it explains that the chance this will happen with fixed VHDs attached to SCSI controllers is significantly lower. This is sync with what we saw. The virtual machines with a lot of dynamic disks attach to SCSI controllers that had a lot of activity (and thus potential for expanding) is the ones where we could reproduce this the fasted.

Conclusion

It can take some time to hunt down certain bugs, especially the rare ones that only happen every now and then so occurrences are few and far between. But when you put in some effort Microsoft helps out and works on a fix. And no you don’t have to have the most expensive support contract for that to happen. As a  matter of fact this call was logged under a free support call with the TechNet Plus subscription. And as it was a bug, they return it as unused.

Using Host Names in IIS in Combination with a KEMP LoadMaster


At a client the change over of a web site from old servers to new ones lead to the investigation of an issue with the hardware load balancer. Since that web site is related to an existing surveyors solutions suite that already had a KEMP LoadMaster 2200 in use the figured we’d also use it for the web site and no longer use WNLB.

Now the original web site had multiple DNS entries and host header names defined in IIS (see Configure a Host Header for a Web Site (IIS 7)) . Host header names in IIS allow you to host multiple web sites on an IIS server using the same IP address and port. A small added security benefit is that surfing on IP address fails which means we marginally disrupt some script kiddies & get an extra security checkbox marked during an audit Winking smile.

In our example we needed:

Note: The real names have been changed as well as the reasons why as this has some business & historical justifications that don’t matter here.

ntrip.surveyor.lab needs to be handled by the load balanced web servers in the solution. The http://www.surveyor.lab needs to be redirected to another web server to keep the business happy. However for political reasons we have to keep the DNS record for http://www.surveyor.lab pointing to the load balanced servers, i.e. the load master VIP.

Now without host names IIS al worked fine until we wanted to use HTTP redirect. As the web site is the same IP address for both names we either redirected them both or none. To fix this we needed two sites in IIS. The real one hosting ntrip.surveyor.lab and a “fake” one hosting the http://www.surveyor.lab that we want to redirect. Well as both are hosted on the same IP address and port on the IIS server we need to use host names. But then the sites became unavailable.

When checking the LoadMaster configuration, the virtual service for the web servers seemed well.

image

Is this a limitation of hardware load balancing or this specific Loadmaster? Some searching on the internet made it look like I was about the only on on the planet dealing with this issue so no help there.

Kemp Support Rocks

I already knew this but this experience reaffirms it. KEMP Technologies really does care about their customers and are very fast & responsive. I threw a quick question on twitter to @KempTech on Twitter and they responded very fast with some pointers. After that I replied with some more details, they offered to take it on via other means as twitter has it limits. OK, no problems. The next morning I got an e-mail from one of their engineers (Ekkehard) with more information and a request for more input from our side. I quickly made a VISIO diagram of the current and the desired situation. Based on this he let me know this should work.

image

He asked for a copy of the configuration and already pointed to the solution:

And what exactly happens – does the RS turn “red” in the “View/Modify Services” view? That might be caused by the health check settings…
(Remember that a 302 is considered NOT ok, so you had to enter the proper check URL and or / HTTP1.1 hostname)

But at that moment I did not realize this yet. I saw no error or the real server turning red indicating it was down. So we went through the configuration and decided to test without forcing layer 7 to see what happened. This didn’t make a difference and it wasn’t really a solution if it had as we needed layer 7 and layer 7 transparency.

Ekkehard also noticed my firmware was getting rather old (don’t fix what isn’t broken Smile) and suggest an upgrade (5.1-24 to 5.1.-74). So I did, reboot and tested some more settings. To make sure I didn’t miss anything I threw a network sniffer (WireShark) against the issue. And guess what?  As soon as I added a host name to the IIS web site bindings I didn’t even get any request from my client on that server anymore. So it was definitely being stopped at the Loadmaster. Without it request from a client came through perfectly.  That was not IIS doing as with a host name nothing came into the server. So why would the LoadMaster stop traffic to a real server? Because it’s down, that’s why, just like Ekkehard has indicated in one of his mails but we didn’t see it then.

Better check again and sure enough, the health service told me the real servers are down. Hey … that’s new. Did the previous firmware not show this, or just slower? I can’t say for sure. It’s either me being to impatient, a hiccup, the firmware or premature dementia Confused smile

Root Cause

So what happens? The default health check uses HTTP 1.0. You can customize it with a path like  /owa or such but in essence it uses the IP address of the real server and guess what. With a Host header name in IIS that isn’t allowed other wise it can’t figure out what website you want to go to if you’re using this feature to run multiple sites on the same IP address and port. So we need to check the health based on host name. Can the LoadMaster do that for us? Yes it can!

The fix

You need to enable HTTP 1.1 and fill out the host name you want to use for health checking.  In our case that’s ntrip.surveyor.lab. That’s all there’s to it. Easy as can be if you know. And Ekkehard knew he indicated to this in his quoted mail above.

HTTP1 1host

 

Lessons Learned

So how did I not know this? Isn’t this documented? Sure enough on page  56 of the LoadMaster manual it says the following:

7  HTTP  The LoadMaster opens a TCP connection to the Real Server on the Service port (port80). The LoadMaster sends a HTTP/1.0 HEAD request the server, requesting the page ―/‖.  If the server sends a HTTP response with a status code of 2 (200-299, 301, 302, 401) the LoadMaster closes the connection and marks the server as active.  If the server fails to respond within the configured response time for the configured number of times or if it responds with a different status code, it is assumed dead.  HTTP 1.0 and 1.1 support available, using HTTP 1.1 allows you to check host header enabled web servers.

Typical, you read the exact line of information you need AND understand it after having figured it out. Now linking that information (yes we always read all manuals completely Embarrassed smile) to the situation at hand isn’t always that fast a process but I got there in the end with some help from KEMP Technologies.

One hint is perhaps to mention this is in the handy tips that pop up when you hover over a setting in the LoadMaster console. I rely on this a lot and a mention of “HTTP 1.1 allows you to check host header enabled web servers” might have helped me out. But it’s not there. A very poor excuse I know … Embarrassed smile

image

Host Header Names & HTTP redirection

After having fix this issue I proceeded to configure HTTP redirect in IIS 7.5. For this is used two sites. One was just a fake site tied to the www.surveyors.lab hostname in IIS on port 80.

image

For this site I created a HTTP redirect to www.bussines.lab/surveyors/services. This works just fine as long as you don’t forget the http:// in the redirect URL.

image

So it has to be http://www.bussines.lab/surveyors/services or you’ll get a funky loop effect looking like this:

http://www.surveyors.lab/www.bussines.lab/surveyors/services/www.bussines.lab/surveyors/services/www.bussines.lab/surveyors/services

Firefox will tell you you have a loop that will never end but Internet Explorer doesn’t, it just fails. You do get that URL as a pointer to the cause of the issue. That is if you can relate it to that.

The other was the real site  and was configured with following bindings and without redirection.

image

Don’t forget to do this on all real servers in the farm! The next thing I need to find out is how to health check two host names in the LoadMaster as I have two websites with the same IP address, port but different host names.