Fixing Two Small DELL Compellent Hardware Hiccups


Here’s two little tips to solve some small hardware issues you might run into with a Compellent SAN. But first, you’re never on your own with CoPilot support. They are just one phone call away so I suggest if you see these to minor issues you give them a call. I speak from experience that CoPilot rocks. They are really good and go the extra mile. Best storage support I have ever experienced.

Notes

  • Always notify CoPilot as they will see the alerts come in and will contact you for sure Smile. Afterwards they’ll almost certainly will do a quick health check for you. But even better during the entire process they keep an eye on things to make sure you SAN is doing just fine. And if you feel you’d like them to tackle this, they will send out an engineer I’m sure.
  • Note that we’re talking about the SC40 controllers & disk bays here. The newer genuine DELL hardware is better than the super micro ones.

The audible alert without any issues what so ever

We kept getting an audible alert after we had long solved any issues on one of the SANs. The system had been checked a couple of times and everything was in perfect working order. Except for that audible alarm that just didn’t want to quit. A low priority issue I know but every time we walk into the data center we were going “oh oh” for a false alert. That’s not the kind of conditioning you want. Alerts are only to be made when needed and than they do need to be acted upon!

Working on this with CoPilot support we got rid of it by reseating the upper I/O module. You can do this on the fly – without pulling SAS-cables out or so, they are redundant, as long as you do it one by one and the cabling is done right (they can verify that remotely for you if needed).

image

But we got lucky after the first one. After the “Swap Clear” was requested  every warning condition was cleared and we got rid of the audible alert beep!  Copilot was on the line with us and made sure all paths are up and running so no bad things could happen. That’s what you have a copilot for.

Front panel display dimming out on a Compellent Disk Bay

We have multiple Compellent SANs and on one of those we had a disk bay with a info panel that didn’t light up anymore. A silly issue but an annoying one as this one also show you the disk bay ID.

image

Do we really replace the disk bay to solve this one? As that light had come on and of a couple of time it could just be a bad contact so my colleague decided to take a look. First  he removed the protective cover and then, using some short & curved screw drivers, he took of the body part. The red arrow indicates the little latch that holds the small ribbon cable in place.

image

That was standing right open. After locking that down the info appeared again on the panel. The covers was screwed on again and voila. Solved.

TechNet Top Support Solutions From Microsoft Support Blog


As this year comes to an end I’d like to draw your attention to Microsoft’s new Top Support Solutions blog on TechNet. It was created this as part of their continuous efforts to keep the various  technical communities informed about the most relevant answers to the top questions or issues experienced with their products. They identify these top issues by analyzing the question in their forums and their other support channels.

image

So if you need to find answers for your self or your customers go take a look at the "Top Solutions Content" blog. Changes are you’ll find valuable information about the Microsoft top support solutions for several of their popular products in Server and Tools. It might save you and your clients or manager a lot of time, effort and money. It’s also a great resource to make your colleagues, community, user group or clients aware of.

DELL Server DRAC Card Soft Reset With Racadmin


Sometimes a DRAC goes BOINK

Sometimes a DRAC (Dell Remote Access Card) can give you issues. Sometimes it’s some lingering process or another hiccup that causes this. You can try a reboot but that doesn’t always fix the issue. You can go into the BIOS and cancel any running System Services. A “confused” DRAC card can also be fixed by shutting down the server and cutting power for 5 to 10 minutes. That’s good to know as a last resort but not very feasible a lot of times, bar a maintenance window when you’re on premise.

You can also try to do a local or a remote reset of the DRAC card via OpenManage  (OMSA), racadmin. See RACADM Command Line Interface for DRAC for more information on how and when to use this tool. The racadmin can be used for a lot of remote configuration and administration and one of those is a “soft reset” or basically a powercycle, aka reboot, of the drac card itself. Don’t worry your server stays up Smile.

Local: racadmin racreset soft

Remote: racadm -r <ip address> -u <username> -p <password> racreset soft

Real life example

I was doing routine maintenance on 4 Hyper-V clusters and as part of that DUPs (Dell update packages) were being deployed to upgrade some firmware. This can be automated nicely via Cluster Aware Updating and the logging option will help you pin point the issue. See http://workinghardinit.wordpress.com/2013/01/09/logging-cluster-aware-updating-hotfix-plug-in-installations-to-a-file-share/ for more information on this.

Just like we found that the DRAC upgrade was not succeeding on two nodes.

One it was due to the DUP not being able to access the Virtual USB Device

Software application name: iDRAC6
   Package version: 1.95
   Installed version: 1.92

Executing update…

Device does not impact TPM measurements.

Device: iDRAC6, Application: iDRAC6
  Failed to access Virtual USB Device

==================> Update Result <==================

Update was not applied

================================================

Exit code = 1 (Failure)

and the other was because there was some other lingering DRAC process.

 iDRAC is currently unable to process this request because of another task.
  Please attempt one or more of the following steps to cancel the pending iDRAC task:
  1) Wait 30 minutes and retry your request.
  2) Reboot the system; Press F10; select ‘Exit and Reboot’ from Unified Server Configurator, and retry your request.
  3) Reboot the system; Press Ctrl-E; select ‘System Services’. Then change ‘Cancel System Services’ to YES, which will close the pending task;
      Then press Enter at the warning message. Press ESC twice and select ‘Save Changes and Exit’ and retry your request.

==================> Update Result<==================

Update was not applied

================================================
Exit code = 1 (Failure)

They give some nice suggestions but the racreset is another nice one to have I your toolkit. It’s fast and effective.

Run racadmin racreset soft

image

Wait for a couple of minutes and then run the DUP or the items in SUU that failed. With some luck this will succeed now.

image

Great Hardware Support Equals Fast Windows 2012 R2 Implementation


I love it when a plan comes together

We adopted Windows 2012 when right after it went RTM in august 2012. Today we’re are already running Windows 2012 R2 and ready to step up the pace. If you are a VAR/ISV that does not have fast & good support for Windows Server 2012 R2 consider this your notice. You can’t lead from behind. Get your act together and take an example from Altaro. Small, sure, but good & fast. How do we get our act together so fast? Fast? Yes, but it does take time and effort.

As it turns out, we’re pretty well of with the DELL hardware stack. The generation 11 and 12 servers are supported by DELL and on the Windows Server Catalog for Windows Server 2012 R2.

image

For more information on Dell Server inbox driver support see: Windows Server 2012 R2 RTM Inbox Driver Support on Dell PowerEdge Servers. By the way I can testify that we’ve run Windows Sever 2012 R2 successfully on 9th Generation hardware (PowerEdge 1950/2950).

We’ve been running tests since Windows 2012 R2 Preview on R710/R720 and it has been a blast. We’ve kept them up to date with the latest firmware & drives via SUU. And for our Intel X520 and Mellanox ConnectX-3 we’ve had rapid support as well.

So what more could you want? Well support from your storage array vendor I would think. I’m happy to report that Storage Center 6.4 has been out since October 8th and it supports Windows Server 2012 R2. Dell Compellent Adds MLC SSD Tier – Bests 15K HDDs on Price and Performance. Mind you on a lazy Sunday afternoon 2 quick e-mails to CoPilot got me the answer that Storage Center 6.3.10 also supports Windows Server 2012 R2.  Sweet!

And that’s not just DELL, the Dell Compellent Storage Center 6.4 is fully Windows Server 2012 R2 logo certified! That’s what you want to see from you vendor. Fast & excellent support.

image

Here’s the entire DELL hardware line up with Windows Server 2012 R2 support. Happy upgrading & implementation! If you have Software Assurance you’re set to reap the benefits of that investment today!

To my all employers / clients, you see now, told you so. Now, I have a thing in common with Col. Hannibal Smith, I love it when a plan comes together Open-mouthed smile.

image

I know some of you think that all the testing, breaking, wrecking of Preview bits, RTM & GA versions we do looks like chaos. Especially when you visually add the test server & switch configurations. But that’s what it looks like to YOU. To the initiated this is well executed plan, dropping all assumptions, to establish what works & will hold up. The result is that we’re ready today and by extension, so are you.

First Windows Server 2012 Cluster/Hyper-V related Patches


With November 2012 Patch Tuesday having come and gone, the first hotfixes (it’s a cumulative update) related to Windows Server 2012 are available. These are relevant to both Hyper-V & Failover clustering (Scale Out File Server)  There is also an older hotfix that has been brought to our attention that related to certain versions Windows Server 2008/R2 domain controllers,which is also important for Windows Server 2012 Clustering. None of these are urgent/critical and only apply in specific circumstances but it’s good to keep up with these and protect your environment..

Windows 8 and Windows Server 2012 cumulative update: November 2012

http://support.microsoft.com/kb/2770917: A collection of small changes – for HA VMs (Hyper-V on Cluster) there are three minor CSV file system fixes in this Hotfix : Improves clustered server performance and reliability in Hyper-V and Scale-Out File Server scenarios. Improves SMB service and client reliability under certain stress conditions.

Error code when the kpasswd protocol fails after you perform an authoritative restore: “KDC_ERROR_S_PRINCIPAL_UNKNOWN”

http://support.microsoft.com/kb/976424: Install on every domain controller running Windows Server 2008 Service Pack 2  or Windows Server 2008 R2 in order to add a Windows Server 2012 failover cluster. This is included in Windows Server 2008 R2 Service Pack 1. So just see if you need this fix in your environment or not.

I’m happy to see Microsoft acting fast on these issues,, even if not critical, to serve & protect their customers deployments.

KB 2636573: Guest Crashes with Win2008R2 RTM/SP1 STOP 0xD1 in storvsc!StorChannelVmbusCallback During Live Migration


The BSOD

I helped hunt down this bug and tested the private fix. Some months ago, during the summer of 2011, I was putting some new Hyper-V clusters under stress tests. You know, letting it work very hard for a longer period of time to see if anything falls off or goes “boink". It all looked pretty robust and and after some tweaking also very fast. Just when you’re about to declare “we’re all set” here you see a BSOD on one of the guests that’s being live migrated happily announcing: “DRIVER_IRQL_NOT_LESS_OR_EQUALSTOP: 0x000000D1 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)”

image

Now that doesn’t make ME very happy however. So I investigate to see if there are any more VMs dropping dead during live migration but we don’t see any. Known issues like out of date versions of the integration tools or the like are not in play nor are any other possible suspects.

We throw the MEMORY.DMP file in the debugger and we come up with the following culprit:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)

The driver probably at fault is storvsc.sys

Probably caused by : storvsc.sys ( storvsc!StorChannelVmbusCallback+2b8 )

Hmmmmmmm, we start searching the internet but we don’t find much. We also throw it on to Twitter to see if the community comes up with something. Meanwhile we keep looking and find this little blog post by a Microsoft support engineer Rob Scheepens:

http://blogs.technet.com/b/dip/archive/2011/10/21/win2008r2-rtm-stop-0xd1-in-storvsc-storchannelvmbuscallback-0x2b5.aspx

We pinged Rob and opened a case with MS support. That evening Hans Vredevoort (www.hyper-v.nu), who saw my tweet, mails me with the details of a fellow MVP in the USA having this same issue. We get in contact an via both Microsoft & the Hyper-V community we start hunting the cause of this bug. The progress on this issue can be read at the Microsoft blog above. You’ll notice that the fix is in the works now.

Hunting down the STOP error

What did we establish:

  • It only happens occasionally with a live migration and it rather ad hoc, not every time, not after X amount of live migrations or X amount of up time.
  • It seems sometimes to happens only with guests running dynamically expanding VHDs attached to ISCSI controllers in Hyper-V.  But that’s not really the case as I remember one being with  fixed VHD attached to an SCSI controller. In our case the VMs we could reproduce the issue with in a reasonable time were all SQL Server test and development guests running SQL Server 2008 where the dynamically expanding disks are used as “poor man’s thin provisioning”.
  • I have not heard of this on Windows 2008 hosts, only R2, but I have not tested this.

So it’s reproducible but it takes intensive live migration activity. Meanwhile we received private instrumentation to install on both guests & hosts to collect “enriched” memory.dumps when a guest experiences a BSOD. With PowerShell we have continuous live migration running to reproduce the issue. The fact that can live migrate over 10Gbps does help Smile. Because you can get lucky but in reality needs many hundreds of live migration to reproduce it. On some machines many thousands. Not a joke but I total we did 8000 Live migrations to test the fix and we did about 12000 to reproduce the issue on several VMs with different configurations to send memory dumps to MSFT. So yes, you really need some PowerShell and having a 10Gbps Live Migration network also helps Winking smile.

All the collected MEMORY.DMP files form these live migration exercises were uploaded to Microsoft for analysis. That took a while, also because they had a boatload of live migrations to do and I don’t know if their test lab has 10 Gbps.

On Tuesday the 25th of October Microsoft contacts us with good news. They have root-caused the problem and a hotfix is in the works. You can download that here http://support.microsoft.com/kb/2636573

On Thursday 27th of October we get access to a private fix and after installing this one we’ve been running thousands of live migrations without  seeing the issue.

The public release of this hotfix is currently planned (HTP11-12) under KB2636573.

The details for the curious

Root Cause?

The root cause can be summarized as follows: “StorVSP was modifying guest memory while the VM’s virtual motherboard was being powered off.” Doing this storvsc access a NULL pointer in a memory buffer that is already freed up. The result of this is a BSOD or STOP error in the virtual machine.

Only SCSI attached VHDs

OK but why do we only see this with SCSI attach VHDs? Now the issue happens during power down of the virtual machines’ mother board because there is a disk enumeration during the shut down phase. And this enumeration only happens with SCSI disks.

Right! So the more VHDs we had attached to SCSI controllers in a virtual machine the higher the likelihood of this happening.

Why so much more likely with dynamically expanding VHDs

But still we saw this exponentially more with dynamically expanding disks. Why is that? Well it’s not that dynamically expanding disks trigger disk enumeration more than fixed disks.  However it seems that any disk expansion, which causes write delays, can lead to a timing issue that will cause the disk enumeration to hit the issue described above. So this significantly increases the risk that the STOP error will happen and it explains that the chance this will happen with fixed VHDs attached to SCSI controllers is significantly lower. This is sync with what we saw. The virtual machines with a lot of dynamic disks attach to SCSI controllers that had a lot of activity (and thus potential for expanding) is the ones where we could reproduce this the fasted.

Conclusion

It can take some time to hunt down certain bugs, especially the rare ones that only happen every now and then so occurrences are few and far between. But when you put in some effort Microsoft helps out and works on a fix. And no you don’t have to have the most expensive support contract for that to happen. As a  matter of fact this call was logged under a free support call with the TechNet Plus subscription. And as it was a bug, they return it as unused.

Using Host Names in IIS in Combination with a KEMP LoadMaster


At a client the change over of a web site from old servers to new ones lead to the investigation of an issue with the hardware load balancer. Since that web site is related to an existing surveyors solutions suite that already had a KEMP LoadMaster 2200 in use the figured we’d also use it for the web site and no longer use WNLB.

Now the original web site had multiple DNS entries and host header names defined in IIS (see Configure a Host Header for a Web Site (IIS 7)) . Host header names in IIS allow you to host multiple web sites on an IIS server using the same IP address and port. A small added security benefit is that surfing on IP address fails which means we marginally disrupt some script kiddies & get an extra security checkbox marked during an audit Winking smile.

In our example we needed:

Note: The real names have been changed as well as the reasons why as this has some business & historical justifications that don’t matter here.

ntrip.surveyor.lab needs to be handled by the load balanced web servers in the solution. The http://www.surveyor.lab needs to be redirected to another web server to keep the business happy. However for political reasons we have to keep the DNS record for http://www.surveyor.lab pointing to the load balanced servers, i.e. the load master VIP.

Now without host names IIS al worked fine until we wanted to use HTTP redirect. As the web site is the same IP address for both names we either redirected them both or none. To fix this we needed two sites in IIS. The real one hosting ntrip.surveyor.lab and a “fake” one hosting the http://www.surveyor.lab that we want to redirect. Well as both are hosted on the same IP address and port on the IIS server we need to use host names. But then the sites became unavailable.

When checking the LoadMaster configuration, the virtual service for the web servers seemed well.

image

Is this a limitation of hardware load balancing or this specific Loadmaster? Some searching on the internet made it look like I was about the only on on the planet dealing with this issue so no help there.

Kemp Support Rocks

I already knew this but this experience reaffirms it. KEMP Technologies really does care about their customers and are very fast & responsive. I threw a quick question on twitter to @KempTech on Twitter and they responded very fast with some pointers. After that I replied with some more details, they offered to take it on via other means as twitter has it limits. OK, no problems. The next morning I got an e-mail from one of their engineers (Ekkehard) with more information and a request for more input from our side. I quickly made a VISIO diagram of the current and the desired situation. Based on this he let me know this should work.

image

He asked for a copy of the configuration and already pointed to the solution:

And what exactly happens – does the RS turn “red” in the “View/Modify Services” view? That might be caused by the health check settings…
(Remember that a 302 is considered NOT ok, so you had to enter the proper check URL and or / HTTP1.1 hostname)

But at that moment I did not realize this yet. I saw no error or the real server turning red indicating it was down. So we went through the configuration and decided to test without forcing layer 7 to see what happened. This didn’t make a difference and it wasn’t really a solution if it had as we needed layer 7 and layer 7 transparency.

Ekkehard also noticed my firmware was getting rather old (don’t fix what isn’t broken Smile) and suggest an upgrade (5.1-24 to 5.1.-74). So I did, reboot and tested some more settings. To make sure I didn’t miss anything I threw a network sniffer (WireShark) against the issue. And guess what?  As soon as I added a host name to the IIS web site bindings I didn’t even get any request from my client on that server anymore. So it was definitely being stopped at the Loadmaster. Without it request from a client came through perfectly.  That was not IIS doing as with a host name nothing came into the server. So why would the LoadMaster stop traffic to a real server? Because it’s down, that’s why, just like Ekkehard has indicated in one of his mails but we didn’t see it then.

Better check again and sure enough, the health service told me the real servers are down. Hey … that’s new. Did the previous firmware not show this, or just slower? I can’t say for sure. It’s either me being to impatient, a hiccup, the firmware or premature dementia Confused smile

Root Cause

So what happens? The default health check uses HTTP 1.0. You can customize it with a path like  /owa or such but in essence it uses the IP address of the real server and guess what. With a Host header name in IIS that isn’t allowed other wise it can’t figure out what website you want to go to if you’re using this feature to run multiple sites on the same IP address and port. So we need to check the health based on host name. Can the LoadMaster do that for us? Yes it can!

The fix

You need to enable HTTP 1.1 and fill out the host name you want to use for health checking.  In our case that’s ntrip.surveyor.lab. That’s all there’s to it. Easy as can be if you know. And Ekkehard knew he indicated to this in his quoted mail above.

HTTP1 1host

 

Lessons Learned

So how did I not know this? Isn’t this documented? Sure enough on page  56 of the LoadMaster manual it says the following:

7  HTTP  The LoadMaster opens a TCP connection to the Real Server on the Service port (port80). The LoadMaster sends a HTTP/1.0 HEAD request the server, requesting the page ―/‖.  If the server sends a HTTP response with a status code of 2 (200-299, 301, 302, 401) the LoadMaster closes the connection and marks the server as active.  If the server fails to respond within the configured response time for the configured number of times or if it responds with a different status code, it is assumed dead.  HTTP 1.0 and 1.1 support available, using HTTP 1.1 allows you to check host header enabled web servers.

Typical, you read the exact line of information you need AND understand it after having figured it out. Now linking that information (yes we always read all manuals completely Embarrassed smile) to the situation at hand isn’t always that fast a process but I got there in the end with some help from KEMP Technologies.

One hint is perhaps to mention this is in the handy tips that pop up when you hover over a setting in the LoadMaster console. I rely on this a lot and a mention of “HTTP 1.1 allows you to check host header enabled web servers” might have helped me out. But it’s not there. A very poor excuse I know … Embarrassed smile

image

Host Header Names & HTTP redirection

After having fix this issue I proceeded to configure HTTP redirect in IIS 7.5. For this is used two sites. One was just a fake site tied to the www.surveyors.lab hostname in IIS on port 80.

image

For this site I created a HTTP redirect to www.bussines.lab/surveyors/services. This works just fine as long as you don’t forget the http:// in the redirect URL.

image

So it has to be http://www.bussines.lab/surveyors/services or you’ll get a funky loop effect looking like this:

http://www.surveyors.lab/www.bussines.lab/surveyors/services/www.bussines.lab/surveyors/services/www.bussines.lab/surveyors/services

Firefox will tell you you have a loop that will never end but Internet Explorer doesn’t, it just fails. You do get that URL as a pointer to the cause of the issue. That is if you can relate it to that.

The other was the real site  and was configured with following bindings and without redirection.

image

Don’t forget to do this on all real servers in the farm! The next thing I need to find out is how to health check two host names in the LoadMaster as I have two websites with the same IP address, port but different host names.

WDeployConfigWriter Account Issues – Trouble Shooting Web Deploy 2.0 With Lessons Learned


Here’s a small recap of a trouble shooting incident we dealt with recently and that served as a coaching exercise for trouble shooting. It seems we have Web Deploy 2.0 in use for in house deployments of web apps. It seems to be a valued asset as well. At least valuable enough to land a help request on the desk of one of the young, eager, smart and upward mobile IT Professionals when it stops working and they need some assistance.

Hello ICT,

To deploy our we websites remotely we use web deployment service (see http://technet.microsoft.com/en-us/library/dd569087(WS.10).aspx for more info).

This service runs under the network service account by default. Deploying fails now. In the security log on the server I find  "The specified account’s password has expired".

Does anyone know the password of this account?

Best regards,

Hardworking Web Guy In Trouble

Basically we have enough information to know something went wrong and that they need it to work again. But that’s about it. Password for the network service account expired? They also included an error log and reading it learns us something. The lesson to be learned here: investigate yourself, read the log, interpret them. Don’t let patients give you a diagnosis. Their input is critical, but you need to draw your own conclusions.

An account failed to log on.

Subject:
                Security ID:                           LOCAL SERVICE
                Account Name:                    LOCAL SERVICE
                Account Domain:                NT AUTHORITY
                Logon ID:                              0x3e5

Logon Type:                                         8

Account For Which Logon Failed:
                Security ID:                           NULL SID
                Account Name:                    WDeployConfigWriter
                Account Domain:                lab.test

Failure Information:
                Failure Reason:                     The specified account’s password has expired.
                Status:                                0xc000006e
                Sub Status:                            0xc0000071

Process Information:
Caller Process ID: 0x1f44
Caller Process Name: C:\Windows\System32\inetsrv\WMSvc.exe

What did we just read and learn? No it’s not the Network Service Account whose password has expired. This doesn’t happen/doesn’t work that way … so that was our first indication that this isn’t quite right in the support ticket. As you can see the real problem account mentioned in the error log:  WDeployConfigWriter. That account is indeed a local account.

WdeployAccounst

 

Cool, now we check what service runs under that account by looking in the services panel …. none! The easy way to check is to sort on the "Log On As" column. You won’t find WDeployConfigWriter. Right … , what else do we learn from the Services panel. Well we do have service called Web Deployment Agent Service running under the local Network Service account. We can stop and start it just fine so there is nothing wrong with the Network Service account , which is as expected and this service is not our culprit.  What we also learn that this is Web Deploy 2.0.

Service

 

As the Web Deployment Agent Service has nothing to do with the problem at hand. So where is that WDeployConfigWriter being used and what is it status? Let’s take a look.

WdeployAccountsettings

 

Hey, how could this account have expired? This is impossible. Unless they changed it while trying to fix the error. We check this with  quick phone call and yes, they did exactly that.  The good thing is that this web guy is professional and tells us what they did. Some people think this might get them into trouble and won’t do that. It doesn’t change anything, things are what they are, but it does make communication less easy when you discover people act that way… So the lessons here are to double check & verify what happened if at all possible. Originally the settings were:

WDeployAccountOriginalSettings 

 

They changed them after they ran into issues hop that checking those options might fix it. Well no, expired is expired and you can’t fix it like that. You need indeed to correct the settings if you don’t want the password to expire and even prevent the user from changing it but you also need to set a new password when it has already expired. After doing so we contact the hardworking web guy in trouble to let ‘m test and predict a new error: whatever runs under that Account will now fail to run due to an incorrect password. And guess what? “Unknown user name or bad password” in the security log.

Log Name:      Security
Source:        Microsoft-Windows-Security-Auditing
Date:          24/06/2011 10:30:39
Event ID:      4625
Task Category: Logon
Level:         Information
Keywords:      Audit Failure
User:          N/A
Computer:     server1.lab.test
Description:
An account failed to log on.

Subject:
    Security ID:        LOCAL SERVICE
    Account Name:        LOCAL SERVICE
    Account Domain:        NT AUTHORITY
    Logon ID:        0x3e5

Logon Type:            8

Account For Which Logon Failed:
    Security ID:        NULL SID
    Account Name:        WDeployConfigWriter
    Account Domain:        lab.test

Failure Information:
    Failure Reason:        Unknown user name or bad password.
    Status:            0xc000006d
    Sub Status:        0xc000006a

Process Information:
    Caller Process ID:    0x1f44
    Caller Process Name:    C:\Windows\System32\inetsrv\WMSvc.exe

 

The user wants to repair install or uninstall and reinstall the application to “get a quick fix” but we do not to give in and keep trouble shooting. It’s better to learn what the cause really is and how to fix it instead of relying on wishful reinstalling.

So where is the thing that runs under that account. We start a quick search in the registry and on the file system for the  account name just in case it’s configured in the registry or a configuration file and let it run while we keep investigating.  We also send  a tweet in to the universe, as perhaps some one out there  knows this and can help out. We search the internet for Web Deploy 2.0 and WDeployConfigWriter. This results in very few hits, hmmm, interesting  … . One of them is http://blogs.iis.net/msdeploy/archive/2011/04/05/announcing-web-deploy-2-0-refresh.aspx

Where we learn a few things, the most important is the one line from that blog post I formatted in bold and red from the blog snippet right below. I also enlarged the picture from the blog post to make it readable where you can find in IIS  what we learned here:

Notice that Web Deploy setup created two new local user accounts:

- WDeployConfigWriter, which has Write permissions to the IIS server’s applicationHost.config. This is used by delegation rules for createApp, appPoolNetFx and appPoolPipelineMode.

I’ve included the entire block of text from where this was taken below.

1. Easier setup for non-administrator deployments on IIS7

One of the common requests from our users was to make it easier to setup Web Deploy so non-administrators can publish to their sites. Typically, you will need to do this if you are running a shared hosting environment or if you are administering a build machine and you do not want users to have admin access.

If you launch the Web Deploy installer and choose “Custom”, you will notice a new option, “Configure for Non-administrator Deployments”:

clip_image001

If you choose this option, Web Deploy will automatically create Management Service Delegation rules for the following providers, as well as user the accounts needed for providers like createApp and recycleApp that need elevated privileges.

These are the rules you will have in the Management Service Delegation UI in IIS Manager after you install this component:

Notice that Web Deploy setup created two new local user accounts:

- WDeployConfigWriter, which has Write permissions to the IIS server’s applicationHost.config. This is used by delegation rules for createApp, appPoolNetFx and appPoolPipelineMode.

- WDeployAdmin, which is an administrator. This is used by delegation rules for recycleApp.

If you prefer to create these rules by hand, uncheck the component in the installer. We also provide a PowerShell script for creating delegation rules (more on this later in the post) if you prefer that route.

Well armed with this information we go have a look at the Management Service Delegation:

ManagementServiceDelegation

 

Where we indeed find createApp, appPoolNetFx and appPoolPipelineMode:

ManagementServiceDelegationWebdeployconfig

 

So now we take a look a bit what we can configure here and  sure enough, by double clicking on them the Edit Rule form:

ManagementServiceDelegationWebdeployconfigSettings

 

So we click on Edit security credentials and are welcomed by this form:

ManagementServiceDelegationWebdeployconfigSettingsPW1

 

So we enter the account name and the new password we set before (remember to do this for both providers):

ManagementServiceDelegationWebdeployconfigSettingsPW2

 

Guess what, end user happy, things are working again. Jay! From service down report to helpdesk to fully operational again in less than an hour with a technology new to the service desk. Well done young, eager, smart and upward mobile IT Pro Winking smile with lessons learned.

How did this happen and did they end up with this funky configuration (expiring password of an account that no one knows where it is used for and where configured)? Aha, operational control => know the configuration of what you use and know why it is configured that way and where it’s configured. Is it a mistake/assumption in the installer that the accounts WDeployConfigWriter and WDeployAdmin have their passwords set to expired and can be changed by the user or did somebody mess with them after the install? Well I did the test by setting it up on a test server and found that they are indeed installed with their passwords set to expire and that the password can be changed by the user. It assumes that the person doing the install knows and realizes the implications. I’m not saying either setting is wrong but you should know why, when and where. There is no documentation on this as far as we could find right now and perhaps the installer should mention the benefits/risks of both types of configuration and ask what to choose. This, together with better documentation, could help prevent this issue. As always, no guarantees given Winking smile 

Overall lesson: don’t assume things, trust but verify …