Analyzing 0x9E BSOD–Part I

Now that the World Cup is over… By the way, congratulations to all the Germans!!

Back to debugging and blogging …

One very popular bugcheck I face is 0x9E (USER_MODE_HEALTH_MONITOR), and its root cause is often misinterpreted. Basically, it means that a user mode component failed to satisfy a health check. It is common on servers running the Windows Failover Clustering feature, as described in the following blog post: Why is my Failover Clustering node blue screening with a Stop 0x9E?

But my question is: can you collect information from a kernel memory dump to help you troubleshoot what caused the issue? The answer is most likely YES!

I am going to show some of the kernel dumps with 0x9E I have analyzed lately.

First you need to understand the anatomy of this bugcheck:

6: kd> !analyze -show 0x9E
USER_MODE_HEALTH_MONITOR (9e)
One or more critical user mode components failed to satisfy a health check.
Hardware mechanisms such as watchdog timers can detect that basic kernel
services are not executing. However, resource starvation issues, including
memory leaks, lock contention, and scheduling priority misconfiguration,
may block critical user mode components without blocking DPCs or
draining the nonpaged pool.
Kernel components can extend watchdog timer functionality to user mode
by periodically monitoring critical applications. This bugcheck indicates
that a user mode health check failed in a manner such that graceful
shutdown is unlikely to succeed. It restores critical services by
rebooting and/or allowing application failover to other servers.
Arguments:
Arg1: 0000000000000000, Process that failed to satisfy a health check within the
configured timeout
Arg2: 0000000000000000, Health monitoring timeout (seconds)
Arg3: 0000000000000000
Arg4: 0000000000000000
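
Note that !analyze -show only prints the generic description, which is why every argument above is zero. In a real dump, !analyze -v prints the same block with actual values. As a minimal sketch, assuming you have saved that output to text, the arguments could be pulled out like this. The sample values below are hypothetical and not taken from a real dump, and parse_bugcheck_args is just an illustrative helper name.

```python
import re

# Matches lines such as "Arg2: 00000000000004b0, Health monitoring timeout (seconds)".
ARG_LINE = re.compile(r"^Arg([1-4]):\s*([0-9a-fA-F`]+)", re.MULTILINE)


def parse_bugcheck_args(analyze_output):
    """Return {1: int, 2: int, ...} built from the 'ArgN:' lines of saved !analyze output."""
    args = {}
    for number, value in ARG_LINE.findall(analyze_output):
        # WinDbg sometimes prints 64-bit values with a backtick separator; drop it.
        args[int(number)] = int(value.replace("`", ""), 16)
    return args


# Hypothetical values for illustration only, NOT from a real dump.
sample = """\
Arg1: fffffa8012345678, Process that failed to satisfy a health check within the
configured timeout
Arg2: 00000000000004b0, Health monitoring timeout (seconds)
Arg3: 0000000000000000
Arg4: 0000000000000000
"""

args = parse_bugcheck_args(sample)
print(f"process value: {args[1]:#x}")
print(f"timeout: {args[2]} seconds")
```

The debugger prints the arguments in hex, so the timeout has to be converted before you read it as seconds. If Arg1 is a process object address in your dump (an assumption worth checking), it can be handed to !process to see which process missed the health check.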

Basically, something is not right and the system needs to recover as soon as possible, so it crashes the box on purpose. Lately we are seeing a lot of this approach from Exchange (0xEF and 0xF4) and from the cluster service. The assumption is that several other nodes are available to pick up the load, and that crashing is better than hanging until the problem goes away or the admin restarts the box.

So we are dealing with a kernel hang dump and should treat it as such, with the added bonus that the bugcheck tells us which process failed the health check and what the timeout was. It would definitely be awesome to have the user mode memory of that process, but a full memory dump can consume a lot of disk space, and it is unlikely to be the first type of dump you receive.

As far as hang dumps are concerned, I always use these extensions:

!vm, !ready, !locks, !running -it, !dpcs, !exqueue /f

By using all of those extensions you collect basic information about the state of the server at the moment of the crash.
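
If you receive many of these dumps, the same pass can be scripted. The following is a rough sketch, assuming kd.exe from the Debugging Tools for Windows is on the PATH and that the extensions behave when chained with semicolons. The helper names and the file paths are hypothetical. It relies on the standard debugger switches -z (open a dump file), -logo (write a log file) and -c (run commands at startup).

```python
import subprocess

# Extensions named in this post, in the order they are listed.
HANG_TRIAGE_COMMANDS = ["!vm", "!ready", "!locks", "!running -it", "!dpcs", "!exqueue /f"]


def build_kd_argv(dump_path, log_path, debugger="kd.exe"):
    """Build a command line that opens a dump, runs the triage commands, and quits."""
    # The trailing "q" ends the debugger session once the last extension has run.
    commands = "; ".join(HANG_TRIAGE_COMMANDS + ["q"])
    return [debugger, "-z", dump_path, "-logo", log_path, "-c", commands]


def run_triage(dump_path, log_path):
    """Launch the debugger; only works on a machine with the debugging tools installed."""
    subprocess.run(build_kd_argv(dump_path, log_path), check=True)


# Hypothetical paths, shown only to illustrate the resulting command line.
argv = build_kd_argv(r"C:\dumps\MEMORY.DMP", r"C:\dumps\triage.log")
print(" ".join(argv))
```

The log file then holds the output of all six extensions in one place, which makes it easier to compare several dumps side by side.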

In some dumps we see right away that there is something blocking.

In one of my analyses, !exqueue pointed to several PENDING I/O work items due to system worker thread exhaustion:

 

6: kd> !exqueue /f
**** Critical WorkQueue ( Threads: 26/512, Concurrency: 0/12 )
THREAD fffffa8024c5b360  Cid 0004.001c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT

*

*

*

PENDING: IoWorkItem (fffffa80538a3730) Routine usbccgp!UsbcWorkerFunction (fffff8800512f288) IoObject (fffffa803041c060) Context (fffffa80309c24f0)
PENDING: ExWorkItem (fffff80006678200) Routine nt!ObpProcessRemoveObjectQueue (fffff80006779398) Parameter (0000000000000000)
PENDING: ExWorkItem (fffffa80553ebdc0) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa80553ebdb0)
PENDING: IoWorkItem (fffffa8054450950) Routine CLFS!CClfsContainer::WorkRoutine (fffff88000d3adc0) IoObject (fffffa8052a09780) Context (fffffa8025202e80)
PENDING: ExWorkItem (fffff8800186d050) Routine Ntfs!NtfsFspClose (fffff8800189c1cc) Parameter (0000000000000000)
PENDING: ExWorkItem (fffffa802fe4e380) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa802fe4e370)
PENDING: IoWorkItem (fffffa8055a62e60) Routine msiscsi!iSpPrematureSessionTerminationWorker (fffff88004f0125c) IoObject (fffffa80252df3b0) Context (fffffa805574c390)
PENDING: ExWorkItem (fffffa8054400a80) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa8054400a70)
PENDING: ExWorkItem (fffffa8055d290b0) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa8055d290a0)
PENDING: ExWorkItem (fffffa805467e690) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa805467e680)
PENDING: IoWorkItem (fffffa805377f1f0) Routine partmgr!PmNotificationWorkItem (fffff8800127f830) IoObject (fffffa80537acb90) Context (0000000000000000)
PENDING: ExWorkItem (fffffa8033c0e170) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa8033c0e160)
PENDING: ExWorkItem (fffffa8032534ef0) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa8032534ee0)
PENDING: ExWorkItem (fffffa8042bdf240) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa8042bdf230)
PENDING: ExWorkItem (fffffa80378479c0) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa80378479b0)
PENDING: ExWorkItem (fffff8800186d1d8) Routine Ntfs!NtfsCheckUsnTimeOut (fffff8800188b02c) Parameter (0000000000000000)
PENDING: IoWorkItem (fffffa8053642d00) Routine partmgr!PmNotificationWorkItem (fffff8800127f830) IoObject (fffffa80537a4040) Context (0000000000000000)
PENDING: IoWorkItem (fffffa805367f4b0) Routine partmgr!PmNotificationWorkItem (fffff8800127f830) IoObject (fffffa80537a79e0) Context (0000000000000000)
PENDING: ExWorkItem (fffffa8053f866d0) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa8053f866c0)
PENDING: ExWorkItem (fffffa8033fef370) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa8033fef360)

At this point we know that the server is in a hung state. If you follow this clue and work out why there are so many pending I/Os, you will also get to the bottom of the 0x9E bugcheck.

Since this post was getting too long, I will follow up later with this same dump and point out where we had the issue.
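
When the !exqueue output is long, a quick tally of the PENDING lines shows which routines and drivers are waiting for a worker thread. This is a minimal sketch, assuming the output was saved as text in the format shown above. The function name tally_pending is my own, and the sample is a four-line subset of the listing.

```python
import re
from collections import Counter

# Captures the routine name (module!function) from each PENDING line.
PENDING_LINE = re.compile(
    r"^PENDING:\s+\w+\s+\([0-9a-fA-F`]+\)\s+Routine\s+(\S+)", re.MULTILINE
)


def tally_pending(exqueue_output):
    """Count pending work items per routine and per module."""
    by_routine = Counter(PENDING_LINE.findall(exqueue_output))
    by_module = Counter()
    for routine, count in by_routine.items():
        # The module is the part of the symbol name before the "!".
        by_module[routine.split("!", 1)[0]] += count
    return by_routine, by_module


sample = """\
PENDING: ExWorkItem (fffffa80553ebdc0) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa80553ebdb0)
PENDING: IoWorkItem (fffffa8054450950) Routine CLFS!CClfsContainer::WorkRoutine (fffff88000d3adc0) IoObject (fffffa8052a09780) Context (fffffa8025202e80)
PENDING: ExWorkItem (fffffa802fe4e380) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa802fe4e370)
PENDING: ExWorkItem (fffffa8054400a80) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa8054400a70)
"""

by_routine, by_module = tally_pending(sample)
for module, count in by_module.most_common():
    print(f"{count:3d}  {module}")
```

Counting the 20 PENDING lines shown above by hand gives 11 NDIS routines, 3 partmgr, 2 Ntfs, and one each for usbccgp, nt, CLFS and msiscsi. That is only a count, not a diagnosis, but it tells you where to start looking.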

Thanks and keep on debugging,

Alessandro
