Analyzing 0x9E BSOD–Part III

Hello ,

It has been a long time coming to finish up this third part of the blog. Well, today we had post at ntdebugging blog about 0x9E – http://blogs.msdn.com/b/ntdebugging/archive/2014/08/27/windows-troubleshooting-stop-9e-explained.aspx – and I could not let pass this chance to finish up the third part where we chase down IRP.

Just a good reminder that there are events that are undeniable about their origin. For all us folks in the trenches we have seen Disk events 153 and 129. Be sure to recognize them and understand the meaning.

Please review the following blogs:

Event 153 – http://blogs.msdn.com/b/ntdebugging/archive/2013/04/30/interpreting-event-153-errors.aspx

Event 129 – http://blogs.msdn.com/b/ntdebugging/archive/2011/05/06/understanding-storage-timeouts-and-event-129-errors.aspx

Now back to the 0x9E debugging . From the previous blog post we saw  System Worker Thread exhaustion from the !exqueue extension and threads with pending IO.

6: kd> !irp fffffa802fd05790
Irp is active with 12 stacks 7 is current (= 0xfffffa802fd05a10)
Mdl=fffffa8055b40bd0: No System Buffer: Thread fffffa8024c5b360:  Irp stack trace.
cmd  flg cl Device   File     Completion-Context
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
>[  4,34]  1c e0 fffffa80537a1790 00000000 fffff880012831b0-00000000 Success Error Cancel
\Driver\Disk    partmgr!PmReadWriteCompletion
Args: 00001000 00000000 bbd82000 00000000
[  4, 0]  1c e0 fffffa80537a2040 00000000 fffff88001293010-fffffa80537e0e20 Success Error Cancel
\Driver\partmgr    volmgr!VmpReadWriteCompletionRoutine
Args: 613ffaf49d 00000000 bbd82000 00000000
[  4, 0]   c e0 fffffa80537e0cd0 00000000 fffff880011c41b0-fffffa803283a5d0 Success Error Cancel
\Driver\volmgr    volsnap!VspWriteContextCompletionRoutine
Args: 00001000 00000000 613ffaf49b 00000000
[  4, 0]   c e1 fffffa8053852030 00000000 fffff88001823fe4-fffff8800252dde8 Success Error Cancel pending
\Driver\volsnap    Ntfs!NtfsMasterIrpSyncCompletionRoutine
Args: 00001000 00000000 bbd72000 00000000
[  4, 0]   0 e0 fffffa8053a55030 fffffa8053a45630 fffff88001420e40-fffffa8055548570 Success Error Cancel
\FileSystem\Ntfs    fltmgr!FltpPassThroughCompletion
Args: 00001000 00000000 03e92000 00000000
[  4, 0]   0  1 fffffa8053a42380 fffffa8053a45630 00000000-00000000    pending
           \FileSystem\FltMgr
Args: 00001000 00000000 03e92000 00000000

At this point we need to make sure we capture the original IRP from this IO packet in order to point out the part in the stack the IO is really pending.

To do that we follow the instructions listed in this blog – Where Did my Disk I/O Go?

Here it is an extract from the blog:

“Hello, Mr. Ninja back again. I recently discovered that although my team often tracks I/O from the file system through to the disk controller, we have never publicly documented the steps required to do this. This seems like a great opportunity for a blog because most of the structures are known, and they are even included in the public symbols.

When debugging a system that is hung, slow, or otherwise unresponsive you will likely encounter an IRP that has been sent from ntfs to the disk. Running !irp against such a request will show that the request has gone to disk.sys, but that is not really where the story ends.”

At the end we found associated IRP with this structure. I just pasting the result from my dump analysis. Be sure to review the blog and it shows all the steps .

6: kd> !irp 0xfffffa80`555ac430
Irp is active with 5 stacks 6 is current (= 0xfffffa80555ac668)
Mdl=fffffa8053b6b2f0: No System Buffer: Thread 00000000:  Irp is completed.  Pending has been returned
cmd  flg cl Device   File     Completion-Context
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  f, 0]   0  0 fffffa80305768b0 00000000 fffff88001259290-fffffa80548d5670
\Driver\iScsiPrt    mpio!MPIOPdoCompletion
—> LAST STACK in the iSCSIPRT
Args: 00000000 00000000 00000000 fffffa8030576a00
[  f, 0]   0  0 fffffa8053651060 00000000 fffff880014a8f8c-fffffa8054579010
\Driver\mpio    EmcpBase
Args: 00000000 00000000 00000000 fffffa80548d5670
[  f, 0]   0  0 fffffa8053704ca0 00000000 fffff88001468a00-fffffa805564fd00
\Driver\EmcpBase    CLASSPNP!TransferPktComplete
Args: 00000000 00000000 00000000 00000000

In this case we had several IO where the last part of the stack was the iSCSIPRT , meaning that the IO had actually been sent to the storage array and we are waiting for a response.

By checking the event viewer we had several Events 129 , in this case instead of Disk iSCSIPrt , here it is a snip from the 129 Event:

System

– Provider

[ Name]  iScsiPrt

– EventID 129

– EventData

\Device\RaidPort1

Binary data:

In Words

0000: 0018000F 00000001 00000000 80040081
0010: 00000004 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 000F0100 80040081 00000000 00000000

Clear from the error that we had an IO RESET by using Error tool

# for hex 0x80040081 / decimal -2147221375
IO_WARNING_RESET                                               ntiologc.h

 

And another clue is that in the dump we had the following function in a couple of stacks:

storport!RaidUnitResetUnit , to find all use !stacks 2 storport!RaidUnitResetUnit .

All that exercise is to demonstrate that BSOD 0x9E is a more of a hang dump and you need to use hang analysis techniques to isolate the underlying cause of the monitor timeout.

Thanks to you all and Keep Debugging.

Alessandro

Advertisements

About smartwindows

Support professional for Microsoft technologies with interest in Performance and Debugging
This entry was posted in BSOD, Hang and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s