Chasing the White Rabbit

For years I have been working with Windows operating systems, and one area I really like to dig into is performance.

The cool thing about working with Performance is that you are chasing a White Rabbit…

It is not about looking at a single value and declaring there is a problem, but about looking for patterns and interactions that explain a certain phenomenon or perception.

In the following posts I hope to share a little of my dealings with Windows performance cases as we head down the rabbit hole.

Good hunting in 2016.


Verifier and Empty Registry key

Well, first blog from the Land of 10,000 Lakes. Somebody had a server stuck in a continuous boot cycle due to a crash… nice :)

Upon checking, the bugcheck code was 0xC4 – a Driver Verifier crash… It had been a long time since I last had a pool corruption to look at. However, the stack was very interesting:

fffff880`02393488 fffff800`021604ec nt!KeBugCheckEx
fffff880`02393490 fffff800`02160f2b nt!VerifierBugCheckIfAppropriate+0x3c
fffff880`023934d0 fffff800`02171bc8 nt!ExAllocatePoolSanityChecks+0xcb
fffff880`02393510 fffff800`02171e37 nt!VeAllocatePoolWithTagPriority+0x88
fffff880`02393580 fffff880`05bc35a1 nt!VerifierExAllocatePoolWithTagPriority+0x17
fffff880`023935c0 fffff880`05bc27bb tcpipreg!InterfaceAddressRegKeyChangeHandler+0x109
fffff880`023936f0 fffff880`05bc1a59 tcpipreg!TcpipRegQueryAndUpdateKeyValue+0x363
fffff880`02393780 fffff880`00e55754 tcpipreg!TcpipRegStartRegistryKeyNotification+0xbd
fffff880`023937d0 fffff880`05bc2293 NETIO!RtlInvokeStartRoutines+0x34
fffff880`02393810 fffff800`020bdba7 tcpipreg!DriverEntry+0x257
fffff880`02393860 fffff800`020bdfa5 nt!IopLoadDriver+0xa07
fffff880`02393b30 fffff800`01cdaa95 nt!IopLoadUnloadDriver+0x55
fffff880`02393b70 fffff800`01f6fb8a nt!ExpWorkerThread+0x111
fffff880`02393c00 fffff800`01cc28e6 nt!PspSystemThreadStartup+0x5a
fffff880`02393c40 00000000`00000000 nt!KxStartSystemThread+0x16

The stack points to the tcpipreg module. However, it did not make much sense, so I asked the technician why Verifier had been enabled in the first place, just to check whether 0x19 or 0xC2 crashes had been occurring. Neither of those was happening, and having a little experience with people being trigger-happy with Verifier without understanding its consequences, we simply disabled it, and the main reboot-cycle issue was fixed.
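
For reference, a minimal sketch of the verifier.exe commands involved here (run from an elevated command prompt; changes only take effect after a reboot):

C:\> verifier /querysettings                    (show which checks and which drivers are currently selected)
C:\> verifier /reset                            (clear all Driver Verifier settings; this is what fixed the boot cycle)
C:\> verifier /flags 0x1 /driver tcpipreg.sys   (how Special Pool would be enabled for a single driver)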

Since I had a fresh dump and plenty of time, and after a run outside in zero-degree weather, nothing warms up the spirit like a debug session.

First, !analyze pointed to tcpipreg – km:x64_0xc4_0_vrf_tcpipreg!interfaceaddressregkeychangehandler+109

Checking the output of !verifier, it was running Automatic Checks and Special Pool:

 

9: kd> !verifier

Verify Flags Level 0x00000001

  STANDARD FLAGS:
    [X] (0x00000000) Automatic Checks
    [X] (0x00000001) Special pool
    [ ] (0x00000002) Force IRQL checking
    [ ] (0x00000008) Pool tracking
    [ ] (0x00000010) I/O verification
    [ ] (0x00000020) Deadlock detection
    [ ] (0x00000080) DMA checking
    [ ] (0x00000100) Security checks
    [ ] (0x00000800) Miscellaneous checks

  ADDITIONAL FLAGS:
    [ ] (0x00000004) Randomized low resources simulation
    [ ] (0x00000200) Force pending I/O requests
    [ ] (0x00000400) IRP logging

    [X] Indicates flag is enabled

Summary of All Verifier Statistics

  RaiseIrqls           0x0
  AcquireSpinLocks     0x1616566
  Synch Executions     0x3508
  Trims                0x0

  Pool Allocations Attempted             0x58f15
  Pool Allocations Succeeded             0x58f15
  Pool Allocations Succeeded SpecialPool 0x58f15
  Pool Allocations With NO TAG           0x4
  Pool Allocations Failed                0x0

  Current paged pool allocations         0x0 for 00000000 bytes
  Peak paged pool allocations            0x0 for 00000000 bytes
  Current nonpaged pool allocations      0x0 for 00000000 bytes
  Peak nonpaged pool allocations         0x0 for 00000000 bytes

Now, checking the stack and looking for the tagged allocation:

nt!VeAllocatePoolWithTagPriority+0x88

nt!VerifierExAllocatePoolWithTagPriority+0x17

In the full frame arguments we find the pool allocation address:

fffff880`02393580 fffff880`05bc35a1 : 00000000`00000000 00000000`00000000 fffff980`12d82fd0 fffff800`0216d15c : nt!VerifierExAllocatePoolWithTagPriority+0x17

9: kd> !pool fffff980`12d82fd0
Pool page fffff98012d82fd0 region is Special pool
Address fffff98012d82000 does not belong to any pool
*fffff98012d82000 size:   22 data: fffff98012d82fd0 (NonPaged) *LeGe
       Pooltag LeGe : Legacy Registry Mapping Module Buffers, Binary : tcpip.sys

The tag is mapped to the tcpip.sys driver. Whenever you don’t have the pool tag mapped correctly, you can search for it in the dump by using !for_each_module:

9: kd> !for_each_module s -a @#Base @#End "LeGe"  –> the tag is FOUR characters (pad with trailing spaces if the tag is shorter)

fffff880`05bc56fd  4c 65 47 65 49 8b d6 33-c9 44 89 5c 24 40 ff 15  LeGeI..3.D.\$@..
fffff880`05bc5744  4c 65 47 65 45 33 c9 49-8b d6 33 c9 44 8b c3 ff  LeGeE3.I..3.D…
fffff880`05bc5b73  4c 65 47 65 44 89 bc 24-f8 00 00 00 ff 15 c3 15  LeGeD..$…..

 

Then use lm a <address> to find the module that contains the hit:

9: kd> lm a fffff880`05bc5b73
start             end                 module name
fffff880`05bbf000 fffff880`05bd1000   tcpipreg   (pdb symbols)

In this case you confirm that the address belongs to tcpipreg, one of the drivers using the LeGe tag.
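
You can also run ln against any of the hit addresses to confirm which tcpipreg function contains the tag literal (a sketch using the last hit from the search above; output omitted):

9: kd> ln fffff880`05bc5b73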

But it still did not make much sense, because tcpip.sys and tcpipreg.sys were already at the latest publicly available versions for Windows Server 2008 R2.

Searching the web, we came across the following article:

http://support.microsoft.com/kb/2550576/en-us

“When Driver Verifier (Verifier.exe) is enabled on the tcpipreg.sys driver, the STOP 0xC4 occurs following a NULL length memory allocation.”

“This problem has only been observed after the machine’s TCP/IP configuration has been changed from a static IP address to a dynamic address with DHCP, which leaves an empty value for the IPAddress registry key for the network interface.”

Well, that matched the issue we were having. But since we had time, and a good dump is always worth digging into to retrieve more information and build a bag of tricks, let’s keep going. In this case it will be registry information with the !reg extension.

First we need to find which key is being manipulated, so let’s look again at the stack:

9: kd> k
Child-SP          RetAddr           Call Site
fffff880`02393488 fffff800`021604ec nt!KeBugCheckEx
fffff880`02393490 fffff800`02160f2b nt!VerifierBugCheckIfAppropriate+0x3c
fffff880`023934d0 fffff800`02171bc8 nt!ExAllocatePoolSanityChecks+0xcb
fffff880`02393510 fffff800`02171e37 nt!VeAllocatePoolWithTagPriority+0x88
fffff880`02393580 fffff880`05bc35a1 nt!VerifierExAllocatePoolWithTagPriority+0x17
fffff880`023935c0 fffff880`05bc27bb tcpipreg!InterfaceAddressRegKeyChangeHandler+0x109
fffff880`023936f0 fffff880`05bc1a59 tcpipreg!TcpipRegQueryAndUpdateKeyValue+0x363
fffff880`02393780 fffff880`00e55754 tcpipreg!TcpipRegStartRegistryKeyNotification+0xbd
fffff880`023937d0 fffff880`05bc2293 NETIO!RtlInvokeStartRoutines+0x34
fffff880`02393810 fffff800`020bdba7 tcpipreg!DriverEntry+0x257
fffff880`02393860 fffff800`020bdfa5 nt!IopLoadDriver+0xa07
fffff880`02393b30 fffff800`01cdaa95 nt!IopLoadUnloadDriver+0x55
fffff880`02393b70 fffff800`01f6fb8a nt!ExpWorkerThread+0x111
fffff880`02393c00 fffff800`01cc28e6 nt!PspSystemThreadStartup+0x5a
fffff880`02393c40 00000000`00000000 nt!KxStartSystemThread+0x16

The first function that looks like it is calling into the registry is:

fffff880`02393780 fffff880`00e55754 : fffff880`05bc7a60 00000000`00000004 00000000`00000000 fffff880`05bca318 : tcpipreg!TcpipRegStartRegistryKeyNotification+0xbd

Dump the memory near the first argument and you see a registry key path there:

9: kd> dc fffff880`05bc7a60+100
fffff880`05bc7b60  006f0043 0074006e 006f0072 0053006c  C.o.n.t.r.o.l.S.
fffff880`05bc7b70  00740065 0053005c 00720065 00690076  e.t.\.S.e.r.v.i.
fffff880`05bc7b80  00650063 005c0073 00630054 00690070  c.e.s.\.T.c.p.i.
fffff880`05bc7b90  005c0070 00610050 00610072 0065006d  p.\.P.a.r.a.m.e.
fffff880`05bc7ba0  00650074 00730072 0049005c 0074006e  t.e.r.s.\.I.n.t.
fffff880`05bc7bb0  00720065 00610066 00650063 00000073  e.r.f.a.c.e.s…
fffff880`05bc7bc0  0052005c 00670065 00730069 00720074  \.R.e.g.i.s.t.r.
fffff880`05bc7bd0  005c0079 0061004d 00680063 006e0069  y.\.M.a.c.h.i.n.
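
The same data reads more comfortably as a Unicode string with du (a sketch using the same address expression as the dc above):

9: kd> du fffff880`05bc7a60+100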

Now it is time to check our registry hive with the !reg extension.

First, do a hive list to find the HiveAddr for SYSTEM:

9: kd> !reg hivelist

——————————————————————————————————————————————————-
|     HiveAddr     |Stable Length|    Stable Map    |Volatile Length|    Volatile Map    |MappedViews|PinnedViews|U(Cnt)|     BaseBlock     | FileName
——————————————————————————————————————————————————-
| fffff8a000014010 |       1000  | fffff8a0000140c8 |       1000    |  fffff8a000014340  |        0  |        0  |     0| fffff8a000015000  | <NONAME>
| fffff8a000020410 |     e3b000  | fffff8a000027000 |      42000    |  fffff8a000020740  |        0  |        0  |     0| fffff8a000021000  | SYSTEM
| fffff8a000058320 |       f000  | fffff8a0000583d8 |      11000    |  fffff8a000058650  |        0  |        0  |     0| fffff8a000068000  | <NONAME>

 

 

After that, list the open keys in the SYSTEM hive and look for keys under Tcpip\Parameters:

!reg openkeys fffff8a000020410

Index 50:      43de5ac1 kcb=fffff8a000d6f380 cell=00075420 f=00200000 \REGISTRY\MACHINE\SYSTEM\CONTROLSET001\SERVICES\TCPIP\PARAMETERS
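
If you already know the full key path, !reg findkcb can take you straight to the key control block instead of scanning the openkeys output (a sketch):

9: kd> !reg findkcb \REGISTRY\MACHINE\SYSTEM\ControlSet001\Services\Tcpip\Parameters\Interfaces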

With that you can query the values of the keys and see that some of the keys under Tcpip\Parameters\Interfaces were empty, which matches the symptom described in the article.

9: kd> !reg querykey \REGISTRY\MACHINE\SYSTEM\CONTROLSET001\SERVICES\TCPIP\PARAMETERS\INTERFACES

Found KCB = fffff8a00184a5d0 :: \REGISTRY\MACHINE\SYSTEM\CONTROLSET001\SERVICES\TCPIP\PARAMETERS\INTERFACES

Hive         fffff8a000020410
KeyNode      fffff8a0053a93ac

[SubKeyAddr]         [SubKeyName]
fffff8a000b910cc     {065f0c42-703a-11de-9954-806e6f6e6963}
fffff8a000b910cc     {065f0c42-703a-11de-9954-806e6f6e6963}
fffff8a000b910cc     {065f0c42-703a-11de-9954-806e6f6e6963}

Use ‘!reg keyinfo fffff8a000020410 <SubKeyAddr>’ to dump the subkey details

[ValueType]         [ValueName]                   [ValueData]
Key has no Values
9: kd> !reg keyinfo fffff8a000020410 fffff8a000b910cc

KeyPath     \REGISTRY\MACHINE\SYSTEM\ControlSet001\services\Tcpip\Parameters\Interfaces\{065f0c42-703a-11de-9954-806e6f6e6963}

[ValueType]         [ValueName]                   [ValueData]
 Key has no Values
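
On a live system you can make the same check without a debugger; a minimal PowerShell sketch that lists the IPAddress value (or the lack of one) for every interface key on the box:

Get-ChildItem 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces' |
    ForEach-Object { Get-ItemProperty $_.PSPath | Select-Object PSChildName, IPAddress }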

 

Well, that ends this blog. Just make sure that every time you request something as intrusive as Driver Verifier, you understand the consequences and why you are actually requesting it.

Good hunting,

Alessandro

 


Analyzing 0x9E BSOD–Part III

Hello,

It has been a long time coming to finish this third part of the series. Well, today the ntdebugging blog posted about 0x9E – http://blogs.msdn.com/b/ntdebugging/archive/2014/08/27/windows-troubleshooting-stop-9e-explained.aspx – and I could not let the chance pass to finish the third part, where we chase down the IRP.

Just a good reminder that some events are unmistakable about their origin. All of us folks in the trenches have seen Disk events 153 and 129; be sure to recognize them and understand their meaning.

Please review the following blogs:

Event 153 – http://blogs.msdn.com/b/ntdebugging/archive/2013/04/30/interpreting-event-153-errors.aspx

Event 129 – http://blogs.msdn.com/b/ntdebugging/archive/2011/05/06/understanding-storage-timeouts-and-event-129-errors.aspx

Now back to the 0x9E debugging. From the previous blog post we saw system worker thread exhaustion in the !exqueue output and threads with pending I/O.

6: kd> !irp fffffa802fd05790
Irp is active with 12 stacks 7 is current (= 0xfffffa802fd05a10)
Mdl=fffffa8055b40bd0: No System Buffer: Thread fffffa8024c5b360:  Irp stack trace.
cmd  flg cl Device   File     Completion-Context
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
>[  4,34]  1c e0 fffffa80537a1790 00000000 fffff880012831b0-00000000 Success Error Cancel
\Driver\Disk    partmgr!PmReadWriteCompletion
Args: 00001000 00000000 bbd82000 00000000
[  4, 0]  1c e0 fffffa80537a2040 00000000 fffff88001293010-fffffa80537e0e20 Success Error Cancel
\Driver\partmgr    volmgr!VmpReadWriteCompletionRoutine
Args: 613ffaf49d 00000000 bbd82000 00000000
[  4, 0]   c e0 fffffa80537e0cd0 00000000 fffff880011c41b0-fffffa803283a5d0 Success Error Cancel
\Driver\volmgr    volsnap!VspWriteContextCompletionRoutine
Args: 00001000 00000000 613ffaf49b 00000000
[  4, 0]   c e1 fffffa8053852030 00000000 fffff88001823fe4-fffff8800252dde8 Success Error Cancel pending
\Driver\volsnap    Ntfs!NtfsMasterIrpSyncCompletionRoutine
Args: 00001000 00000000 bbd72000 00000000
[  4, 0]   0 e0 fffffa8053a55030 fffffa8053a45630 fffff88001420e40-fffffa8055548570 Success Error Cancel
\FileSystem\Ntfs    fltmgr!FltpPassThroughCompletion
Args: 00001000 00000000 03e92000 00000000
[  4, 0]   0  1 fffffa8053a42380 fffffa8053a45630 00000000-00000000    pending
           \FileSystem\FltMgr
Args: 00001000 00000000 03e92000 00000000

At this point we need to capture the associated IRP from this I/O packet in order to identify where in the stack the I/O is really pending.

To do that we follow the instructions listed in this blog – Where Did my Disk I/O Go?

Here is an extract from that blog:

“Hello, Mr. Ninja back again. I recently discovered that although my team often tracks I/O from the file system through to the disk controller, we have never publicly documented the steps required to do this. This seems like a great opportunity for a blog because most of the structures are known, and they are even included in the public symbols.

When debugging a system that is hung, slow, or otherwise unresponsive you will likely encounter an IRP that has been sent from ntfs to the disk. Running !irp against such a request will show that the request has gone to disk.sys, but that is not really where the story ends.”

Following those steps, we found the associated IRP. I am just pasting the result from my dump analysis; be sure to review the blog, as it shows all the steps.

6: kd> !irp 0xfffffa80`555ac430
Irp is active with 5 stacks 6 is current (= 0xfffffa80555ac668)
Mdl=fffffa8053b6b2f0: No System Buffer: Thread 00000000:  Irp is completed.  Pending has been returned
cmd  flg cl Device   File     Completion-Context
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  f, 0]   0  0 fffffa80305768b0 00000000 fffff88001259290-fffffa80548d5670
\Driver\iScsiPrt    mpio!MPIOPdoCompletion
—> LAST STACK in the iSCSIPRT
Args: 00000000 00000000 00000000 fffffa8030576a00
[  f, 0]   0  0 fffffa8053651060 00000000 fffff880014a8f8c-fffffa8054579010
\Driver\mpio    EmcpBase
Args: 00000000 00000000 00000000 fffffa80548d5670
[  f, 0]   0  0 fffffa8053704ca0 00000000 fffff88001468a00-fffffa805564fd00
\Driver\EmcpBase    CLASSPNP!TransferPktComplete
Args: 00000000 00000000 00000000 00000000

In this case we had several I/Os where the lowest driver in the IRP stack was iScsiPrt, meaning that the I/O had actually been sent to the storage array and we were waiting for a response.

Checking the event viewer, we had several event 129 entries, in this case from iScsiPrt rather than Disk. Here is a snip from the 129 event:

System
  Provider Name : iScsiPrt
  EventID       : 129
EventData
  \Device\RaidPort1

Binary data (in words):

0000: 0018000F 00000001 00000000 80040081
0010: 00000004 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 000F0100 80040081 00000000 00000000

It is clear from the error code, decoded with the Err error-code lookup tool, that we had an I/O reset:

# for hex 0x80040081 / decimal -2147221375
IO_WARNING_RESET                                               ntiologc.h
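
If you prefer to pull these 129 events with PowerShell instead of clicking through the Event Viewer, a sketch (assuming the provider name iScsiPrt shown in the snip above):

Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'iScsiPrt'; Id = 129 } |
    Select-Object TimeCreated, Id, ProviderName, Message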

 

Another clue is that the dump had the following function in a couple of stacks: storport!RaidUnitResetUnit. To find all of them, use !stacks 2 storport!RaidUnitResetUnit.

All of this exercise demonstrates that a 0x9E BSOD is really more of a hang dump, and you need to use hang-analysis techniques to isolate the underlying cause of the health-monitor timeout.

Thanks to you all and Keep Debugging.

Alessandro


Bugcheck 0x124–Hardware error has occurred

There are some bugcheck codes that scream the root cause right in their description, and 0x124 is one of them.

13: kd> !analyze -show 0x124
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error condition.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: 0000000000000000, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000000000, Low order 32-bits of the MCi_STATUS value.

Whenever you see this bugcheck, you can be sure that some hardware component failed, and we need to find out which one. Luckily this is rather simple today with WHEA, the Windows Hardware Error Architecture.

Suppose you received the following bugcheck:

13: kd> .bugcheck
Bugcheck code 00000124
Arguments 00000000`00000005 fffffa80`0d514028 00000000`00000000 00000000`00000000

The first step is to dump the error record structure from the second bugcheck argument with the !errrec extension:

13: kd> !errrec fffffa800d514028
===============================================================================
Common Platform Error Record @ fffffa800d514028
——————————————————————————-
Record Id     : 01cfa323af4475e4
Severity      : Fatal (1)
Length        : 408
Creator       : Microsoft
Notify Type   : Generic
Timestamp     : 7/20/2014 23:56:37 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : PCI Express
——————————————————————————-
Descriptor    @ fffffa800d5140a8
Section       @ fffffa800d5140f0
Offset        : 200
Length        : 208
Flags         : 0x00000001 Primary
Severity      : Fatal

Port Type     : Root Port
Version       : 1.0
Command/Status: 0x0546/0x4010
Device Id     :
VenId:DevId : 8086:3410 —> DEVICE REPORTING THE PROBLEM
Class code  : 060400
Function No : 0x00
Device No   : 0x09
Segment     : 0x0000
Primary Bus : 0x00
Second. Bus : 0x07
Slot        : 0x0000
Sec. Status   : 0x6000
Bridge Ctl.   : 0x0007
Express Capability Information @ fffffa800d514124
Device Caps : 00008021 Role-Based Error Reporting: 1
Device Ctl  : 012c UR FE nf ce
Dev Status  : 0004 ur FE nf ce
Root Ctl   : 000e FS NFS cs

If you want to confirm, compare the timestamp reported in the record with the bugcheck time from .time.
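
A quick sketch of that check:

13: kd> .time     (compare the "Debug session time" line against the Timestamp field of the error record above)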

With the vendor and device ID you can locate the device with the !pcitree extension.

Here is the section of the output with the device:

(d=9,  f=0) 80863410 devext 0xfffffa800a2061b0 devstack 0xfffffa800a206060 0604 Bridge/PCI to PCI
Bus 0x7 (FDO Ext fffffa800a227e00)
(d=0,  f=0) 111d8061 devext 0xfffffa800a276b60 devstack 0xfffffa800a276a10 0604 Bridge/PCI to PCI
Bus 0x8 (FDO Ext fffffa800a266850)
(d=2,  f=0) 111d8061 devext 0xfffffa800a27bb60 devstack 0xfffffa800a27ba10 0604 Bridge/PCI to PCI
Bus 0x9 (FDO Ext fffffa800a273850)
(d=0,  f=0) 14e41639 devext 0xfffffa800a2801b0 devstack 0xfffffa800a280060 0200 Network Controller/Ethernet
(d=0,  f=1) 14e41639 devext 0xfffffa800a2811b0 devstack 0xfffffa800a281060 0200 Network Controller/Ethernet
(d=4,  f=0) 111d8061 devext 0xfffffa800a27cb60 devstack 0xfffffa800a27ca10 0604 Bridge/PCI to PCI
Bus 0xa (FDO Ext fffffa800a27c7b0)
(d=0,  f=0) 14e41639 devext 0xfffffa800a284b60 devstack 0xfffffa800a284a10 0200 Network Controller/Ethernet
(d=0,  f=1) 14e41639 devext 0xfffffa800a285b60 devstack 0xfffffa800a285a10 0200 Network Controller/Ethernet

In this case it is a PCI bridge, and from the devices attached to it (network controllers) you can determine which PCI bridge you are investigating.

It is worth noting that WHEA entries are also stored in the event viewer, so you can review all the entries much more easily: Applications and Services Logs\Microsoft\Windows\Kernel-WHEA.
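
If you want to pull those WHEA entries with PowerShell, a sketch (the exact channel name can vary between OS versions, so discover it first):

Get-WinEvent -ListLog *Kernel-WHEA* | Select-Object LogName, RecordCount
Get-WinEvent -LogName 'Microsoft-Windows-Kernel-WHEA/Errors' -MaxEvents 10 | Format-List TimeCreated, Id, Message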

Before replacing anything, make sure you are on the latest firmware for the devices and motherboard according to the vendor. If the issue continues, it is possible to isolate the failing hardware behind the bridge by removing one device at a time.

This technique is similar to the one from a previous post I wrote on a PCI error, and you can use it to further identify the issue.

Thanks and good hunting,

Alessandro


Analyzing 0x9E BSOD–Part II

Continuing the tips from the previous blog on troubleshooting a 0x9E BSOD, I mentioned that we should apply hang-dump analysis tactics to this particular blue screen.

Using the same kernel dump from that blog, we saw that !exqueue /f showed several pending I/Os. This is a key symptom in any hang dump, but beyond that we need to understand what it means. Here is a good read from MSDN – http://msdn.microsoft.com/en-us/library/windows/hardware/ff564587(v=vs.85).aspx .

A good extract to understand it:

“A DPC that needs to initiate a processing task that requires lengthy processing or that makes a blocking call should delegate the processing of that task to one or more work items. While a DPC runs, all threads are prevented from running. Additionally, a DPC, which runs at IRQL = DISPATCH_LEVEL, must not make blocking calls. However, the system worker thread that processes a work item runs at IRQL = PASSIVE_LEVEL. Thus, the work item can contain blocking calls. For example, a system worker thread can wait on a dispatcher object.

Because the pool of system worker threads is a limited resource, WorkItem and WorkItemEx routines can be used only for operations that take a short period of time. If one of these routines runs for too long (if it contains an indefinite loop, for example) or waits for too long, the system can deadlock. Therefore, if a driver requires long periods of delayed processing, it should instead call PsCreateSystemThread to create its own system thread.”

If you check the output from !exqueue /f, you notice a lot of pending I/O, which means there are no more system worker threads available to process it. We need to check how long the threads in the WAIT state have been waiting.

Looking at the output:

6: kd> !exqueue
**** Critical WorkQueue ( Threads: 26/512, Concurrency: 0/12 )
THREAD fffffa8024c5b360  Cid 0004.001c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c4b040  Cid 0004.0020  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c4bb50  Cid 0004.0024  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c4b660  Cid 0004.0028  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c4a040  Cid 0004.002c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c4ab50  Cid 0004.0030  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c4a660  Cid 0004.0034  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c49040  Cid 0004.0038  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c49b50  Cid 0004.003c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c49660  Cid 0004.0040  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa803d6d3b50  Cid 0004.2e10  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8040e58b50  Cid 0004.24e8  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa804cfbdb50  Cid 0004.1e7c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa80352a8710  Cid 0004.2d9c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8024c89b50  Cid 0004.1fdc  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8036e74850  Cid 0004.3214  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa80316fdb50  Cid 0004.2028  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8030f53b50  Cid 0004.2e34  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8040c2db50  Cid 0004.32ec  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8033f90670  Cid 0004.1a60  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa802fe4ba80  Cid 0004.1b20  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa80479ea040  Cid 0004.14f0  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa80311d32f0  Cid 0004.1a1c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa803b49fb50  Cid 0004.2d54  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa8030da6640  Cid 0004.182c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT
THREAD fffffa803c989b50  Cid 0004.26ec  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT

Dumping the first thread (you should examine all of them to check whether they show the same behavior):

6: kd> !thread fffffa8024c5b360
THREAD fffffa8024c5b360  Cid 0004.001c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
fffff8800252ddf0  NotificationEvent
IRP List:
fffffa802fd05790: (0006,0430) Flags: 00060043  Mdl: fffffa8055b40bd0
Not impersonating
DeviceMap                 fffff8a0000087d0
Owning Process            fffffa8024c39040       Image:         System
Attached Process          N/A            Image:         N/A
Wait Start TickCount      10307049       Ticks: 933617 (0:04:02:44.518)
Context Switch Count      949185         IdealProcessor: 1
UserTime                  00:00:00.000
KernelTime                00:00:58.983
Win32 Start Address nt!ExpWorkerThread (0xfffff800064ce4a8)
Stack Init fffff8800252edb0 Current fffff8800252d790
Base fffff8800252f000 Limit fffff88002529000 Call 0
Priority 16 BasePriority 13 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`0252d7d0 fffff800`064baf72 : fffff880`0252de01 fffffa80`24c5b360 00000000`00000000 00000000`00000000 : nt!KiSwapContext+0x7a
fffff880`0252d910 fffff800`064cc39f : fffff880`0252de80 fffffa80`53852030 00000000`00000000 fffff880`0252db40 : nt!KiCommitThreadWait+0x1d2
fffff880`0252d9a0 fffff880`0182309f : fffffa80`53a53100 00000000`00000000 fffffa80`53a45700 fffff880`0141c600 : nt!KeWaitForSingleObject+0x19f
fffff880`0252da40 fffff880`0182bd74 : fffff880`0252de80 fffffa80`2fd05790 00000000`00000000 00000000`00000000 : Ntfs!NtfsNonCachedIo+0x23f
fffff880`0252dc10 fffff880`018271b5 : fffff880`0252de80 fffffa80`2fd05790 fffffa80`53a45720 00000000`03e92000 : Ntfs!NtfsNonCachedUsaWrite+0x64
fffff880`0252dca0 fffff880`01827e73 : fffff880`0252de80 fffffa80`2fd05790 fffff880`0252e000 00000000`00001000 : Ntfs!NtfsCommonWrite+0x2ca4
fffff880`0252de50 fffff880`0141dbcf : fffffa80`2fd05b30 fffffa80`2fd05790 fffffa80`54349ad0 00000000`00000000 : Ntfs!NtfsFsdWrite+0x1c3
fffff880`0252e0d0 fffff880`0141c6df : fffffa80`3073e470 fffffa80`55548570 fffffa80`3073e400 fffffa80`2fd05790 : fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x24f
fffff880`0252e160 fffff880`0141dbcf : fffffa80`2fd05b78 fffff880`0141e329 fffff880`0252e200 00000000`000c1004 : fltmgr!FltpDispatch+0xcf
fffff880`0252e1c0 fffff880`0141c6df : fffffa80`53a42380 fffffa80`53a45630 fffffa80`53a42300 fffffa80`2fd05790 : fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x24f
fffff880`0252e250 fffff800`064aa8ff : fffffa80`2fd05790 fffff880`0252e7e0 fffff880`0252e460 fffff880`009f0180 : fltmgr!FltpDispatch+0xcf
fffff880`0252e2b0 fffff800`06508adb : 00000000`03e92000 fffff880`0252e7e0 fffffa80`53a45570 00000000`00000000 : nt!IoSynchronousPageWrite+0x24f
fffff880`0252e330 fffff800`06507168 : fffff8a0`020b2490 fffff8a0`020b2498 fffffa80`53a45570 fffffa80`53a45570 : nt!MiFlushSectionInternal+0xb7b
fffff880`0252e570 fffff800`065066cc : 00000000`00001000 00000000`00001000 00000000`00001000 00000004`f47d2500 : nt!MmFlushSection+0x1f4
fffff880`0252e630 fffff880`0182b70b : fffffa80`53a4df00 00000004`00000000 00000000`00001000 00000000`00000000 : nt!CcFlushCache+0x7bc
fffff880`0252e730 fffff880`018d6dfb : fffff8a0`02090e20 00000004`f47d266e 00000000`00000000 fffff880`00000000 : Ntfs!LfsFlushLfcb+0x647
fffff880`0252e8b0 fffff880`018d8f10 : fffff8a0`020c7760 00000004`f47d266e fffff880`0252ec58 fffff880`0252ec58 : Ntfs!LfsFlushToLsnPriv+0x143
fffff880`0252e940 fffff800`06507a31 : fffff8a0`020c7760 00000004`f47d266e 00000004`f47d266e fffffa80`53a69790 : Ntfs!LfsFlushToLsn+0xa0
fffff880`0252e970 fffff800`065061d2 : fffffa80`53a69790 fffffa80`3241dd80 fffffa80`00000000 fffff880`0252eaa8 : nt!CcAcquireByteRangeForWrite+0x7e7
fffff880`0252ea50 fffff800`06509e46 : fffffa80`53a5d240 00000000`00000001 fffffa80`00000001 fffffa80`00001000 : nt!CcFlushCache+0x2c2
fffff880`0252eb50 fffff800`0650a814 : fffff800`066ce900 fffff880`0252ec58 fffffa80`30d198f0 00000002`c4310000 : nt!CcWriteBehind+0x1c6
fffff880`0252ec00 fffff800`064ce5b9 : fffffa80`24c66530 fffff800`067be300 fffff800`066ce920 00000000`00000000 : nt!CcWorkerThread+0x1c8
fffff880`0252ecb0 fffff800`06763d6a : 00000000`00000000 fffffa80`24c5b360 00000000`00000080 fffffa80`24c39040 : nt!ExpWorkerThread+0x111
fffff880`0252ed40 fffff800`064b6266 : fffff880`021b7180 fffffa80`24c5b360 fffff880`021c21c0 00000000`00000000 : nt!PspSystemThreadStartup+0x5a
fffff880`0252ed80 00000000`00000000 : fffff880`0252f000 fffff880`02529000 fffff880`0252dd30 00000000`00000000 : nt!KiStartSystemThread+0x16

We notice that the thread is doing I/O against the file system and there is an IRP (I/O Request Packet) associated with it. Another piece of information is the wait tick count, which is absurdly high! If you dump all the other threads in the output, you see that the wait time is very similar. Another tip is to dump the System process and look for worker threads with the same high wait time.
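
Just how high? A quick back-of-the-envelope check, assuming the default 15.6001 ms clock tick used for the conversion:

933617 ticks × 15.6001 ms/tick ≈ 14,564,519 ms ≈ 4 hours 2 minutes 44 seconds

which matches the 0:04:02:44.518 the debugger prints next to the tick count: this worker thread has been stuck in the same wait for over four hours.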

Now you can dump the IRP associated with that thread:

6: kd> !irp fffffa802fd05790
Irp is active with 12 stacks 7 is current (= 0xfffffa802fd05a10)
Mdl=fffffa8055b40bd0: No System Buffer: Thread fffffa8024c5b360:  Irp stack trace.
cmd  flg cl Device   File     Completion-Context
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000   

            Args: 00000000 00000000 00000000 00000000
>[  4,34]  1c e0 fffffa80537a1790 00000000 fffff880012831b0-00000000 Success Error Cancel
\Driver\Disk    partmgr!PmReadWriteCompletion
Args: 00001000 00000000 bbd82000 00000000
[  4, 0]  1c e0 fffffa80537a2040 00000000 fffff88001293010-fffffa80537e0e20 Success Error Cancel
\Driver\partmgr    volmgr!VmpReadWriteCompletionRoutine
Args: 613ffaf49d 00000000 bbd82000 00000000
[  4, 0]   c e0 fffffa80537e0cd0 00000000 fffff880011c41b0-fffffa803283a5d0 Success Error Cancel
\Driver\volmgr    volsnap!VspWriteContextCompletionRoutine
Args: 00001000 00000000 613ffaf49b 00000000
[  4, 0]   c e1 fffffa8053852030 00000000 fffff88001823fe4-fffff8800252dde8 Success Error Cancel pending
           \Driver\volsnap    Ntfs!NtfsMasterIrpSyncCompletionRoutine
Args: 00001000 00000000 bbd72000 00000000
[  4, 0]   0 e0 fffffa8053a55030 fffffa8053a45630 fffff88001420e40-fffffa8055548570 Success Error Cancel
\FileSystem\Ntfs    fltmgr!FltpPassThroughCompletion
Args: 00001000 00000000 03e92000 00000000
[  4, 0]   0  1 fffffa8053a42380 fffffa8053a45630 00000000-00000000    pending
\FileSystem\FltMgr
Args: 00001000 00000000 03e92000 00000000

We can actually see that there is pending I/O!

In the next blog we are going to chase down the I/O and recognize in which part of the I/O stack it is actually pending so we can address it.

Thanks for the company and keep debugging on,

Alessandro


Analyzing 0x9E BSOD–Part I

Now that the World Cup is over… by the way, congratulations to all the Germans!!

Back to debugging and blogging …

One very popular bugcheck I face is 0x9E – USER_MODE_HEALTH_MONITOR – which often leads to misleading interpretations of the root cause of the crash. Basically, a user-mode component failed to satisfy a health check. It is a common occurrence with the Windows Failover Clustering feature, as described in the following blog – Why is my Failover Clustering node blue screening with a Stop 0x9E?

But my question is: can you collect information from a kernel memory dump to help you troubleshoot what caused the issue? The answer is likely YES!

I am going to show some of the kernel dumps with 0x9E I have analyzed lately.

First you need to understand the anatomy of this bugcheck:

6: kd> !analyze -show 0x9E
USER_MODE_HEALTH_MONITOR (9e)
One or more critical user mode components failed to satisfy a health check.
Hardware mechanisms such as watchdog timers can detect that basic kernel
services are not executing. However, resource starvation issues, including
memory leaks, lock contention, and scheduling priority misconfiguration,
may block critical user mode components without blocking DPCs or
draining the nonpaged pool.
Kernel components can extend watchdog timer functionality to user mode
by periodically monitoring critical applications. This bugcheck indicates
that a user mode health check failed in a manner such that graceful
shutdown is unlikely to succeed. It restores critical services by
rebooting and/or allowing application failover to other servers.
Arguments:
Arg1: 0000000000000000, Process that failed to satisfy a health check within the
configured timeout
Arg2: 0000000000000000, Health monitoring timeout (seconds)
Arg3: 0000000000000000
Arg4: 0000000000000000

Basically, something is not right and we need to recover ASAP, so we crash the box. Lately we are seeing a lot of this workaround from Exchange (0xEF and 0xF4) and from the cluster service, with the assumption that there are several other nodes to pick up the load and we do not want to hang around until the problem goes away or the admin restarts the box.

So we are dealing with a kernel hang dump and should treat it as such, with the added bonus that the bugcheck tells us which process failed to satisfy the timeout and threshold. It would definitely be awesome to have the user-mode memory from the process, but configuring a full memory dump can consume a lot of disk space, and it is definitely not going to be the first type of dump you receive.

As far as hang dumps are concerned, I always use these extensions:

!vm, !ready, !locks, !running -it, !dpcs, !exqueue /f

By using all of those extensions you collect basic information about the state of the server at the moment of the crash.
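
As a quick cheat sheet, here is what each one shows (the annotations are mine):

!vm          – overall memory and pool usage; look for pool or PTE exhaustion
!ready       – threads that are ready to run on each processor; a long list hints at CPU starvation
!locks       – kernel ERESOURCE locks that have waiters, i.e. lock contention
!running -it – what is currently running on each processor, including idle, with stacks
!dpcs        – DPCs queued on each processor
!exqueue /f  – the executive worker queues, including queued (pending) work items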

In some dumps we see right away that something is blocking.

In one of my analyses, !exqueue pointed to several pending I/Os due to system worker thread exhaustion:

 

6: kd> !exqueue /f
**** Critical WorkQueue ( Threads: 26/512, Concurrency: 0/12 )
THREAD fffffa8024c5b360  Cid 0004.001c  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT

*

*

*

PENDING: IoWorkItem (fffffa80538a3730) Routine usbccgp!UsbcWorkerFunction (fffff8800512f288) IoObject (fffffa803041c060) Context (fffffa80309c24f0)
PENDING: ExWorkItem (fffff80006678200) Routine nt!ObpProcessRemoveObjectQueue (fffff80006779398) Parameter (0000000000000000)
PENDING: ExWorkItem (fffffa80553ebdc0) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa80553ebdb0)
PENDING: IoWorkItem (fffffa8054450950) Routine CLFS!CClfsContainer::WorkRoutine (fffff88000d3adc0) IoObject (fffffa8052a09780) Context (fffffa8025202e80)
PENDING: ExWorkItem (fffff8800186d050) Routine Ntfs!NtfsFspClose (fffff8800189c1cc) Parameter (0000000000000000)
PENDING: ExWorkItem (fffffa802fe4e380) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa802fe4e370)
PENDING: IoWorkItem (fffffa8055a62e60) Routine msiscsi!iSpPrematureSessionTerminationWorker (fffff88004f0125c) IoObject (fffffa80252df3b0) Context (fffffa805574c390)
PENDING: ExWorkItem (fffffa8054400a80) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa8054400a70)
PENDING: ExWorkItem (fffffa8055d290b0) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa8055d290a0)
PENDING: ExWorkItem (fffffa805467e690) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa805467e680)
PENDING: IoWorkItem (fffffa805377f1f0) Routine partmgr!PmNotificationWorkItem (fffff8800127f830) IoObject (fffffa80537acb90) Context (0000000000000000)
PENDING: ExWorkItem (fffffa8033c0e170) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa8033c0e160)
PENDING: ExWorkItem (fffffa8032534ef0) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa8032534ee0)
PENDING: ExWorkItem (fffffa8042bdf240) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa8042bdf230)
PENDING: ExWorkItem (fffffa80378479c0) Routine NDIS!ndisWorkItemHandler (fffff88001155010) Parameter (fffffa80378479b0)
PENDING: ExWorkItem (fffff8800186d1d8) Routine Ntfs!NtfsCheckUsnTimeOut (fffff8800188b02c) Parameter (0000000000000000)
PENDING: IoWorkItem (fffffa8053642d00) Routine partmgr!PmNotificationWorkItem (fffff8800127f830) IoObject (fffffa80537a4040) Context (0000000000000000)
PENDING: IoWorkItem (fffffa805367f4b0) Routine partmgr!PmNotificationWorkItem (fffff8800127f830) IoObject (fffffa80537a79e0) Context (0000000000000000)
PENDING: ExWorkItem (fffffa8053f866d0) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa8053f866c0)
PENDING: ExWorkItem (fffffa8033fef370) Routine NDIS!ndisDoOidRequests (fffff880010bf430) Parameter (fffffa8033fef360)

At this point we know that the server is in a hung state; if you follow this clue and resolve why there are so many pending I/Os, you will also get to the bottom of the 0x9E bugcheck.

Since this blog was getting too long, I will follow up later with this same dump and point out where the issue was.

Thanks and keep on debugging,

Alessandro


Exchange Server 2013–0xEF BSOD

Lately I have seen a couple of incidents where Exchange Server 2013 machines would blue screen periodically. All of the BSODs had the same bugcheck code, 0xEF.

From WinDBG help:

0: kd> !analyze -show 0xEF
CRITICAL_PROCESS_DIED (ef)
        A critical system process died
Arguments:
Arg1: 0000000000000000, Process object or thread object
Arg2: 0000000000000000, If this is 0, a process died. If this is 1, a thread died.
Arg3: 0000000000000000
Arg4: 0000000000000000

Well, in this case we know that something killed a critical part of the operating system. From one of the bugchecks:

0: kd> .bugcheck
Bugcheck code 000000EF
Arguments fffffa80`30e39200 00000000`00000000 00000000`00000000 00000000`00000000

0: kd> !process fffffa8030e39200 0
GetPointerFromAddress: unable to read from fffff802c9775000
PROCESS fffffa8030e39200
    SessionId: none  Cid: 01e0    Peb: 7f74e02f000  ParentCid: 018c
    DirBase: 108e0c000  ObjectTable: fffff8a0019a9600  HandleCount: <Data Not Accessible>
    Image: wininit.exe

We know that something terminated WININIT. From WinDBG, once we open the dump, the current (crashing) thread is displayed by !thread:

0: kd> !thread
GetPointerFromAddress: unable to read from fffff802c9775000
THREAD fffffa803a4146c0  Cid 05c8.a370  Teb: 000007f5ff496000 Win32Thread: fffff90100751010 RUNNING on processor 0
Not impersonating
GetUlongFromAddress: unable to read from fffff802c9694cf0
Owning Process            fffffa8031941080       Image:        msexchangerepl
Attached Process          N/A            Image:         N/A
fffff78000000000: Unable to get shared data
Wait Start TickCount      79453518    
Context Switch Count      295            IdealProcessor: 0            
ReadMemory error: Cannot get nt!KeMaximumIncrement value.

It points to a thread from the MSExchangeRepl service; by dumping the stack you see the call to terminate the process:

0: kd> k
Child-SP          RetAddr           Call Site
fffff880`0b2f49a8 fffff802`c99a165d nt!KeBugCheckEx
fffff880`0b2f49b0 fffff802`c9938e7e nt!PspCatchCriticalBreak+0xad
fffff880`0b2f49f0 fffff802`c98ae7a1 nt! ?? ::NNGAKEGL::`string'+0x4aba4
fffff880`0b2f4a50 fffff802`c98b45d6 nt!PspTerminateProcess+0x6d
fffff880`0b2f4a90 fffff802`c9476453 nt!NtTerminateProcess+0x9e
fffff880`0b2f4b00 000007fa`b5ee2eaa nt!KiSystemServiceCopyEnd+0x13
00000000`2442e5d8 00000000`00000000 0x000007fa`b5ee2eaa
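
The bottom frame (0x000007fa`b5ee2eaa) is an unresolved user-mode return address. With a kernel or mini dump there is usually nothing behind it, but if the dump had captured user-mode memory (a complete memory dump), you could try to resolve it by switching to the owning process from the !thread output above; a sketch:

0: kd> .process /r /p fffffa8031941080
0: kd> .reload /user
0: kd> k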

 

So now we know that MSExchangeRepl called nt!NtTerminateProcess. Basically, your time with WinDBG is done. The good thing is that all of this information can be obtained from a minidump, and now we have to go check Exchange Server for what could possibly be causing this BSOD – and you guessed right if you said Managed Availability :)

Run the following PowerShell command from KB 2883203:

(Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/* | % {[XML]$_.toXml()}).event.userData.eventXml | ?{$_.ActionID -like "*ForceReboot*"} | ft RequesterName

It will show you the requester that issued the reboot, and then you can start your Exchange troubleshooting from there.

Good hunt,

Alessandro
