DAG Node Join Fail–0x5b4

Just a quick blog post and reminder about something I worked on a couple of months ago, maybe even longer, that took a considerable amount of time to figure out. Then last week a colleague pinged me with a question about a case they had been dealing with for quite some time.

When trying to join a second DAG member in a virtualized Hyper-V environment, the following error message is presented:

“WriteError! Exception = Microsoft.Exchange.Cluster.Replay.DagTaskOperationFailedException: A server-side database availability group administrative operation failed. Error The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Cluster API ‘”AddClusterNode() (MaxPercentage=100) failed with 0x5b4. Error: This operation returned because the timeout period expired”‘ failed.. —> Microsoft.Exchange.Cluster.Replay.AmClusterApiException: An Active Manager operation failed. Error: An error occurred while attempting a cluster operation. Error: Cluster API ‘”AddClusterNode() (MaxPercentage=100) failed with 0x5b4. Error: This operation returned because the timeout period expired”‘ failed. —> System.ComponentModel.Win32Exception: This operation returned because the timeout period expired”

If you look at the cluster logs, you will notice the heartbeat failing on UDP port 3343.
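If you want to pull the cluster log and quickly confirm that no firewall rule is blocking the heartbeat port, a rough PowerShell sketch like the one below can help. It assumes the FailoverClusters module is installed on the node, and the destination folder is only an example:

# Generate the cluster log for the local node (FailoverClusters module)
Get-ClusterLog -Node $env:COMPUTERNAME -Destination C:\Temp

# List enabled inbound firewall rules that cover UDP 3343 (cluster heartbeat)
Get-NetFirewallRule -Enabled True -Direction Inbound |
    Where-Object {
        $pf = $_ | Get-NetFirewallPortFilter
        $pf.Protocol -eq 'UDP' -and $pf.LocalPort -contains '3343'
    } |
    Select-Object DisplayName, Action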

Obviously you need to go through the normal troubleshooting steps for network connectivity and firewall rules. In this case everything seemed right; as my friend pointed out, every knob had been turned and he was ready to give up. That’s when I asked…

Have you disabled TCP/UDP checksum offload on both the host and guest network cards?

After disabling these options on both the guest and the parent partition, the cluster could be formed without any issues.

So in this case, after you have made sure the obvious has been checked and you still cannot join your cluster nodes in a virtual environment, take a look at the NIC properties for TCP/UDP checksum offload for IPv4 and IPv6.
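On recent Windows versions you can check and flip these settings from PowerShell instead of clicking through the NIC property pages. A minimal sketch, assuming the NetAdapter module (Windows Server 2012 or later); the adapter name "Ethernet" is a placeholder for your own:

# Show current checksum offload settings (TCP/UDP, IPv4/IPv6) for all adapters
Get-NetAdapterChecksumOffload

# Disable TCP and UDP checksum offload on a specific adapter; run this on the guest
# and repeat on the Hyper-V parent partition for the corresponding physical NIC
Disable-NetAdapterChecksumOffload -Name "Ethernet" -TcpIPv4 -TcpIPv6 -UdpIPv4 -UdpIPv6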

Thanks,

Alessandro


File System Cache–Dirty Pages Threshold

There are weeks where the same type of issue keeps recurring and makes you study and understand a little bit more about Windows internals.

These past two weeks I had a couple of cases involving hangs that were caused by the File System Cache, and they will be the topic of this two-part blog.

With every support incident it is very important to understand the system and make sure you collect as much information as possible to help with the issue. In this case, a server had a brief hang, meaning it was unresponsive for a couple of minutes but would recover afterwards. The biggest problem was that the backup operation never completed successfully, since the application would time out before the server recovered.

During the hang period a kernel dump was collected to try to discover the reason the server was unresponsive, and luckily the dump was collected at the right time.

The investigation pointed to a large number of outstanding pages within the file system cache, or dirty pages, exceeding the threshold set by the operating system. The easiest way to check these values is with the WinDBG extension !defwrites:

*** Cache Write Throttle Analysis ***

CcTotalDirtyPages:               9242930 (36971720 Kb)
CcDirtyPageThreshold:            9242246 (36968984 Kb)
    MmAvailablePages:               22047648 (88190592 Kb)
MmThrottleTop:                       450 (    1800 Kb)
MmThrottleBottom:                     80 (     320 Kb)
MmModifiedPageListHead.Total:    9283506 (37134024 Kb)

CcTotalDirtyPages >= CcDirtyPageThreshold, writes throttled

This indicated that the server was throttling writes to the file system until it could catch up and flush those pages. Another detail is that the server had plenty of memory available:

*** Virtual Memory Usage ***
    Physical Memory:    33538195 ( 134152780 Kb)
Page File: \??\C:\pagefile.sys
Current: 134152780 Kb  Free Space: 134140828 Kb
Minimum: 134152780 Kb  Maximum:    402458340 Kb
Available Pages:    22047648 (  88190592 Kb)
ResAvail Pages:     32989232 ( 131956928 Kb)
Locked IO Pages:           0 (         0 Kb)
Free System PTEs:   33506900 ( 134027600 Kb)
Modified Pages:      9283506 (  37134024 Kb)

By default, CcDirtyPageThreshold is set at 50% of physical memory. However, if you look at the values at the moment of the dump, the threshold is well below 50% of physical memory, which means the operating system had started to lower it in an attempt to clear the dirty pages faster; see KB920739:

********************

The System Internals Cache Manager uses a variable that is named CcDirtyPageThreshold. By default, the value of CcDirtyPageThreshold may be set too high for scenarios where there are many lazy writes. By default, the CcDirtyPageThreshold global kernel variable is set to a value that is half of the physical memory. This variable triggers the cache manager’s write throttles.

*********************
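To put numbers on that, here is a quick back-of-the-envelope check using the page counts from the dump output above (the values are copied from !defwrites and the virtual memory summary; the variable names are just for illustration):

# Values taken from the dump above (4 KB pages)
$physicalPages     = 33538195   # Physical Memory (~128 GB)
$observedThreshold = 9242246    # CcDirtyPageThreshold from !defwrites (~35 GB)

# The default threshold would be roughly half of physical memory
$defaultThreshold = [math]::Floor($physicalPages / 2)   # 16,769,097 pages (~64 GB)

# The observed threshold is only about 28% of physical memory, well below the
# 50% default, so the OS had already lowered it to flush dirty pages more aggressively
'{0:P0} of physical memory' -f ($observedThreshold / $physicalPages)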

If you want to check all the kernel variables whose names contain "Dirty", use x nt!*Dirty*.

Once you find what you want, for instance fffff800`06020ae0 nt!CcTotalDirtyPages, just dump it:

7: kd> dc fffff800`06020ae0
fffff800`06020ae0  008d0932 00000000 00000001 00000000  2……………
fffff800`06020af0  0000001d 00000006 00000000 00000000  …………….

Now use .formats to convert the value to decimal, and you can verify that it matches the value in the !defwrites extension output.

7: kd> .formats 008d0932
Evaluate expression:
Hex:     00000000`008d0932
Decimal: 9242930

Now that we have an idea about the cause of the hang, and that it is related to a backup operation, we turn to the reasons why a backup operation would use the file system cache.

By checking the vendor’s settings, we verified that buffered I/O was turned on… Once we turned it OFF, backups completed successfully and, more importantly, the server did not hang.

I hope you find it useful, and hang on for the upcoming part 2: Too Much Read Cache.

Here are some good reads:

http://msdn.microsoft.com/en-us/library/windows/desktop/aa364218(v=vs.85).aspx

http://blogs.msdn.com/b/ntdebugging/archive/2007/11/27/too-much-cache.aspx

http://blogs.msdn.com/b/ntdebugging/archive/2007/10/10/the-memory-shell-game.aspx

Good Hunting,

Alessandro


PCI Parity Error

I was asked for a quick way to identify which hardware device is logging the following entry in the hardware log:


A PCI parity error was detected on a component at bus 64 device 5 function 2.

A dump was available, and after thinking about it a little, we can do the following with the kernel memory dump.

First we need to know the bus ID in hexadecimal format; within WinDBG, use the .formats command.

.formats 0n64 (the 0n prefix enters 64 as decimal)

64 decimal = 0x40 hexadecimal
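If you do not have a debugger handy, the same conversion is a one-liner in PowerShell (just an alternative, not required):

'0x{0:X}' -f 64   # returns 0x40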

List the PCI tree with the !pcitree extension and find bus 0x40:

Bus 0x40 (FDO Ext fffffa809203ace0)

(d=5, f=0) 80863c28 devext 0xfffffa8146a171b0 devstack 0xfffffa8146a17060 0880 Base System Device/’Other’ base system device

(d=5, f=2) 80863c2a devext 0xfffffa8146a161b0 devstack 0xfffffa8146a16060 0880 Base System Device/’Other’ base system device —> Problematic device

From here we already have what we need to pull as much information as possible about the device:

0: kd> !devstack 0xfffffa8146a16060

!DevObj !DrvObj !DevExt ObjectName

> fffffa8146a16060 \Driver\pci fffffa8146a161b0 NTPNP_PCI0019

!DevNode fffffa8092034d30 :

DeviceInst is “PCI\VEN_8086&DEV_3C2A&SUBSYS_04DB1028&REV_07\3&1e630cbb&0&2A”

Just a quick search for PCI\VEN_8086&DEV_3C2A:

http://wikidrivers.com/wiki/Intel_Chipset_Device

Intel Xeon Processor E5 Product Family/Core i7 Control Status and Global Errors – 3C2A – PCI\VEN_8086&DEV_3C2A
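If the box is still up, you can also match that hardware ID against Plug and Play on the live system. A rough sketch using WMI (the vendor/device string is the one from the dump; adjust as needed):

# Find PnP devices whose hardware ID matches the vendor/device pair from the dump
Get-WmiObject Win32_PnPEntity |
    Where-Object { $_.DeviceID -like '*VEN_8086&DEV_3C2A*' } |
    Select-Object Name, DeviceID, Status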

We know it is one of the CPUs generating the PCI parity event, but we cannot tell definitively which one, since the system has two physical CPUs, as you can check with the !sysinfo cpuinfo extension or with the Windows Object Manager extension (!object).

Just a snippet of !object \Global??\ shows the two Intel Xeons:

fffff8a0006fd850 SymbolicLink ACPI#GenuineIntel_-_Intel64_Family_6_Model_45_-________Intel(R)_Xeon(R)_CPU_E5-2650_0_@_2.00GHz#10#{97fadb10-4e33-40ae-359c-8bef029dbdd0}

fffff8a0006f87c0 SymbolicLink ACPI#GenuineIntel_-_Intel64_Family_6_Model_45_-________Intel(R)_Xeon(R)_CPU_E5-2650_0_@_2.00GHz#_6#{97fadb10-4e33-40ae-359c-8bef029dbdd0}.
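On a live system you can get the same head count without a dump by listing the physical processors (a small sketch using WMI):

# List physical CPU packages and their socket designations
Get-WmiObject Win32_Processor | Select-Object SocketDesignation, Name, NumberOfCores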

In this case, the action plan would be to check whether there are more entries in the Windows WHEA logs and to make sure the BIOS and firmware for the box are up to date.
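To look for additional WHEA entries, you can filter the System log for the WHEA-Logger provider (a sketch; the MaxEvents value is arbitrary):

# Pull recent hardware error events reported by WHEA
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Microsoft-Windows-WHEA-Logger' } -MaxEvents 50 |
    Select-Object TimeCreated, Id, Message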

I hope it helps you in your debugging.

Thanks,

Alessandro
