Some weeks the same type of issue keeps recurring, which pushes you to study and understand a little more about Windows Internals.
These past two weeks I had a couple of cases involving hangs that were caused by the File System Cache, and they will be the topic of this two-part blog.
With every support incident it is very important to understand the system and collect as much information as possible to help with the issue. In this case, a server had a brief hang: it was unresponsive for a couple of minutes and then recovered on its own. The biggest issue was that the backup operation never completed successfully, since the application would time out before the server recovered.
During the hang period a kernel dump was collected to try to discover the reason the server was unresponsive, and luckily the dump was collected at the right time.
The investigation pointed to a large number of outstanding pages within the File System Cache, or dirty pages, that exceeded the threshold set by the operating system. The easiest way to check these values is with the WinDBG extension !defwrites:
*** Cache Write Throttle Analysis ***
CcTotalDirtyPages: 9242930 (36971720 Kb)
CcDirtyPageThreshold: 9242246 (36968984 Kb)
MmAvailablePages: 22047648 (88190592 Kb)
MmThrottleTop: 450 ( 1800 Kb)
MmThrottleBottom: 80 ( 320 Kb)
MmModifiedPageListHead.Total: 9283506 (37134024 Kb)
CcTotalDirtyPages >= CcDirtyPageThreshold, writes throttled
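To make the numbers in this output easier to read, here is a minimal Python sketch of the arithmetic behind them. It assumes a 4 KB page size (which is what the page and Kb columns above imply) and reduces the throttle decision to the single comparison printed on the last line of the output; the real Cache Manager logic takes more inputs into account, so treat the helper names here as hypothetical illustrations only.

```python
# Assumption: 4 KB pages, consistent with the page/Kb pairs in the !defwrites output.
PAGE_SIZE_KB = 4


def pages_to_kb(pages: int) -> int:
    """Convert a page count to kilobytes."""
    return pages * PAGE_SIZE_KB


def writes_throttled(total_dirty_pages: int, dirty_page_threshold: int) -> bool:
    """Simplified check mirroring the last line of the !defwrites output."""
    return total_dirty_pages >= dirty_page_threshold


# Values copied from the dump above.
cc_total_dirty_pages = 9242930
cc_dirty_page_threshold = 9242246

print(pages_to_kb(cc_total_dirty_pages))     # 36971720
print(pages_to_kb(cc_dirty_page_threshold))  # 36968984
print(writes_throttled(cc_total_dirty_pages, cc_dirty_page_threshold))  # True
```

The dirty page count is only 684 pages above the threshold, but that is enough for writes to be throttled.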
This indicated that the server was throttling writes to the file system until it could keep up and flush those pages. Another detail is that the server had plenty of memory available:
*** Virtual Memory Usage ***
Physical Memory: 33538195 ( 134152780 Kb)
Page File: \??\C:\pagefile.sys
Current: 134152780 Kb Free Space: 134140828 Kb
Minimum: 134152780 Kb Maximum: 402458340 Kb
Available Pages: 22047648 ( 88190592 Kb)
ResAvail Pages: 32989232 ( 131956928 Kb)
Locked IO Pages: 0 ( 0 Kb)
Free System PTEs: 33506900 ( 134027600 Kb)
Modified Pages: 9283506 ( 37134024 Kb)
However, CcDirtyPageThreshold is set by default to 50% of physical memory, and the value at the moment of the dump (9242246 pages, roughly 35 GB) is well below half of the roughly 128 GB of physical memory. This means the operating system had started to lower the threshold in an attempt to clear the dirty pages faster. See KB920739:
The System Internals Cache Manager uses a variable that is named CcDirtyPageThreshold. By default, the value of CcDirtyPageThreshold may be set too high for scenarios where there are many lazy writes. By default, the CcDirtyPageThreshold global kernel variable is set to a value that is half of the physical memory. This variable triggers the cache manager’s write throttles.
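As a quick sanity check of the "half of physical memory" default against this dump, the ratio can be computed from the two page counts shown earlier. This is only a sketch using the values from the debugger output; the variable names are mine, not kernel symbols.

```python
# Values copied from the debugger output above (in pages).
physical_pages = 33538195
cc_dirty_page_threshold = 9242246

# Default described in KB920739: half of physical memory.
default_threshold_pages = physical_pages // 2

# Fraction of physical memory the threshold represented at dump time.
ratio = cc_dirty_page_threshold / physical_pages

print(default_threshold_pages)                             # 16769097
print(f"{ratio:.1%}")                                      # about 27.6%
print(cc_dirty_page_threshold < default_threshold_pages)   # True
```

So at the time of the dump the threshold sat at a bit more than a quarter of physical memory rather than half.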
If you want to check all the variables whose names contain Dirty, use x nt!*Dirty*.
Once you find the one you want, for instance fffff800`06020ae0 nt!CcTotalDirtyPages, just dump it:
7: kd> dc fffff800`06020ae0
fffff800`06020ae0 008d0932 00000000 00000001 00000000 2...............
fffff800`06020af0 0000001d 00000006 00000000 00000000 ................
Now use .formats to convert the value to decimal, and you can verify that it matches the value in the !defwrites extension output.
7: kd> .formats 008d0932
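If you do not have a debugger session at hand, the same conversion can be reproduced outside WinDbg. The sketch below is plain Python, not debugger output: it converts the first DWORD from the dc dump to decimal and, as a side note, shows why the ASCII column of that dump starts with a 2 (the bytes are stored little-endian, so the low byte 0x32 comes first in memory).

```python
# First DWORD shown by dc at nt!CcTotalDirtyPages.
raw_dword = "008d0932"

# Hex to decimal, the same conversion .formats performs.
value = int(raw_dword, 16)
print(value)  # 9242930

# The value as it sits in memory (little-endian byte order).
memory_bytes = value.to_bytes(4, "little")

# Rough imitation of the dc ASCII column: printable bytes as characters, others as dots.
ascii_column = "".join(chr(b) if 32 <= b < 127 else "." for b in memory_bytes)
print(ascii_column)  # 2...
```

The decimal result matches the CcTotalDirtyPages value reported by !defwrites.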
Now that we have an idea of the cause of the hang and know that it is related to a backup operation, the next question is: why would a backup operation use the File System Cache?
By checking the vendor's settings we verified that Buffered I/O was turned on. Once we turned it off, backups completed successfully and, more importantly, the server no longer hung.
Hope you enjoyed it, and hang on for the upcoming part 2 on Too Much Read Cache.
Here are some good reads: