Recently had a MSSQL 2005 server stop and required a windows server reboot
The follow is the technical detail and explanation/resources I found/followed – hope it is of use to somebody!
Current & relevant configuration:
- Standard Edition (64-bit) on Windows NT 5.2 (Build 3790: Service Pack 2)
- Microsoft SQL Server 2005 – 9.00.4207.00 (X64) = SP3 CU1; latest MS build is 9.00.4305 SP3 CU10
- 31.9GB physical RAM, 4092MB page file on c:; windows recommends 49,143MB; my normal recommendation is 2.5x physical = 81,920MB
- SQL Server configured for max memory 27,000MB ~= 26.36GB; not to use AWE; “Lock Pages in Memory” is disabled as x64/default
Train of events:
- last logged normal operation which was a transaction log backup as per schedule
- Log message “A significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: 0 seconds. Working set (KB): <size>, committed (KB): <size>, memory utilization: 0%.”
- server rebooted
- server starting services
- SQL recovery complete – server ready to accept and process connections; service resumed
- SQL reports database mirroring is disconnected
- Error: 9642, Severity: 16, State: 3.
- An error occurred in a Service Broker/Database Mirroring transport connection endpoint, Error: 8474, State: 11. (Near endpoint role: Target, far endpoint address: ”)
- MSSQL restart resolves mirroring disconnect – details attached
- Full DR resumed
Investigations into root cause of stop:
- Given the current configuration of page file it is understandable the server ran out of memory.
- Research (ref1, ref2 & ref3) into the memory page out error suggests the server basically ran out of memory
- ref1 correlates that other people have had stops due to the same cause; refers ref2
- ref2 gives background regarding memory working sets for SQL and concurs that a SQL server is simply not just the memory for sqlservr.exe
- ref3 (KB918483) explains how to prevent the OS operations that lead to the stop
- This article then refers a blog post (ref4) form the CSS engineers who have given background on memory trim’s; there are three (3) types of trim:
- Signalled trim – does not generate our error message
- Self trim – the OS requires additional RAM for pages not in memory, the “amount to trim can be ~1/4 the current working set size” – therefore we don’t have a page file large enough for this trim
- Hard trim – system memory is very low and the “trim can be the entire working set” – again, we do not have a large enough page file
- Therefore it is essential we increase the page file size.
- ref4 gives several signals as to the cause of the memory trim. Those of immediate interest are as follows:
- system file cache size (ref5)
- “TCP Chimney” with Broadcom network adapters
- ref5 discusses the performance counters that indicate whether there is an underlying OS issue with system file cache size
- This article then refers a blog post (ref4) form the CSS engineers who have given background on memory trim’s; there are three (3) types of trim:
Next Steps:
- Increase page file size; 2.5x is my own blanket rule of thumb, but given the drive capacities we may had to increase to a lower value initially and monitor:
- Check Broadcom Network Adapter drivers are latest
- PerfMon and monitor the system file cache size issue