The day before yesterday I saw our main live server’s CPU go up to constantly 100% with just the occasional short drop to a lower level. The exact opposite of what you’d want to see.
We’re log shipping every 15 minutes and part of that involves calling WinRAR to compress the log backups before copying them over. (We’re on SQL2005 so there’s no native compression and we have bandwidth issues with the connection to our remote site.) I realised the log shipping jobs were taking about 10 minutes and that most of that was spent shipping a ‘live’ reporting database that is completely rebuilt every 20 minutes. (I’m just trying to keep this stuff alive until I can improve it.) We can rebuild this database in minutes if we have to fail over so I disabled log shipping of that database. The log shipping went down to less than 2 minutes and I went off to the SQL Social evening in London feeling quite pleased with myself. It was a great evening – fun, educational and thought-provoking. Thanks to Simon Sabin & co for laying that on, and thanks too to the guests for making the effort when they must have been pretty worn out after doing DevWeek all day first.
The next morning I came down to earth with a bump: CPU still at 100%. WTF? I looked in the activity monitor but it was confusing because some sessions have been running for a long time so it’s not a good guide what’s using the CPU now. I tried the standard reports showing queries by CPU (average and total) but they only show the top 10 so they just show my big overnight archiving and data cleaning stuff. But the Profiler showed it was four queries used by our new website usage tracking system. Four simple indexes later the CPU was back where it should be: about 20% with occasional short spikes.
So the moral is: even when you’re convinced you’ve found the cause and fixed the problem, you HAVE to go back and confirm that the problem has gone.
And, yes, I have checked the CPU again today and it’s still looking sweet.