For short duration performance problems where you know when they will start (market open, 10pm backup, etc.) here are some tips for setting up special metering to catch them:
- Start your meters well before the problem happens. Have them run a few times to be sure they are working as expected and have them just sleep until 15 minutes before the problem starts.
- Meter at a frequency that is at least ¼ of the expected duration of the event – this gives you multiple samples during the event.
- Let the meters run for 15 minutes after the problem is usually over.
- Now you have meters collected before, during, and after the event. Compare and contrast them looking for what changed and what stayed the same during the event.
It is quite common for people to be suspicious that the new metering you are running is making the problem much worse. That’s why it is a very good idea to have it running well before the anticipated start of the problem. There is some cause for this suspicion as a small typo can turn a “once per minute” metering macro into a “fast as you can” metering macro that burns CPU, fills disks, and locks data structures at a highly disruptive rate. Like a physician, your primary goal should be: First, do no harm. It is always a good idea to test your meters at a non-critical time and (if possible) to meter the meters so you can show the resource usage of your meters.
If the problem happens without warning, then, if possible, identify something or some event that usually precedes the problem that you can “trigger on” to start the intensive metering. A trigger might be when the queue has over X things in it, or when something fails, or when something restarts, etc. Finding the trigger can sometime be frustrating, as correlation does not always mean causality. Keep searching the logs and any other meters you have.
Sometimes, all you have to go on is that it “happens in the morning” or “mostly on Mondays.” Work with what you’ve got and meter during those times.
If the problem has no known trigger and seems to happen randomly, you’ll have to intensively meter for it until it happens again. This will burn some system resources and give you a mountain of data to wade through. If this is a serious problem, then buckle up and do the work.