Errors, even little ones, can be performance killers. Collect every meter you can that tracks errors and other unfortunate events. Over time, investigate what they are telling you:
- What is the nature of this problem?
- What causes this problem?
- Why is this problem happening?
- How does the system work around (or suffer because of) this problem?
The errors that most affect response time and throughput tend to be “timeout” errors, where something waited and waited and finally gave up. Big problems with timeout errors tend to show up as suspiciously low utilization. There is work waiting, but key resources are less busy than normal.
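The low-utilization symptom described above can be checked mechanically. The following is a minimal sketch, not a definitive detector: the `Sample` fields, the `suspicious_idle` name, the `fraction` threshold, and the sample values are all hypothetical, and it assumes you already know what normal utilization looks like for the resource in question.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: str         # label for the sampling interval, e.g. "11:15"
    waiting_work: int   # hypothetical meter: requests queued or in flight
    utilization: float  # busy fraction of a key resource, 0.0 to 1.0


def suspicious_idle(samples, normal_utilization, fraction=0.5):
    """Return samples where work is waiting but the resource is
    much less busy than normal (below fraction * normal_utilization)."""
    threshold = normal_utilization * fraction
    return [
        s for s in samples
        if s.waiting_work > 0 and s.utilization < threshold
    ]


# Invented illustrative values, not measurements from any real system.
samples = [
    Sample("11:00", 12, 0.62),  # busy and working: normal
    Sample("11:15", 40, 0.21),  # work piling up, resource mostly idle
    Sample("11:30", 0, 0.05),   # idle, but nothing is waiting: fine
]
flagged = suspicious_idle(samples, normal_utilization=0.60)
print([s.minute for s in flagged])
```

An interval flagged this way is only a hint that something may be waiting on a timeout; it still has to be matched against the error meters to confirm it.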
Some errors are unavoidable. You will always see a few of them in the data. The key is to know what’s normal. When monitoring errors, notice when there are a lot more errors than usual for a given transaction rate. Investigate that.
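One way to express "a lot more errors than usual for a given transaction rate" is to compare each interval's errors per transaction against a baseline taken from a period you consider normal. This is a sketch under assumptions: the function names, the tuple layout, the use of the median as the baseline, and the `factor` of 3 are illustrative choices, and the numbers are invented.

```python
from statistics import median


def error_ratio(errors, transactions):
    """Errors per transaction; 0.0 when there were no transactions."""
    return errors / transactions if transactions else 0.0


def unusual_intervals(history, recent, factor=3.0):
    """history and recent are lists of (label, transactions, errors).

    Returns the labels in recent whose error ratio exceeds
    factor times the median error ratio seen in history.
    history must not be empty.
    """
    baseline = median(error_ratio(e, t) for _, t, e in history)
    return [
        label for label, t, e in recent
        if error_ratio(e, t) > factor * baseline
    ]


# Invented illustrative values: (label, transactions, errors) per minute.
history = [("10:00", 3400, 5), ("10:01", 3500, 7), ("10:02", 3600, 6)]
recent = [("11:10", 3500, 8), ("11:20", 3500, 40)]

print(unusual_intervals(history, recent))
```

Dividing by the transaction count matters here: a busy minute naturally produces more errors than a quiet one, so raw error counts alone would raise false alarms at peak load.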
In the graph above, the transaction rate is fairly steady, but just after 11:15 the error rate takes off. Don’t panic; keep a sense of scale. At 12:00 there are about 3500 transactions per minute and just under 80 errors per minute, or roughly one error for every 44 transactions. You should still investigate this, especially if the response time increased at the same time. With so few errors per transaction, this error is unlikely to be the cause of an overall response time problem, but it might be an interesting clue as to what’s going on.
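The sense-of-scale arithmetic is simple enough to script for any pair of meters. The sketch below uses the approximate values read from the graph at 12:00; 80 is treated as an upper bound because the error rate is described as slightly below that.

```python
transactions_per_minute = 3500  # approximate value at 12:00
errors_per_minute = 80          # upper bound; the actual rate is slightly lower

# How many transactions go by for each error seen.
transactions_per_error = transactions_per_minute / errors_per_minute

# Errors expressed as a percentage of the transaction count.
error_percentage = 100 * errors_per_minute / transactions_per_minute

print(f"about one error per {transactions_per_error:.0f} transactions")
print(f"errors are about {error_percentage:.1f}% of the transaction count")
```

Because both inputs are rough readings from a graph, the result is best treated as an order-of-magnitude figure: errors are a small percentage of transactions, not a large share of them.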