Performance monitoring is about understanding what’s happening right now. It usually includes dealing with immediate performance problems or collecting data that will be used by the other performance tools (such as capacity planning) to plan for future peak loads.
In performance monitoring you need to know three things:
- The incoming workload
- The resulting resource consumption
- What is normal under this load
Without these three things you can only solve the most obvious performance problems and have to rely on tools outside the scientific realm (such as a Ouija Board, or a Magic 8 Ball) to predict the future.
You need to know the incoming workload (what the users are asking your system to do) because all computers run just fine under no load. Performance problems crop up as the load goes up. These performance problems come in two basic flavors: Expected and Unexpected.
Expected problems are when the users are simply asking the application for more things per second than it can do. You see this during an expected peak in demand like the biggest shopping day of the year. Expected problems are no fun, but they can be foreseen and, depending on the situation, your response might be to endure them, because money is tight or because the fix might introduce too much risk.
Unexpected problems are when the incoming workload should be well within the capabilities of the application, but something is wrong and either the end-user performance is bad or some performance meter makes no sense. Unexpected problems cause much unpleasantness and demand rapid diagnosis and repair.
The key to all performance work is to know what is normal. Let me illustrate that with a trip to the grocery store.
One day I was buying three potatoes and an onion for a soup I was making. The new kid behind the cash register looked at me and said: “That will be $22.50.” What surprised me was the total lack of internal error checking at this outrageous price (in 2012) for three potatoes and an onion. This could be a simple case of them not caring about doing a good job, but my more charitable assessment is that he had no idea what “normal” was, so everything the register told him had to be taken at face value. Don’t be like that kid.
On any given day you, as the performance person, should be able to have a fairly good idea of how much work the users are asking the system to do and what the major performance meters are showing. If you have a good sense of what is normal for your situation, then any abnormality will jump right out at you in the same way you notice subtle changes in a loved one that a stranger would miss. This can save your bacon because if you spot the unexpected utilization before the peak occurs, then you have time to find and fix the problem before the system comes under a peak load.
There are some challenges in getting this data. For example:
- There is no workload data.
- The only workload data available (ex: per day transaction volume) is at too low a resolution to be any good for rapid performance changes.
- The workload is made of many different transaction types (buy, sell, etc.) It’s not clear what to meter.
With rare exception I’ve found the lack of easily available workload information to be the single best predictor of how bad the overall situation is performance wise. Over the years as I visited company after company this led me to develop Bob’s First Rule of Performance Work: “The less a company knows about the work their system did in the last five minutes, the more deeply screwed up they are.”
What meters should you collect? Meters fall into big categories. There are utilization meters that tell you how busy a resource is, there are count meters that count interesting events (some good, some bad), and there are duration meters that tell you how long something took. As the commemorative plate infomercial says: “Collect them all!” Please don’t wait for perfection. Start somewhere, collect something and, as you explore and discover, add newly discovered meters to your collection.
When should you run the meters? Your meters should be running all the time (like bank security cameras) so that when weird things happen you have a multitude of clues to look at. You will want to search this data by time (What happened at 10:30?), so be sure to include timestamps.
The data you collect can also be used to predict the future with tools like: Capacity Planning, Load Testing, and Modeling. But, that’s another post for another day.