A performance meter tells you almost nothing unless you know four key things about it. Any assumptions you make can easily lead you astray. If you know these four things, the meter becomes a lot more useful.
Let’s look at a simple little meter and explore our ignorance about it to figure out what we need to know. Imagine you learn this: The application did 3000 writes. Here are the four key things you don’t know.
1. The Time The Meter Was Taken
First and foremost, you need to know when the data was collected because no meter is an island. It is almost never the case that all possible metering data comes from one source. You’ll have to dig into various sources and coordinate with other people. The way you link them together is time.
If your meters are collected in different time zones, be sure to note that in the data. When discussing performance issues with people across time zones the most common mistake is to assume that we are all talking about the same point in time when we say: “at 3:05…”
Since time is usually easy to collect and takes up very little space, I always try to record the time starting with the year and going down to the second, for example: 2012-10-23 12:34:23. You may not need that precision for the current question, but someday you may need it to answer a different question.
Adding the time can tell you: The application reported 3000 writes at 3:05:10PM EST on June 19, 2012. You can now compare and contrast this data with all other data sources.
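One way to avoid the "at 3:05…" confusion is to convert every timestamp to UTC before comparing data sources. Here is a minimal sketch in Python. The function name `to_utc`, the two log names, and the fixed UTC offsets are my own assumptions for illustration; in real data you would take the offset from whatever your metering tool records.

```python
from datetime import datetime, timedelta, timezone

def to_utc(local_text: str, utc_offset_hours: float) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' recorded at a fixed UTC offset; return it in UTC."""
    naive = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S")
    local = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)

# Hypothetical readings: one logged in EST (UTC-5), one logged in UTC.
app_log = to_utc("2012-06-19 15:05:10", -5)
disk_log = to_utc("2012-06-19 20:05:10", 0)

print(app_log.isoformat())  # 2012-06-19T20:05:10+00:00
print(app_log == disk_log)  # True: both describe the same instant
```

The two readings look five hours apart on paper, but once both are in UTC they turn out to be the same moment.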
2. The Sample Length of The Meter
Any meter that gives you an averaged value has to average the results over a period of time. The most common averaged value is a utilization number.
The two graphs below show exactly the same data; the only difference is the sample length of the meter. In the first chart the data was averaged every minute. Notice the very impressive spike in utilization in the middle of the graph. During this spike this resource had little left to give.
In the second chart the same data was averaged every 10 minutes. Notice that the spike almost disappears, because the sample boundaries happened to split the spike so that part of it was averaged into each of two samples. Adjusting the sample length can dramatically change the story.
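You can see the same effect with a few lines of Python. This is only a sketch with made-up numbers (the `per_minute` values and the `resample` helper are assumptions, not data from the charts), but it shows how a spike that straddles a sample boundary gets averaged away.

```python
from statistics import mean

# Hypothetical per-minute utilization (%) over 20 minutes:
# quiet, then a 4-minute spike, then quiet again.
per_minute = [20] * 8 + [95, 100, 100, 95] + [20] * 8

def resample(samples, window):
    """Average consecutive groups of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]

ten_minute = resample(per_minute, 10)

print(max(per_minute))  # 100
print(ten_minute)       # [35.5, 35.5]
```

With one-minute samples the resource is clearly pegged at 100%. With 10-minute samples the spike is split across two windows, and neither one ever rises above 35.5%.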
Some meters just report a count, and you’ve got to know when that count gets reset to zero or rolls over because the value is too big for the variable to hold. Some values start incrementing at system boot, some at process birth.
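For a counter that rolls over, the number of events between two readings can usually be recovered with modular arithmetic. The sketch below assumes an unsigned counter of a known width (32 bits here, which is a guess you must verify for your meter) and at most one rollover between readings. The name `counter_delta` is my own.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Events between two readings of an unsigned counter that wraps at 2**bits.

    Assumes the counter wrapped at most once and was not reset in between.
    """
    return (current - previous) % (2 ** bits)

print(counter_delta(1000, 4000))          # 3000, no rollover
print(counter_delta(2**32 - 1000, 2000))  # 3000, counter rolled over once
```

Note that this cannot tell a rollover from a reset to zero (say, at a process restart). For that you need to know when the counter started, which is one more reason to record the time.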
Some meters calculate the average periodically on their own schedule, and you just sample the current results when you ask for the data. For example, a key utilization meter is calculated once every 60 seconds and, no matter what is going on, the system reports exactly the same utilization figure for the entire 60 seconds. This may sound like a picky detail to you now, but when you need to understand what’s happening in the first 30 seconds of market open, these little details matter.
Adding the sample length, we now know: The application reported 3000 writes between 3:00:10 and 3:05:10 PM EST on June 19, 2012.
3. What Exactly Is Being Metered
As the old saying goes: “When you assume, you make an ass out of u and me.” Here we have two undefined terms: application and writes.
An “application” is usually many processes that can be spread over many computers. So we need a little more precision here. Where did those 3000 writes come from? Just one process? All processes on a given system? All processes on all systems?
“Writes” can be measured in bytes, file records, database updates, disk blocks, etc. Some of these have much bigger performance impacts than others.
Even within a given metering tool, it is common to see the same word mean several different things in different places. Consistency is not a strong point in humans. So don’t assume. Ask, investigate, test and double check until you know what these labels mean. The more precisely you understand what the meter measures, the more cool things you can do with it.
Adding the specifics about what is being metered, we now know: All application processes on computer X reported 3000 blocks written to disk Y between 3:00:10 and 3:05:10 PM EST on June 19, 2012.
4. The Units Used in The Meter
Lastly, pay attention to units. When working with data from multiple sources (and talking to multiple people) it is really easy to confuse the units of time (milliseconds vs. microseconds), size (bits vs. bytes), and throughput (things per second vs. per minute) and end up with garbage. It is much harder to communicate if I’m talking about “meters per second” and you are talking about “furlongs per fortnight”. It is best to standardize your units and use the same ones in all calculations and conversations.
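A simple way to standardize is to convert every value to one base unit the moment it enters your analysis. Here is a small sketch for response times. The table `TO_MICROSECONDS`, the function name, and the two "team" values are hypothetical; I use integers and microseconds as the base unit to avoid floating-point rounding surprises.

```python
# Conversion factors to a single base unit (microseconds).
TO_MICROSECONDS = {"us": 1, "ms": 1_000, "s": 1_000_000}

def to_microseconds(value: int, unit: str) -> int:
    """Convert a time value to microseconds; refuse units we do not recognize."""
    if unit not in TO_MICROSECONDS:
        raise ValueError(f"unknown time unit: {unit!r}")
    return value * TO_MICROSECONDS[unit]

# Two people both say "the response time is 250".
team_a = to_microseconds(250, "ms")
team_b = to_microseconds(250, "us")

print(team_a, team_b)    # 250000 250
print(team_a // team_b)  # 1000
```

Rejecting unknown units, rather than guessing, forces the question "what unit is this?" to be asked early, when it is still cheap to answer.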
Since 5 minutes is 300 seconds, we can calculate that the application was doing an average of 3000 writes / 300 seconds = 10 writes/second.
So finally we can tell you that: All application processes on computer X wrote an average of 10 blocks per second to disk Y between 3:00:10 and 3:05:10 PM EST on June 19, 2012.
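If you store meter readings yourself, it can help to make all four things part of the record so none of them gets lost along the way. The following is one possible sketch, not a standard format: the `MeterReading` class and its field names are my own invention, and it assumes a five-minute sample ending at 3:05:10 PM.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class MeterReading:
    start: datetime  # when the sample began
    end: datetime    # when the sample ended (the time the meter was taken)
    source: str      # who did the work
    target: str      # what the work was done to
    unit: str        # what exactly is being counted
    count: int       # the raw meter value

    def per_second(self) -> float:
        """Average rate over the sample, in units per second."""
        seconds = (self.end - self.start).total_seconds()
        if seconds <= 0:
            raise ValueError("sample must have a positive length")
        return self.count / seconds

reading = MeterReading(
    start=datetime(2012, 6, 19, 15, 0, 10),
    end=datetime(2012, 6, 19, 15, 5, 10),
    source="all application processes on computer X",
    target="disk Y",
    unit="blocks",
    count=3000,
)

print(f"{reading.per_second():.1f} {reading.unit}/second")  # 10.0 blocks/second
```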
Now we really know something about what’s going on and have a value specified in a common unit we can compare and contrast with other data.
Here is a brief example illustrating how important correct use of units can be. I know of one company that had to give away about five million dollars’ worth of hardware on a fixed-price bid because of a metering mistake by the technical sales team, who confused kilobytes with megabytes. Ouch.