How To Collect Workload Data With Performance Meters

Many performance meters in your computing world will tell you how busy things are. That’s nice, but to make sense of that data, you also need to know how much work the system is being asked to handle.

With workload data you can see the performance meters with fresh eyes, as now you can evaluate how the system is responding to that workload and, with proper capacity planning or modeling, how the system will respond to a future peak workload.

To find or create a workload meter you and your coworkers have to agree on what the workload is and how to measure it. This will take some time as you’ve got to choose the transactions that will represent your workload from the many unique transactions users send in. Every company will settle on a different scheme and there is no perfect solution. Here are a few common ones I’ve encountered:

  • Treat all incoming transactions the same. Simply count them and you have your workload number.
  • Notice the vast majority of your incoming transactions do a similar thing, so count them as your workload number and ignore the others.
  • Only count the transaction that was at the center of your last performance catastrophe as your workload. This may be an unwise choice as always swinging to hit the previous pitch you missed will not improve your batting average.
  • Use the amount of money flowing into the company as the transaction meter. At $10K/min the CPU is 35% busy.

Whatever you decide to do will work fine as long as it passes the following simple test: Changes in workload should show proportional changes in the meters of key resources. What you are looking for is data like you see below.

[Figure: measured workload and Resource X utilization rising and falling together over time]

Clearly, as the measured workload increases, the utilization of Resource X follows along with it. It is just fine that the lines don’t perfectly overlap. They never will. It is the overall shape that is important. You are looking for these values to move in synchrony: workload changes cause a proportional change in utilization. Now, let’s look at Resource Y below.

[Figure: Resource Y utilization staying flat while the measured workload changes]

Resource Y is not experiencing any changes in utilization as the workload changes. This resource is not part of the transaction path for this workload. If it is supposed to be, perhaps you accidentally metered the wrong resource, or the right resource on the wrong system. I’ve made both of those mistakes.

Now, let’s look at Resource Z below.

[Figure: Resource Z utilization tracking the workload except for a spike between about 19:49 and 20:39]

The utilization of Resource Z mostly tracks the workload meter (they rise and fall together) except for about an hour starting at 19:49 and ending at 20:39. Here you need to use some common sense, as either:

  • The utilization spike could have been caused by something not related to the normal workload like a backup, or a software upgrade, or just the side effect of hitting some bug. In that case you can ignore the spike as you evaluate this workload meter.
  • The utilization spike was caused by a genuine jump in work, but your proposed workload meter did not see it. If your proposed meter missed a dramatic and sustained increase like the one we see here, then you need to search for a better workload meter.
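
If you want a quick numerical check to go with the eyeball test, you can compute a simple correlation between the candidate workload meter and the utilization of a key resource. This is only a rough sketch; the sample numbers are invented, and using Pearson correlation as the yardstick is my own suggestion, not a requirement.

```python
# Rough sanity check: does a candidate workload meter move in
# synchrony with the utilization of a key resource?
# All sample data below is made up for illustration.
from statistics import correlation  # Python 3.10+

workload = [120, 150, 180, 240, 300, 280, 220, 160]   # transactions per minute
utilization = [22, 27, 33, 45, 57, 52, 41, 30]        # resource % busy

r = correlation(workload, utilization)
print(f"correlation = {r:.2f}")

# Near 1.0 looks like Resource X: the meter tracks the resource.
# Near 0 looks like Resource Y: wrong resource, wrong system, or a
# meter that is missing the real work.
if r > 0.8:
    print("Workload meter and utilization move together.")
else:
    print("Weak correlation; keep looking for a better workload meter.")
```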

Once you’ve decided what will serve as the workload meter, how do you get the data you need? It would be lovely if the application gave you an easy-to-access meter for that, but that rarely happens. Usually, you have to look in odd places. If XYZ transactions are going to be your workload indicator, then you need to find some part of the XYZ transaction path where there is something to meter that uniquely serves that transaction. Here are some of the things I’ve done in the past to ferret out this key information:

  • For every XYZ transaction, process Q does two reads to a given file. Take the number of reads in the last interval and divide them by two to get the transaction rate.
  • For every 500 XYZ transactions, process Q burns one second of CPU. Take the number of CPU seconds consumed during the interval and multiply it by 500 to get the transaction rate.
  • For every XYZ transaction file Z grows by 1200 bytes. Take the change in the file size during the interval and divide that by 1200 to get the transaction rate.
  • For every XYZ transaction two packets are sent. Divide the packet count by two to get the transaction rate.

This list goes on, but the basic trick remains the same. First find some meter that closely follows the type of transaction you want to use as a workload meter. Then figure out how to adjust it mathematically so you get a transaction count.

That mathematical adjustment usually requires you to gather multiple days’ worth of data and then, using whatever independent information you can get on your transaction counts, work out the appropriate scaling factor.
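
To make that concrete, here is what the calibration might look like for the file-growth example above. The paired samples are hypothetical (imagine the known transaction counts came from end-of-day business reports), and the simple least-squares fit through the origin is just one reasonable way to get the scaling factor.

```python
# Sketch: calibrate a proxy meter (file growth) against known
# transaction counts gathered over several days. All numbers invented.

file_growth_bytes  = [1_210_000, 2_380_000, 1_790_000, 3_050_000]  # per interval
known_transactions = [1_000, 1_980, 1_490, 2_530]                  # per interval

# Least-squares fit of growth = bytes_per_txn * transactions (through the origin).
bytes_per_txn = (
    sum(g * t for g, t in zip(file_growth_bytes, known_transactions))
    / sum(t * t for t in known_transactions)
)
print(f"Estimated bytes per transaction: {bytes_per_txn:.0f}")   # roughly 1200

# From now on the proxy meter can be turned into a transaction count.
growth_this_interval = 2_640_000
print(f"Estimated transactions: {growth_this_interval / bytes_per_txn:.0f}")
```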

Reasonable people can argue that it is impossible to summarize a complex workload into one number. That may be true, but you can still do wildly useful things if you find a workload meter that tracks the utilization of key components reasonably well. Every evening on the news they quote a major stock index (like the Dow, the FTSE, or the Nikkei), and we find that a useful gauge of the overall economy. When selecting the workload, don’t go for perfect, go for close enough.

Synchronize Your Meters

It can make your job simpler and your graphs look cleaner if the programs that collect the meters are synchronized to the top of the minute. This makes the data easier to combine and compare across systems.

Most metering programs or scripts are just a big loop of commands that gather metering data. At the bottom of that loop is usually something that waits until it is time to gather the next round of samples.

Let’s say you want these meters gathered once a minute. If you do the easy thing and just wait for 60 seconds, the meters drift in time because the meters themselves take time to run. At every iteration the meters would start a few seconds later in the minute. What you need to do at the bottom of the loop is wait until the beginning of the next minute. Then your meters will stay in sync with the top of the minute.
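
A minimal sketch of that loop in Python, assuming a gather_meters() placeholder that stands in for whatever commands actually collect your data:

```python
import time

def gather_meters():
    # Placeholder for the commands that actually gather your meters.
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "collecting meters...")

while True:
    gather_meters()
    # Don't just sleep(60); sleep until the top of the next minute
    # so the samples never drift later and later into the minute.
    seconds_into_minute = time.time() % 60
    time.sleep(60 - seconds_into_minute)
```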

What You Need to Know About Any Meter

For any performance meter you need to know four key things about it or it tells you almost nothing. Any assumptions you make can easily lead you astray. If you know these four things, the meter becomes a lot more useful.

Let’s look at a simple little meter and explore our ignorance about it to figure out what we need to know. Imagine you learn this: The application did 3000 writes. Here are the four key things you don’t know.

1. The Time The Meter Was Taken

First and foremost, you need to know when the data was collected because no meter is an island. It is almost never the case that all possible metering data comes from one source. You’ll have to dig into various sources and coordinate with other people. The way you link them together is time.

If your meters are collected in different time zones, be sure to note that in the data. When discussing performance issues with people across time zones the most common mistake is to assume that we are all talking about the same point in time when we say: “at 3:05…”

Since time is usually easy to collect and takes up very little space, I always try to record the time starting with the year and going down to the second, for example: 2012-10-23 12:34:23. You may not need that precision for the current question, but someday you may need it to answer a different question.

Adding the time can tell you: The application reported 3000 writes at 3:05:10PM EST on June 19, 2012. You can now compare and contrast this data with all other data sources.
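
A small sketch of what stamping each sample might look like, assuming the samples are appended to a log file; the file name and the read_write_count() helper are made up for illustration:

```python
from datetime import datetime, timezone

def read_write_count():
    # Placeholder for however you actually read the meter.
    return 3000

# Record the time down to the second, with an explicit UTC offset, so
# data from different systems and time zones can be lined up later.
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S %z")
with open("meter_log.txt", "a") as log:
    log.write(f"{stamp} writes={read_write_count()}\n")
```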

2. The Sample Length of The Meter

Any meter that gives you an averaged value has to average the results over a period of time. The most common averaged value is a utilization number.

The two graphs below show exactly the same data with the only difference being the sample length of the meter. In the chart below the data was averaged every minute. Notice the very impressive spike in utilization in the middle of the graph. During this spike this resource had little left to give.

[Figure: utilization averaged over 1-minute samples, showing a sharp spike in the middle of the graph]

In the chart below, the same data was averaged every 10 minutes. Notice that the spike almost disappears because the samples were taken at such times that parts of the spike were averaged into different samples. Adjusting the sample length can dramatically change the story.

[Figure: the same utilization data averaged over 10-minute samples, with the spike largely flattened away]
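
You can see the same dilution effect on your own data by re-averaging fine-grained samples into longer buckets. A brief sketch with made-up one-minute utilization samples:

```python
# Re-average 1-minute utilization samples into 10-minute samples and
# watch a short spike get diluted. The data is made up: a 5-minute
# spike to ~95% busy in the middle of an otherwise quiet hour.

one_minute = [20] * 25 + [95, 97, 96, 94, 90] + [20] * 30

ten_minute = [
    sum(one_minute[i:i + 10]) / 10
    for i in range(0, len(one_minute), 10)
]

print("1-minute peak: ", max(one_minute))    # 97% busy
print("10-minute peak:", max(ten_minute))    # about 57% busy; far less alarming
```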

Some meters just report a count, and you’ve got to know when that count gets reset to zero or rolls over because the value is too big for the variable to hold. Some values start incrementing at system boot, some at process birth.
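
When you are handed a raw count like that, the usual trick is to read the counter every interval and take the difference, while allowing for a reset or rollover. A sketch of that bookkeeping, assuming an unsigned 32-bit counter (yours may be 64-bit, or may reset at process restart):

```python
# Turn a raw, ever-increasing counter into per-interval deltas,
# coping with rollover. COUNTER_MAX assumes an unsigned 32-bit
# counter; adjust it to match the meter you actually have.

COUNTER_MAX = 2**32

def interval_delta(previous, current):
    if current >= previous:
        return current - previous
    # The counter rolled over (or was reset); assume a single rollover.
    return (COUNTER_MAX - previous) + current

raw_reads = [4_294_960_000, 4_294_965_000, 1_800, 9_300]  # made-up samples
for prev, cur in zip(raw_reads, raw_reads[1:]):
    print(interval_delta(prev, cur))   # 5000, 4096, 7500
```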

Some meters calculate the average periodically on their own schedule, and you just sample the current results when you ask for the data. For example, a key utilization meter is calculated once every 60 seconds and, no matter what is going on, the system reports exactly the same utilization figure for the entire 60 seconds. This may sound like a picky detail to you now, but when you need to understand what’s happening in the first 30 seconds of market open, these little details matter.

Adding the sample length, we now know: The application reported 3000 writes between 3:00-3:05:10PM EST on June 19, 2012.

3. What Exactly Is Being Metered

As the old saying goes: “When you assume, you make an ass out of u and me.” Here we have two undefined terms: application and writes.

An “application” is usually many processes that can be spread over many computers. So we need a little more precision here. Where did those 3000 writes come from?  Just one process? All processes on a given system? All processes on all systems?

“Writes” can be measured in bytes, file records, database updates, disk blocks, etc. Some of these have much bigger performance impacts than others.

Even within a given metering tool, it is common to see the same word mean several different things in different places. Consistency is not a strong point in humans. So don’t assume. Ask, investigate, test and double check until you know what these labels mean. The more precisely you understand what the meter measures, the more cool things you can do with it.

Adding the specifics about what is being metered, we now know: All application processes on computer X reported 3000 blocks written to disk Y between 3:00-3:05:10PM EST on June 19, 2012.

4. The Units Used in The Meter

Lastly, pay attention to units. When working with data from multiple sources (and talking to multiple people) it is really easy to confuse the units of speed (milliseconds, microseconds), size (bits vs. bytes), and throughput (things per second or minute) and end up with garbage. It is much harder to communicate if I’m talking about “meters per second” and you are talking about “furlongs per fortnight.” It is best to standardize your units and use the same ones in all calculations and conversations.

Since 5 minutes is 300 seconds, we can calculate that the application was doing an average of 10 writes/second (3000 writes / 300 seconds).

So finally we can tell you that: All application processes on computer X wrote an average of 10 blocks per second to disk Y between 3:00-3:05:10PM EST on June 19, 2012.

Now we really know something about what’s going on and have a value specified in a common unit we can compare and contrast with other data.

Lastly, here is a brief example illustrating how important the correct use of units can be. I know of one company that had to give away about five million dollars’ worth of hardware on a fixed-price bid due to a metering mistake by the technical sales team in which kilobytes were confused with megabytes. Ouch.

History Repeating

The worst pain comes from falling in the same hole twice.

When you do the work to solve a performance problem, it is often a good idea to add a meter to your ongoing, always-running set of meters to watch for the return of that problem.

Remember the line from Casablanca: “Round up the usual suspects”? It is often the case that what caused the previous performance problem is the first thing blamed for the next one. Every minute spent working on the wrong problem is wasted time and lost opportunity. Having that extra meter running can help you quickly determine if the same old problem is back, or if there is a new mystery to solve.

All Meters Have Problems

You need metering data to do any performance work but metering data never perfectly adds up, never aligns 100%, and typically contains many numbers that are meaningless to you. If you are a type-A detail-oriented person, this can drive you nuts. I urge you to relax just a little bit.

When comparing different meters that look at the same general thing, remember they might be sampling at different points in the operating system, or sampling at different frequencies. They may be reporting different units – for example, a disk read can be reported in bytes, file records, disk blocks, logical IO, or physical IO. They may be counting directly, or sampling indirectly, the values they are reporting. If you don’t understand exactly what the meter is metering, then you are missing a lot of its value.

There is also a non-zero probability that a meter might just be wrong. Performance meters are just about the last thing added to any system or application. Mostly they are added in a hurry, to solve a problem, and much of their output may be utterly useless to anyone who doesn’t work in the vendor’s engineering department. Typically they are not part of the vendor’s quality assurance (QA) work and so, unless someone notices, they can start telling lies as the years pass.

Meters can also lie due to changes in technology over time. For example, in 1800 you might meter the utilization of a road by metering the horses that pass by per hour. That meter might still be around today and working perfectly, but will give the erroneous impression that the metered stretch of road has a rush hour utilization of zero.

Even beautifully crafted third-party performance metering tools can have a hidden problem that can bite you hard in a crisis. If you don’t know the name of the low level meter that is the source of the beautifully presented graphic data before you, then the only people you can discuss this data with are other users of this tool. Most likely the people in support (or the external vendor wizards) have never seen this tool and will ask for the basic operating system meters instead. This can make for an awkward and slow conversation in a time of a performance crisis.

As Things Get Busy

As the user load increases from light to crazy-busy, there are three things I’ve noticed that generally hold true on every customer system I’ve ever worked on.

Resource Usage Tends to Scale Linearly

Once you have a trickle of work flowing through the system (which gets programs loaded into memory, buffers initialized and caches filled), it has been my experience that if you give a system X% more work to do, it will burn X% more resources doing that work. It’s that simple. If the transaction load doubles, expect to burn twice the CPU, do twice the disk IO, etc.
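
That observation is the heart of a lot of back-of-the-envelope capacity planning. A hedged sketch of the arithmetic, with every number invented for illustration and the usual caveat that the linear assumption only holds once the system is past that initial trickle of work:

```python
# Project resource usage linearly from a measured point to a future peak.
# All numbers are made up for illustration.

current_load_tps = 400      # transactions per second measured today
current_cpu_busy = 35.0     # percent busy at that load
future_peak_tps = 700       # projected peak load

scale = future_peak_tps / current_load_tps
projected_cpu_busy = current_cpu_busy * scale
print(f"Projected CPU busy at the future peak: {projected_cpu_busy:.0f}%")  # ~61%
```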

There is often talk about algorithmic performance optimizations kicking in at higher load levels; in theory that sounds good. Sadly, most development projects are late, the pressure is high, and the good intentions of the programmers are often left on the drawing board. Once the application works, it ships, and the proposed optimizations are typically forgotten.

Performance Does Not Scale Linearly

Independent parallel processing is a wonderful thing. Imagine two service centers with their own resources doing their own thing and getting the work done. Now twice as much work is coming, so we add two more service centers. You’d expect the response time to stay the same and the throughput to double. Wouldn’t that be nice?

The problem is that at some point in any application all roads lead to a place where key data has to be protected from simultaneous updates or a key resource is shared. At some point more resources don’t help, and the throughput is limited. This is the bad news buried in Amdahl’s Law, which is something you should read more about.
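
For the curious, Amdahl’s Law says that if a fraction p of the work can be done in parallel and the rest is serialized on that shared, protected resource, the best possible speedup from N service centers is 1 / ((1 - p) + p / N). A quick sketch shows how fast the ceiling arrives (the 90% parallel figure is just an example):

```python
# Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the
# fraction of the work that can run in parallel and n is the number
# of independent service centers.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

p = 0.90  # 90% of the work parallelizes; 10% is serialized
for n in (1, 2, 4, 8, 16, 1000):
    print(f"{n:5d} service centers -> {amdahl_speedup(p, n):.2f}x speedup")
# Even with 1000 service centers the speedup never passes 1 / (1 - p) = 10x.
```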

When All Hell Breaks Loose, Weird Things Happen

At very high transaction levels many applications can suffer algorithmic breakdown when utterly swamped with work. For example, a simple list of active transactions works well under normal load, when there are usually fewer than ten things on the list, but becomes a performance nightmare once there are 100,000 active transactions on the list. The sort algorithm used to maintain the list was not designed to handle that load. That’s when you see the throughput curve turn down.

[Figure: throughput vs. offered load, with the throughput curve turning down at very high load]

This can happen when a source of incoming transactions loses connectivity to you and, while disconnected, it buffers up transactions to be processed later. When the problem is fixed, the transaction source typically sends the delayed transactions at a relentless pace, and your system is swamped until it chews through that backlog. Plans need to be made to take these tsunami-like events into account.

For example, I’ve seen this at banks processing ATM transactions, where normally the overall load changes gradually throughout the day. Now a subsidiary loses communication and then reconnects after a few hours. That subsidiary typically dumps all the stored ATM transactions into the system as fast as the comm lines will move them. Building in some buffering, so that all the pending transactions can’t hit the system at once, can be a smart thing to do here.
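
One simple way to build in that buffering is to drain the reconnect backlog at a capped rate instead of letting it hit the system all at once. This is only a sketch of the idea; the cap and the process_transaction() placeholder are assumptions you would replace with your own limits and code.

```python
import time
from collections import deque

MAX_TXNS_PER_SECOND = 200   # assumed: what the system can absorb comfortably

def process_transaction(txn):
    pass  # placeholder for the real work

def drain_backlog(backlog: deque):
    """Feed a stored-up backlog into the system at a capped rate."""
    while backlog:
        start = time.time()
        for _ in range(min(MAX_TXNS_PER_SECOND, len(backlog))):
            process_transaction(backlog.popleft())
        # Idle out the rest of the second so we never exceed the cap.
        elapsed = time.time() - start
        if elapsed < 1.0:
            time.sleep(1.0 - elapsed)

# Example: a few hours of stored ATM transactions arriving after a reconnect.
drain_backlog(deque(range(5_000)))
```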

Sometimes the tsunami-like load comes as part of disaster recovery, where all the load is suddenly sent to the remaining machine(s).  Here your company needs to decide how much money they want to spend to make these rare events tolerable.

Oddly enough, there is a case where throughput can go up while response time is going down under heavy load. This happens when something is failing, and it is never a good thing. How does this happen? It is often faster to fail than it is to do all the work the transaction requires. It is much faster to say “I give up” than it is to actually climb the mountain.

[Figure: under heavy load, throughput suddenly rises while average response time falls]

In the graph above, you see the system under heavy load. When throughput (transactions completed) suddenly increases while the average response time drops, start looking for problems. This is a cry for help. The users here are not happy with the results they are receiving.
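
If you keep per-interval throughput and response time side by side, spotting that pattern can be automated with a crude check like the one below. The thresholds and the sample data are invented; tune them to what is normal for your system.

```python
# Flag intervals where throughput jumps while response time drops,
# which is often a sign that transactions are failing fast instead
# of finishing. Sample data is made up for illustration.

throughput = [410, 405, 420, 415, 790, 820]   # transactions per minute
resp_time  = [2.1, 2.2, 2.0, 2.1, 0.3, 0.2]   # average seconds per transaction

for i in range(1, len(throughput)):
    tput_jumped    = throughput[i] > 1.5 * throughput[i - 1]
    resp_collapsed = resp_time[i] < 0.5 * resp_time[i - 1]
    if tput_jumped and resp_collapsed:
        print(f"Interval {i}: throughput up, response time down; "
              "check for failing transactions")
```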

For More Information:

There are more insights, hints, tricks, and truisms that I gathered over my 25+ year career in performance work in my book: The Every Computer Performance Book

A short, occasionally funny, book on how to solve and avoid application and/or computer performance problems.