Deconstructing Response Time

The overall response time is what most people care about. It is the average amount of time it takes for a job (a.k.a. request, transaction, etc.) to get processed. The two big contributors to response time (ignoring transmission time for the moment) are the service time, the time it takes to do the work, and the wait time, the time spent waiting for a turn to be serviced. Here is the formula:

ResponseTime = WaitTime + ServiceTime


If you know the wait time, you can show how much faster things will flow if your company spends the money to fix the problem(s) you’ve discovered. If you know the service time, then you know the maximum throughput:

MaxThroughput ≤ 1 / AverageServiceTime

For example, a key process in your transaction path with an average service time of 0.1 seconds has a maximum throughput of 1 / 0.1 = 10 per second.
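That arithmetic can be sketched in a few lines. The function name below is mine, chosen for illustration; it is not part of any metering tool:

```python
# Maximum throughput implied by a measured average service time.

def max_throughput(avg_service_time_s: float) -> float:
    """Upper bound on completions per second for one service center."""
    return 1.0 / avg_service_time_s

# The 0.1 second example from above: at most 10 transactions per second.
print(max_throughput(0.1))  # 10.0
```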

Sadly, response time is the only number that most meters are likely to give you. So how do you find the wait and the service time if there are no meters for them? The service time can be determined by metering the response time under a very light load when there are plenty of resources available. Specifically, when:

  • Transactions are coming in slowly with no overlap
  • There have been a few minutes of warm-up transactions
  • The machines are almost idle

Under these conditions, the response time will equal the service time, as the wait time is approximately zero.

ServiceTime + WaitTime = ResponseTime
ServiceTime + 0 = ResponseTime
ServiceTime = ResponseTime          

The wait time can be calculated under any load by simply subtracting the average service time from the average response time.

WaitTime = ResponseTime – ServiceTime
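The two measurements can be combined as a quick sketch. The numbers below are made-up examples, not figures from any real system:

```python
# Deriving wait time from two response time measurements, per the
# formulas above: one taken at idle (= service time), one under load.

def wait_time(avg_response_s: float, avg_service_s: float) -> float:
    """Average time jobs spend queued, not being serviced."""
    return avg_response_s - avg_service_s

idle_response = 0.1  # measured under very light load: the service time
busy_response = 0.5  # measured under normal daytime load
print(wait_time(busy_response, idle_response))  # 0.4 seconds spent waiting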

Performance work is all about time and money. When you’ve found a problem, “How much better will things be when you fix this?” is a very reasonable question for managers to ask. These simple calculations can help you answer it.

Other helpful hints can be found in: The Every Computer Performance Book which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.

The Every Computer Performance Book

This short, occasionally funny, book covers Performance Monitoring, Capacity Planning, Load Testing, Performance Modeling and gives advice on how to get help and present your results effectively.

It works for any application running on any collection of computers you have. It teaches you how to discover more about your meters than the documentation reveals. It only requires the simplest math on your part, yet it allows you to easily use fairly advanced techniques. It is relentlessly practical, buzzword free, and written in a conversational style.

Most of the entries in this blog begin with what I put in the book. The book is available from Amazon in paperback and from Apple in iBook format. Both are priced at ~$9 USD. Why so cheap? Because I retired early (mostly due to my computer performance work) and so I wanted to give back what I learned in the hopes that the next generation can do the same.


A Cry For Help

As the incoming work pushes your computing world beyond its limits, this is the throughput graph that you’ll most likely see. Learn to recognize it as a cry for help.


On an idle system, when work shows up, it gets processed right away and exits. ResponseTime = ServiceTime. So early on, as the workload starts to build for the day, the arrival rate of work equals the throughput. This happy circumstance continues until some part of the transaction path can no longer keep up with the arriving work. Now the following things start happening:

  • Work has to wait because the computing resource is busy.
  • Response time climbs, as it now includes significant wait time.
  • The throughput stops matching the arrival rate of new work.

At some point the throughput stops going up and can actually go down as algorithms are pushed beyond their design limits and become dysfunctional.

When you see the throughput of the system flatten out like this, somewhere in the transaction path a resource is 100% busy. This is a cry for help. Learn to recognize it.
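A minimal model of that saturation (my own sketch, not code from the post): completions track arrivals until the slowest resource saturates, then flatten at 1/ServiceTime.

```python
def throughput(arrival_rate_per_s: float, service_time_s: float) -> float:
    """Completions per second: tracks arrivals until the resource saturates."""
    return min(arrival_rate_per_s, 1.0 / service_time_s)

# A 0.1 s service time caps throughput at 10/s no matter how fast work arrives,
# producing the flat top of the graph described above.
for arrivals in (2, 5, 10, 20, 50):
    print(arrivals, throughput(arrivals, 0.1))
```

Real systems behave a bit worse than this, as the post notes: past saturation the curve can actually bend downward as algorithms become dysfunctional.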

There is more to learn about this and other subjects in my book:
The Every Computer Performance Book


Practical Insights From Queueing Theory

Queueing theory provides a way to predict the average delay when work builds up at a busy device. The calculations are complex, but luckily we can often ignore the math and focus on the key insights this branch of mathematics can bring to things that are busy. First, let’s define a few terms:

Service Center and Service Time

A service center is where the work gets done. To accomplish a given task, it is generally assumed that it takes a service center a fixed amount of time – the service time. In reality this assumption is usually technically false, but still useful.  If work arrives faster than it can be processed a queue builds and the average response time grows because work has to wait in a queue to be serviced.


Queueing Theory

As a service center gets busy, it becomes more likely that a newly arriving job will have to wait because there are jobs ahead of it. An approximate formula that describes this relationship is:

ResponseTime = ServiceTime / (1 – Utilization)

The real insight comes from looking at the graph of this function below, as the utilization goes from 0% to 90%.


Notice that response time starts out as 1x at idle. At idle the response time always equals the service time as there is nothing to wait for.

Notice that the response time doubles when the service center reaches 50% utilization. At this point, sometimes an arriving job finds the service center idle, and sometimes it finds several jobs already waiting, but the effect on the average job is to double the response time as compared to an idle service center. The response time doubles again to 4x when the service center is at 75% utilization, and doubles again to 8x at around 87% utilization. If you keep pushing more work at the service center, the doublings keep getting closer and closer (16x at 94% utilization, 32x at 97% utilization) as the curve turns skyward. All these doublings are created by the fact that the service center is busy, and thus there will often be many jobs waiting ahead of you in the queue.
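Those doublings fall straight out of the approximation; here is a quick check (a sketch of the single-service-center formula only, with all its caveats):

```python
def response_multiplier(utilization: float) -> float:
    """Average response time as a multiple of service time: 1 / (1 - U)."""
    return 1.0 / (1.0 - utilization)

# 1x at idle, 2x at 50% busy, 4x at 75%, 8x at 87.5%
for u in (0.0, 0.50, 0.75, 0.875):
    print(f"{u:.1%} busy -> {response_multiplier(u):.0f}x")
```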

Insight #1:

The slower the service center, the lower the maximum utilization you should plan for at peak load. The slowest computer resource is going to contribute the most to overall transaction response time increases. Unless you have a paper-tape reader in your transaction path, the slowest part of any computer in the early part of the twenty-first century is the rotating, mechanical magnetic disk. At the time of this writing, on an average machine, fetching a 64-bit word from memory was ~50,000x faster than getting it off disk.

The first doubling of response time comes at 50% busy, and that is why conventional wisdom shoots for the spinning magnetic disks to be no more than 50% busy at peak load. Think about it this way: at 50% busy you are doubling the response time of the slowest part of your transaction path – that has got to hurt. If the boss insists that you run the disk up to 90% busy, then the average response time for a disk read is about 10x larger than if the drive were idle. Ouch!

Insight #2:

It’s very hard to use the last 15% of anything. As the service center gets close to 100% utilization the response time will get so bad for the average transaction that nobody will be having any fun. The graph below is exactly the same situation as the previous graph except this graph is plotted to 99% utilization. At 85% utilization the response time is about 7x and it just gets worse from there.


Insight #3:

The closer you are to the edge, the higher the price for being wrong. Imagine your plan called for a peak of 90% CPU utilization on the peak hour of your peak day, but the users didn’t read the plan. They worked the machine 10% harder than anticipated and drove the single CPU to 99% utilization. Your average response time for that service center was planned to be 10x; instead it is 100x. Ouch! This is a key reason that you want to build a safety cushion into any capacity plan.

Insight #4:

Response time increases are limited by the number that can wait. Mathematically, the queuing theory calculations predict that at 100% utilization you will see close to an infinite response time. That is clearly ridiculous in the real world as there are not an infinite number of users to send in work.

The max response time for any service center is limited by the total number of possible incoming requests. If, at worst case, there can only be 20 requests in need of service, then the maximum possible response time is 20x the service time. If you are the only process using a service center, no matter how much work you send it, there will be no queuing-based increase in response time because no one is ever ahead of you in line.
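The bounded-population limit above is a single multiplication. A sketch, with made-up example numbers:

```python
def worst_case_response(service_time_s: float, max_requesters: int) -> float:
    """Worst case: every possible requester is in line and you are last,
    so you wait for everyone ahead of you plus your own service time."""
    return service_time_s * max_requesters

# 20 possible requesters, 0.1 s service time: response can never exceed 2 s,
# no matter what the idealized 1/(1-U) curve predicts near 100% busy.
print(worst_case_response(0.1, 20))  # 2.0
```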

Insight #5:

Remember this is an average, not a maximum. If a single service center is at 75% utilization, then the average response time will be 4x the service time. Now a specific job might arrive when the service center is idle (no wait time) or it might arrive when there are dozens of jobs ahead of it to be processed (huge wait time).

The higher the utilization of the service center the more likely you are to see really ugly wait times and have trouble meeting your service level agreements. This is especially true if your service level agreements are written to specify that no transaction will take longer than X seconds.

Insight #6:

There is a human denial effect in multiple service centers. If there are multiple service centers that can handle the incoming work, then, as you push the utilization higher, the response time stays lower longer. Eventually the curve has to turn and when it does so the turn is sudden and sharp!


This effect makes sense if you think about buying groceries in a MegaMart. If, at checkout time, seven cashiers are busy and three are idle, you go to an idle cashier. Even though the checkout service center is 70% busy overall, your wait time is often zero, and your response time is equal to the service time. Life is good.

If you have a computer with eight available CPUs, the response time will stay close to the service time as the CPU busy climbs to around 90%. At that point the response time curve turns sharply, and at 95% busy the system becomes a world of response time pain. So, for resources with multiple service centers, you can run them hotter than single service center resources, but you have to be prepared to add capacity quickly or suffer horrendous jumps in response time. Most companies are much better at understanding real pain they are experiencing now than future pain they may experience if they don’t spend lots of money now.
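The textbook model behind this multi-server behavior is the M/M/c queue and the Erlang C formula. The post doesn’t give the math, so this is a standard-formula sketch, not the author’s code:

```python
from math import factorial

def mmc_response_multiplier(utilization: float, servers: int) -> float:
    """Average response time / service time for an M/M/c queue (Erlang C)."""
    a = utilization * servers  # offered load, in "servers' worth" of work
    erlang_top = a ** servers / factorial(servers)
    p_wait = erlang_top / (
        erlang_top
        + (1 - utilization) * sum(a ** k / factorial(k) for k in range(servers))
    )  # Erlang C: probability an arriving job has to wait
    return 1 + p_wait / (servers * (1 - utilization))

print(mmc_response_multiplier(0.75, 1))  # one server at 75% busy: the familiar 4x
print(mmc_response_multiplier(0.75, 8))  # eight servers at 75%: much closer to 1x
```

With one server the formula collapses to the 1/(1-U) curve; with eight servers the curve hugs 1x until high utilization, then turns sharply, which is exactly the denial effect described above.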

For More Information:

To begin to explore the mathematical underpinnings of this post, you can start with Little’s Law and queueing theory, but it is a long way down those rabbit holes.

There are more easy performance-related mathematical insights in my book:
The Every Computer Performance Book

A short, occasionally funny, book on how to solve and avoid application and/or computer performance problems

Digging Into Response Time

If you have response time data, there are some really interesting questions you can answer about the total amount of time spent waiting and the theoretical max throughput that can be achieved.

Here we will look at a couple of them.

Response Time

Response time is the total amount of time you waited for something you asked for. Here is an example: you ask Mom for a cookie, and five seconds later the cookie arrives in your hand.

As far as your taste buds were concerned, the response time for your cookie request was five seconds. If Mom had been busy doing other things, then you would have had to wait for her to get your cookie, and that would lengthen the response time.


Utilization

Utilization is the technical term for “busy” and is typically expressed as a decimal fraction with a range between zero and one. A 45% busy resource has a utilization of 0.45. Nothing can be more than 100% busy. No matter how much your boss wants it to be so, there is no 110% to give.

As utilization goes up, the response time also tends to go up – keep reading to find out why. Only a fool would plan for a service center to be 100% busy, as there is no margin for error and the incoming work never arrives at the expected rate.

Service Time

A service center is where the work gets done. CPUs, processes, and disks are examples of service centers. To accomplish a given task, it is generally assumed that it takes a service center a fixed amount of time – the service time. In reality this assumption is usually false, but still very useful. The crafty people who designed your hardware and software typically put a few optimizations in the design. If you could meter every job going through a service center you’d find that the amount of time and effort to accomplish each “identical” job is somewhat variable. Having said this, it is still a useful abstraction to think about each identical task taking an identical amount of time to be serviced at the service center. Just as you don’t require quantum mechanics to predict the flight path of a baseball, you can mostly ignore the individual variations and focus on the big picture.

Averaged over time, a service center can have a utilization from zero to one or, if you prefer, 0% to 100% busy.

You are always interested in the utilization averaged over a short period of time, i.e., seconds or minutes. You are never interested in the instantaneous utilization (it is always 0 or 1) and are rarely interested in the utilization averaged over long periods (hours, days, etc.) like a month because that long an average can hide serious shenanigans and suffering.

You can set the boundaries of a service center anywhere you like. A service center can be a simple process, or the entire computer, or an entire array of computers. For that matter a service center can be an oven. The service time for baking bread = 30 minutes at 350°F. A service center is where work gets done, and you get to define the boundaries.

Arrivals and Throughput

Work arrives at a service center and, when processing is complete, it exits. The work is composed of discrete things to do that might be called transactions, jobs, packets, tasks, or IO’s, depending on the context.


The rate at which tasks arrive at the service center is the arrival rate. The rate at which tasks exit a service center is called the throughput. In performance work, most of the time these values are measured over a period of a second or a minute and occasionally over a longer period of up to an hour.

To stay out of trouble, be sure that you don’t confuse these terms, and keep your units of time straight. Arrivals are not the same as throughput, as anyone who’s ever been stuck in a long airport security line knows. If you accidentally mix “per second” and “per minute” values in some calculation, then badness will ensue. Try not to do that.

Wait Time

Unless you are reading this in a post-apocalyptic world where you are the only survivor, there will be times when tasks arrive at a faster rate than the service center can process them. Any task that arrives while the service center is busy has to wait before it can be serviced. The busier the service center is, the higher the likelihood that new jobs will have to wait.

The upper limit on wait time is controlled by two things: the maximum number of simultaneous arrivals and the service time. If ten tasks arrive simultaneously at an idle service center where the service time is 10 milliseconds, then the first task gets in with zero wait time, and the last task waits for 90 milliseconds. The average wait time for all these tasks is:

45ms = (0+10+20+30+40+50+60+70+80+90) / 10
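The same average falls out of a short sketch of the ten-task example:

```python
# Ten tasks arrive at once at an idle service center (10 ms service time).
# Task i must wait for the i tasks ahead of it to finish.
service_ms = 10
batch = 10
waits = [i * service_ms for i in range(batch)]  # 0, 10, 20, ..., 90 ms
print(sum(waits) / len(waits))  # 45.0 ms, matching the formula above
```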

The overall (or average) response time is what most people care about. It is the average amount of time it takes for a job (a.k.a. request, transaction, etc.) to wait for service plus the service time itself. If the user is geographically separated from the service center then you have to add in transmission time, but we’ll save that for a different post.


Finding Service Time

As you’ll see shortly, the wait and the service time are wildly useful numbers to know, but the response time is the only number that most meters, if they provide that data at all, are likely to give you. So how do you dig out the wait and the service time if there are no meters for them?

The service time can be determined by metering the response time under a very light load when there are plenty of resources available. Specifically, when:

  • Transactions are coming in slowly with no overlap
  • There have been a few minutes of warm-up transactions
  • The machines are almost idle

Under these conditions, the response time will equal the service time, as the wait time is approximately zero.

ServiceTime + WaitTime  =  ResponseTime
ServiceTime + 0  =  ResponseTime
ServiceTime  =  ResponseTime

Finding Wait Time

The wait time can be calculated under any load by simply subtracting the average service time from the average response time. This is a useful calculation to do as it shows you how much better things could be if all the wait time was cleared up. Performance work, at some level, is all about time and money. If you know the wait time, you can show how much time a customer might save if your company spent the money to fix the performance problem(s) you’ve discovered.

Finding the Maximum Throughput

If you know the service time, you can find the maximum throughput because:

     MaxThroughput  ≤  1 / AverageServiceTime

A service center with an average service time of 0.05 seconds has a maximum throughput of: 1 / 0.05 =  20 per second.

CAUTION: With this calculation you have to be a bit careful when you have a broadly defined service center. For example, a Google search I did for the word “cat” returned after 0.25 seconds. This value was reasonably constant when tested very early in the morning on a holiday weekend, so we can assume that the utilization of the Google servers is fairly low. Using the above formula, we can scientifically show that the maximum throughput for Google is four searches per second. Clearly that is not right. So, is this rule wrong? No, it was just used in the wrong place. Google has a massively parallel architecture, so we are not looking at just one service center. Here we got a reasonable average service time, did the calculation, and came up with a max throughput number that made no sense. With all these tools, the most important things you bring to the party are common sense and a skeptical eye.
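One way to adapt the formula to a massively parallel architecture is to multiply by the number of identical service centers. The center count below is pure illustration, not a real Google number:

```python
def max_throughput(avg_service_time_s: float, parallel_centers: int = 1) -> float:
    """Upper bound on completions/second across identical parallel centers."""
    return parallel_centers / avg_service_time_s

print(max_throughput(0.25))           # one center: the nonsensical 4 per second
print(max_throughput(0.25, 100_000))  # a (made-up) large farm: 400,000 per second
```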

For other useful performance insights, and the occasional funny story, please check out: The Every Computer Performance Book which is also available at B&N, Powell’s, and on iTunes.


The Law of The Minimum

Long before computers, two chemists named Sprengel and Liebig were working in the area of agricultural chemistry. Among their many accomplishments, Sprengel pioneered and Liebig popularized the Law of the Minimum, which states that growth is limited by the least available resource. If a plant is starving for nitrogen, then only additional nitrogen will get things growing again. Everything else you give it doesn’t help. Oddly enough the Law of the Minimum applies to computers, too.

When you are out of X, adding anything else but X won’t help at all. Extra resources may help new transactions move through the system quicker, but all that really does is hurry them to the X bottleneck, where they will find themselves at the end of a very long line of transactions waiting their turn. If you screw up and add the wrong resource, the queue of waiting tasks at the least available resource may just get longer and there will be no positive effect on throughput or response time.

In the biological sciences too much of something (water, warmth, etc.) can kill you just as easily as too little.  In computers, if you have too much of some resource there is no negative effect on the overall capacity of the system to get things done. Every so often I run into people who believe the myth that a system has too much of a given resource and that somehow hurts performance. It takes considerable time to talk them out of this belief. Be patient with them. It is the case that when something suddenly and rapidly dumps work into the system (e.g. a system coming back online after a comm failure) then that can cause a performance disruption. But the right fix for that is to put a bit of flow control in so these rare events don’t drown the system in transactions and kill performance.

In business, money is often the most limited resource. If the system has too much of some resource, you are wasting money. The trick is to always have just enough resources in place to handle the peak plus a bit more as a margin of safety. Any fool can solve bottlenecks with limitless money.

The Hidden Bottleneck

When you run out of some resource, that resource becomes a bottleneck. All the transactions race through the system only to find a huge queue because of that resource limitation.

The double-necked hourglass illustration shows a bottleneck at point A. Beyond that bottleneck life is easy for the rest of the system as, no matter how many transactions arrive, the workload is throttled by the upstream bottleneck.


If you “fix” bottleneck A then performance will be really good for only about 45 milliseconds until that great load of transactions hits bottleneck B with a sickening “WHUMP!”  The throughput of this system will hardly change at all, and you will have some explaining to do in the boardroom.

When capacity planning, it is important to explain this drawing to the decision makers so they comprehend how one bottleneck can hide a downstream bottleneck. It is also key for you to meter all the resources you can deplete, not just the one that is the obvious bottleneck.

If your response time is growing, then there is a bottleneck somewhere. If the meters you are looking at (for bottleneck B) are showing lots of capacity, then you are looking in the wrong place. However, it is important to look at them, as they can tell you about your future. If you need this hourglass to do 10x the work it is doing now, and the meters for bottleneck B show that part of the system is 50% busy, then there is no way that part of the system can do 10x the work. Your current problem may lie elsewhere, but you’d better put bottleneck B on your to-do list.
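That hidden-bottleneck check reduces to one multiplication (my sketch, using made-up numbers): if a resource’s current utilization times your growth factor exceeds 100%, that resource cannot carry the load.

```python
def can_handle_growth(current_utilization: float, growth_factor: float) -> bool:
    """True if a resource could absorb growth_factor times today's work."""
    return current_utilization * growth_factor < 1.0

# A resource at 50% busy cannot do 10x the work: 0.5 * 10 = 500% busy.
print(can_handle_growth(0.50, 10))  # False
print(can_handle_growth(0.05, 10))  # True
```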

The hourglass could easily have been drawn with many more bottlenecks, but I’ve never seen a performance problem where there were more than two bottlenecks that had to be cleared up to get the needed throughput. If you are working on the fourth bottleneck for this given problem, then perhaps you should spend some time thinking about a new career – because you are most likely deep in the weeds.

For more information on finding, fixing, and avoiding bottlenecks, as well as capacity planning, I’d suggest you read my book The Every Computer Performance Book

For more details on Liebig’s Law of the Minimum see: