Deconstructing Response Time

The overall response time is what most people care about. It is the average amount of time it takes for a job (a.k.a. request, transaction, etc.) to be processed. The two big contributors to response time (ignoring transmission time for the moment) are the service time (the time it takes to do the work) and the wait time (the time spent waiting for a turn to be serviced). Here is the formula: ResponseTime = WaitTime + ServiceTime


If you know the wait time, you can show how much faster things will flow if your company spends the money to fix the problem(s) you’ve discovered. If you know the service time, then you know the maximum throughput: MaxThroughput ≤ 1 / AverageServiceTime

For example, a key process in your transaction path with an average service time of 0.1 seconds has a maximum throughput of 1 / 0.1 = 10 per second.

Sadly, response time is the only number that most meters are likely to give you. So how do you find the wait and the service time if there are no meters for them? The service time can be determined by metering the response time under a very light load when there are plenty of resources available. Specifically, when:

  • Transactions are coming in slowly with no overlap
  • There have been a few minutes of warm-up transactions
  • The machines are almost idle

Under these conditions, the response time will equal the service time, as the wait time is approximately zero.

ServiceTime + WaitTime = ResponseTime
ServiceTime + 0 = ResponseTime
ServiceTime = ResponseTime          

The wait time can be calculated under any load by simply subtracting the average service time from the average response time.

WaitTime = ResponseTime – ServiceTime
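
Here is a minimal sketch of those formulas in Python (the numbers are hypothetical, invented purely for illustration):

    # Hypothetical measurements -- not from any real system.
    avg_service_time = 0.1   # seconds, metered under a very light load
    avg_response_time = 0.4  # seconds, metered under the load you care about

    # WaitTime = ResponseTime - ServiceTime
    avg_wait_time = avg_response_time - avg_service_time

    # MaxThroughput <= 1 / AverageServiceTime
    max_throughput = 1.0 / avg_service_time

    print(f"wait time:      {avg_wait_time:.2f} seconds")      # 0.30 seconds
    print(f"max throughput: {max_throughput:.0f} per second")  # 10 per second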

Performance work is all about time and money. When you’ve found a problem, a question like “How much better will things be when you fix this?” is a very reasonable thing for managers to ask. These simple calculations can help you answer it.


Other helpful hints can be found in The Every Computer Performance Book, which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


When You Are Close To The Edge

At the Grand Canyon there are many places where you can walk right up to a cliff where, with one more step, you will fall hundreds of feet to your death. The closer you are to the edge of a cliff, the more precisely you need to know your location. In your campsite, a half-mile away, your exact location is not so critical. This is also true in performance work.

If the numbers show a resource will be 20-25% busy at peak, I would not spend more time getting a more precise version of that number. You could be off by a factor of two and the resource would most likely be fine at 40-50% busy. The closer you are to some performance limit, the more careful you have to be with your calculations and predictions.

With any prediction of future behavior there will also be some error, some uncertainty. Some of this is your fault, some of it is the fault of the person who specified the peak load to plan for, and some of it is the fault of the users who didn’t do exactly what was anticipated on that peak day.

When the boss says plan for a peak load that is two times the observed load, do what you are asked. Then, look to see if you are close to “the edge” of some performance cliff. If you are close, go back to the boss, show what you’ve found, and ask: “How sure are you about your predicted peak load?”

I’ve seen many cases where, when shown how close to the edge a system would be at peak, the decision makers change their minds and give a different number to plan for. Sometimes that number is:

  • Bigger because they want to buy new stuff
  • Smaller because they don’t want to spend money
  • Bigger to protect the budget for next year
  • Smaller because they just got new growth projections
  • Different than the last number because of the crisis they are dealing with at the moment you happened to ask

Your job is to advise, not to decide. Present your data, give your best advice, and be at peace. A business decision weighs costs, risks, politics, and the art of what is possible.


This sound advice came from The Every Computer Performance Book, which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

How To Become A Performance Guru

Performance work is a great career because everything changes over time, and with each change comes new performance challenges. There are always things to do and things to learn. Good performance work can save the company and put your kids through college. Yay!

Bad News… The Path Is Not Easy

This is a hard skill to learn as the knowledge required is diffused throughout many different sources. Let me explain…

First, performance books… Some are built on very difficult math that most people can’t do and most problems don’t require. They unnecessarily discourage many people. Many books focus on a specific product version, but you don’t have that version in your computing world. There is often no performance book for a key part of your transaction path.

Turning to manuals… Almost all manuals focus on a specific version of a technology and were written under tremendous time pressure at about the same time the engineering was being completed; thus the engineers had little time to talk to the writers. The manuals ship with the product. The result is that these books document, but they don’t illuminate. They explain the what, but not the why. They cover the surface, but don’t show the deep connections.

Should you accept the bad news, stop reading here and give up?  I don’t think so. There is hope. Hear me out…

First Of All, Don’t Worry About The Math

For 99% of the performance work out there you don’t need to use complex performance math equations.  The most complex formula I used in 25+ years of performance work is the one that approximately predicts how the response time will change as the utilization of a resource increases:
                    R = S / (1 – U)
If you can replace S with the number 2 and U with the number 0.5 and calculate that R is equal to (spoiler alert) 4, then you have all the math you need for a long career in performance.
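
In code form (Python is my choice here, not the book’s), that sanity check looks like this:

    def response_time(service_time, utilization):
        # R = S / (1 - U): approximate response time at one busy service center
        return service_time / (1.0 - utilization)

    print(response_time(2, 0.5))  # prints 4.0 -- the spoiler from above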

Mine Low Grade Information for Gold…


The UltraBogus 3000 features fully-puffed marketing literature, a backwards-compatible front door, and a Stooge-enabled, three-idiot architecture that processes transactions with a minimum of efficiency. Its two-bit bus runs conveniently between your office and the repair facility every Tuesday. The steering wheel was added because the marketing VP thought it needed more chrome.

RTFM (Read The ‘Fine’ Manual)

If your company just bought an UltraBogus 3000 (see the marketing blurb above) to handle your peak load, then read the manuals cover-to-cover. You’ll be surprised at what you find.

Sometimes what you find is a limit that is better discovered now than when you blindly hit it at the seasonal peak. Sometimes it is a question you never thought to ask. Sometimes it is a way to make your job vastly easier – even the worst product has some good features. You have to mine a ton of ore to find an ounce of gold.

You’ll (hopefully) be doing this job for years, so take 15 minutes a day and chew your way through the manuals.

Lastly, reading the manuals teaches you the vocabulary you need to use when you call tech support. If you want to talk to their wizards, first you need to convince the people who initially take the call that you’re not an idiot.

Read Performance Books

I wrote one that I think is generally useful and there are many others that will illuminate particular problems and show different ways of solving them.

They all have their strengths and weaknesses, but there is good stuff to be found there. Especially if the company is buying, try reading the ones with scary looking equations. Push yourself into unfamiliar territory. Even if you can’t understand it, having it on your bookshelf will intimidate your enemies. 😉

If you want to be a performance guru then be all you can be. Read.

Search and Connect

Search engines are your friend. If you have a problem with X-technology, then it is highly likely that someone else has too. Ask simple questions and see what comes up. A lot of it is low-grade information, but sometimes you find just the hint you need.

LinkedIn has groups that are focused on every conceivable technology. Join a few and see if you can find a rich vein of information. There are also CMG and performance-focused websites like PracticalPerformanceAnalyst or PerfBytes to explore.

Now comes the tough part…

After you explore the sources above, there are still many things of great importance you won’t know. Performance work is in many ways a skill you teach yourself with the help of others. You have to dive in, like an explorer on a new planet, and try to make sense of the computing world you stand upon.

I’m often asked: Where do I begin? My answer is to pick a small performance-related thing that interests you and explore it deeply. As you explore, you’ll find other mysteries. Don’t worry about them, just put them on the list. Once you master the first thing, go for the next thing on the list. Over time you’ll have more and more helpful things to contribute, and your job will mutate into a performance job. Most performance people start as something else (like a programmer or a sys admin) and slowly move into the performance field. You don’t have to know it all on day one. Actually, you never know it all, and that is what makes the work interesting.


This blog is based on: The Every Computer Performance Book which is available at Amazon, Powell’s Books, and on iTunes.


 

 

When You Care Enough To Do Less

It’s the oldest joke in the book…

      Patient: When I do this it hurts.

      Doctor: Well don’t do that.

Sometimes performance work is not about adding hardware or tuning applications, it’s about doing less and doing it smarter. Send a kilobyte not a megabyte, don’t lock all the records when you don’t need to, etc.

For example, what you put in the files served by your website has a huge impact on performance that no amount of server-side hardware can overcome, because you don’t control all the computers and networks between you and the end user. Many times the only way to fix website response time problems is to send less stuff in a smarter way.

I recently ran across Zoompf.com, which has a nice tool to analyze your website and make helpful recommendations to speed it up. They do a good job of explaining why the changes they recommend are important, and they provide helpful references to more information about each recommendation.

To avoid mistakes you haven’t made yet, you might also want to read a wonderful little book called High Performance Web Sites by Steve Souders. It points out a lot of small changes that can make a big difference in website performance.

Most companies prefer to throw hardware at performance problems, rather than adjust applications, algorithms, or outputs, because hardware is seen as the low-risk path. Sometimes that works, but sometimes the right thing to recommend is: “don’t do that.”


After you read Steve’s book, try mine: The Every Computer Performance Book at Amazon, Powell’s Books, and on iTunes.


A Career Built On Kindness

VERY few people go to college specifically with the goal of becoming a performance guru.

They start out holding some other job (like sys admin or programmer) and then, by circumstance or desire, slowly move into the performance world.

They learn some performance fundamentals, master a performance tool, notice patterns in the metered data, and slowly pick up the detailed tech-specific knowledge. They take some chances, make some performance predictions and suggestions and… Voilà, they become a performance guru!

If you are starting your journey, welcome. Keep reading this blog, because I am writing it just for you. Learn, play, explore, and grow.

My best fundamental bit of advice is to be relentlessly kind to those around you. You need their help and cannot do this on your own. Once you know something useful, share it. When you can be helpful to others, do it. Be easy to work with. Be kind.

These acts of kindness will help others, but they also help you. First of all, in kindness we find our freedom. Your boss can order you to do many things, but you decide to be kind, and in that decision is a freedom that feels very good. Kindness can also be the source of new friendships. Friends will often give you more help than they are required to give because they have seen you do the same. Friends who move on to other companies often call you up and give you the inside track on great job opportunities.

In my high-tech career every single job I ever had came to me through a friend. I was offered these jobs even though I was often missing a key skill-set. This is not because I’m super-smart or beautiful; it’s because I am kind, helpful, and easy to work with.

Be kind. It will serve you well, and make the world a much nicer place.


Interactive Computer Latency Numbers Through Time

Go here: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Grab the slider at the top of the screen and see how latency values for common computer tasks have changed starting in 1991 and projected out to 2020. To me, the precise values aren’t as interesting as seeing how the performance battles programmers fight change over time.

Sherman, set the wayback machine to…

The Every Computer Performance Book

This short, occasionally funny, book covers Performance Monitoring, Capacity Planning, Load Testing, Performance Modeling and gives advice on how to get help and present your results effectively.

It works for any application running on any collection of computers you have. It teaches you how to discover more about your meters than the documentation reveals. It only requires the simplest math on your part, yet it allows you to easily use fairly advanced techniques. It is relentlessly practical, buzzword free, and written in a conversational style.

Most of the entries in this blog begin with what I put in the book. The book is available from Amazon in paperback and from Apple in iBook format. Both are priced at ~$9 USD. Why so cheap? Because I retired early (mostly due to my computer performance work) and so I wanted to give back what I learned in the hopes that the next generation can do the same.

 

How To Have Confidence In A Small Sample

It is often the case that we just sample the response times of a few transactions rather than metering all of them. When sampling, how do you know you’ve sampled enough to get an average response time that is representative of all the transactions?

If you make some change to the system, and the average response time falls from 10 seconds to 0.2 seconds, it doesn’t take a rocket scientist to know that is a real improvement. However, if the before and after numbers are reasonably close, it’s not as clear that the change was an improvement. We could have just gotten lucky in our sampling. So, how can we know anything without all the data? Think about a bowl of jellybeans for a minute.


Imagine you blindly and randomly select and eat two jellybeans from that bowl. You find one is orange and one is strawberry. You could at this point state that the bowl contains 50% orange and 50% strawberry jellybeans, but you wouldn’t be too confident about it. If the next ten randomly selected jellybeans confirmed the 50/50 ratio then your confidence would grow. However, to be absolutely certain of this ratio, you’d have to eat all the jellybeans in the bowl.

The same is true for any sampled data. The more sampled transactions you have, the more confident you are of your result. To be absolutely sure, you have to measure every transaction. But, how many samples is enough so you can be reasonably sure? For that we are going to have to use statistics. Please don’t panic. We are going to use a couple of simple Excel functions to do the math. Let’s work through an example.

Suppose you are comparing 10 samples of response time data before and 10 samples after an upgrade to see if things are better or worse. Before the upgrade the average response time of 10 transactions was 4.5 seconds and after it was 4.1 seconds. To be sure a small difference is a real difference, you need to calculate the confidence interval. This is a four-step process:

  1. Download/copy the individual samples into a column of an Excel spreadsheet. For this example there are ten of them, starting at cell A1 and going through A10.
  2. Use the AVERAGE function to find the average value (arithmetic mean) of all the samples. This function takes one argument, which is a range of cells containing the response times. For this example AVERAGE(A1:A10) equals 4.5.
  3. Use the STDEV function to find the standard deviation of all of the samples. This function takes one argument, which is a range of cells containing the response times. For this example, STDEV(A1:A10).
  4. Use the CONFIDENCE.NORM function to find the confidence interval. This function takes three arguments:
    • Alpha – This is a number between zero and one that tells the function how confident we want to be. The confidence level equals one minus the Alpha. In other words, an Alpha of 0.05 asks for a 95 percent confidence level, which is what we want here.
    • StandardDeviation – The value returned by the STDEV function in step 3.
    • Size – This is the count of individual test results in our sample. In this example the count is 10.

The CONFIDENCE.NORM function returns a number: 0.51. This tells us that we can be 95% confident that the average response time of all transactions during the studied interval before the upgrade (not just the ones we sampled) is 4.50 seconds ± 0.51 seconds.  In other words, we are 95% confident the average pre-upgrade response time is between 3.99 and 5.01 seconds.
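
If you would rather not use Excel, the same four steps fit in a few lines of Python using only the standard library. The samples below are invented, chosen so the result lands on the worked example above:

    from math import sqrt
    from statistics import NormalDist, mean, stdev

    # Ten hypothetical pre-upgrade response times (seconds), invented so the
    # mean is 4.50 and the confidence interval comes out near 0.51.
    samples = [3.2, 5.6, 4.1, 5.3, 3.8, 4.9, 5.5, 3.6, 4.4, 4.6]

    alpha = 0.05                             # 1 - alpha = 95% confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value, ~1.96

    # Excel equivalents: AVERAGE, STDEV, and
    # CONFIDENCE.NORM(alpha, stdev, n) = z * stdev / sqrt(n)
    interval = z * stdev(samples) / sqrt(len(samples))

    print(f"{mean(samples):.2f} +/- {interval:.2f} seconds at 95% confidence")
    # -> 4.50 +/- 0.51 seconds at 95% confidence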

Now, let’s say we calculated the confidence interval for the after-the-upgrade data, and the calculations showed we are 95% confident that the actual average response time of all transactions during the studied interval (not just the ones we sampled) is 4.10 seconds ± 0.49 seconds.

So what does this all mean? If the confidence intervals overlap, there is no statistically significant improvement. As you can see below, they clearly overlap and, even though the after-the-upgrade response time numbers look better, statistics can offer no guarantee of any real improvement. The upgrade might have helped, but you can’t prove it with the data you have to the level of confidence (95%) you want.

[Graph: before and after average response times with overlapping 95% confidence intervals]

This is the same calculation pollsters do when they randomly call ~1000 people and, from that small sample, predict how the nation will vote. When these polls are talked about, they rarely quote the Alpha or the confidence interval. If they did, the lead story of some future newscast might be:

The latest polls are 95% confident that candidate X is polling at 53% and candidate Y is at 48%. The margin of error is ± 5 points so there is no statistically measurable difference and thus we really have no idea who is winning.

Now you might want to be absolutely 100% sure you are seeing an improvement. Statistics can’t help you here because, to be 100% confident, you need to have response time data from ALL the transactions, not just a sample of them. If you have 100% of the data, you don’t need statistics because you have 100% of the data. For most cases, a confidence level of 95% or 98% will do nicely.

 

A short, occasionally funny, book on how to solve and avoid application and/or computer performance problems

This is all I know about statistics, but you’ll find a lot more about doing computer performance work in my book, which you can find on Amazon and iTunes. There are chapters in the book on:
  – Useful Laws & Things I’ve Found To Be True
  – Performance Monitoring
  – Capacity Planning
  – Load Testing
  – Modeling
  – Presenting Your Results

Practical Insights From Queueing Theory

Queuing theory provides a way to predict the average delay when work builds up at a busy device. The calculations are complex, but luckily we can often ignore the math and focus on the key insights this branch of mathematics can bring to things that are busy. First, let’s define a few terms:

Service Center and Service Time

A service center is where the work gets done. To accomplish a given task, it is generally assumed that a service center takes a fixed amount of time – the service time. In reality this assumption is usually technically false, but it is still useful. If work arrives faster than it can be processed, a queue builds and the average response time grows, because work has to wait in the queue before being serviced.

[Diagram: jobs arriving, waiting in a queue, and being processed at a service center]

Queuing Theory

As a service center gets busy, it becomes more likely that a newly arriving job will have to wait because there are jobs ahead of it. An approximate formula that describes this relationship is: ResponseTime = ServiceTime / (1 – Utilization)

The real insight comes from looking at the graph of this function below, as the utilization goes from 0% to 90%.

[Graph: response time as a multiple of service time vs. utilization, from 0% to 90% busy]

Notice that response time starts out as 1x at idle. At idle the response time always equals the service time as there is nothing to wait for.

Notice that the response time doubles when the service center gets to 50% utilization. At this point, sometimes an arriving job finds the service center idle, and sometimes it finds several jobs already waiting, but the effect on the average job is to double the response time as compared to an idle service center. The response time doubles to 4x when the service center is at 75% utilization and doubles again to 8x at around 87% utilization. Assuming you kept pushing more work at the service center, the doublings keep coming closer and closer together (16x at 94% utilization, 32x at 97% utilization) as the curve turns skyward. All these doublings are created by the fact that the service center is busy, and thus there will often be many jobs waiting ahead of you in the queue.
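
You don’t have to take the doublings on faith. This little Python loop (mine, not from the book) walks the R = S / (1 – U) curve and prints where each doubling lands:

    # Each doubling of response time happens at U = 1 - 1/2**k.
    for k in range(1, 6):
        u = 1 - 1 / 2 ** k
        multiplier = 1 / (1 - u)  # R/S = 1 / (1 - U)
        print(f"{u:6.1%} busy -> {multiplier:4.0f}x the service time")
    # 50.0% -> 2x, 75.0% -> 4x, 87.5% -> 8x, 93.8% -> 16x, 96.9% -> 32x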

Insight #1:

The slower the service center, the lower the maximum utilization you should plan for at peak load. The slowest computer resource is going to contribute the most to overall transaction response time increases. Unless you have a paper-tape reader as part of your transaction path, the slowest part of any computer in the early part of the twenty-first century is the rotating, mechanical magnetic disk. At the time of this writing, on an average machine, fetching a 64-bit word from memory was ~50,000x faster than getting it off disk.

The first doubling of response time comes at 50% busy, and that is why conventional wisdom shoots for spinning magnetic disks to be no more than 50% busy at peak load. Think about it this way: at 50% busy you are doubling the response time of the slowest part of your transaction path – that has got to hurt. If the boss insists that you run the disk up to 90% busy, then the average response time for a disk read is about 10x larger than if the drive were idle. Ouch!

Insight #2:

It’s very hard to use the last 15% of anything. As the service center gets close to 100% utilization the response time will get so bad for the average transaction that nobody will be having any fun. The graph below is exactly the same situation as the previous graph except this graph is plotted to 99% utilization. At 85% utilization the response time is about 7x and it just gets worse from there.

[Graph: the same response time curve plotted out to 99% utilization]

Insight #3:

The closer you are to the edge, the higher the price for being wrong. Imagine your plan called for 90% CPU utilization during the peak hour of your peak day, but the users didn’t read the plan. They worked the machine 10% harder than anticipated and drove the single CPU to 99% utilization. Your average response time for that service center was planned to be 10x; instead it is 100x. Ouch! This is a key reason you want to build a safety cushion into any capacity plan.

Insight #4:

Response time increases are limited by the number of requests that can wait. Mathematically, the queuing theory calculations predict that at 100% utilization you will see close to an infinite response time. That is clearly ridiculous in the real world, as there is not an infinite number of users to send in work.

The max response time for any service center is limited by the total number of possible incoming requests. If, at worst case, there can only be 20 requests in need of service, then the maximum possible response time is 20x the service time. If you are the only process using a service center, no matter how much work you send it, there will be no queuing-based increase in response time because no one is ever ahead of you in line.

Insight #5:

Remember this is an average, not a maximum. If a single service center is at 75% utilization, then the average response time will be 4x the service time. Now a specific job might arrive when the service center is idle (no wait time) or it might arrive when there are dozens of jobs ahead of it to be processed (huge wait time).

The higher the utilization of the service center the more likely you are to see really ugly wait times and have trouble meeting your service level agreements. This is especially true if your service level agreements are written to specify that no transaction will take longer than X seconds.

Insight #6:

There is a human denial effect in multiple service centers. If there are multiple service centers that can handle the incoming work, then, as you push the utilization higher, the response time stays lower longer. Eventually the curve has to turn, and when it does, the turn is sudden and sharp!

[Graph: response time vs. utilization with multiple service centers – flat until high utilization, then a sharp turn upward]

This effect makes sense if you think about buying groceries at a MegaMart. If at checkout time seven cashiers are busy and three are idle, you go to an idle cashier. Even though the checkout service center is 70% busy overall, your wait time is often zero, and your response time is equal to the service time. Life is good.

If you have a computer with eight available CPUs, the response time will stay close to the service time as the CPU busy climbs to around 90%. At that point the response time curve turns sharply, and by 95% busy the system becomes a world of response time pain. So, for resources with multiple service centers, you can run them hotter than single service center resources, but you have to be prepared to add capacity quickly or suffer horrendous jumps in response time. Most companies are much better at understanding the real pain they are experiencing now than the future pain they may experience if they don’t spend lots of money now.
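
The single service center formula can’t show this knee. To see it numerically you need multi-server queueing math – the M/M/c “Erlang C” formula, which this post doesn’t derive – so treat this Python sketch as an illustration under standard queueing assumptions:

    from math import factorial

    def erlang_c(servers, utilization):
        # Probability that an arriving job must wait (M/M/c queue).
        a = servers * utilization  # offered load in erlangs
        busy_term = (a ** servers / factorial(servers)) / (1 - utilization)
        idle_terms = sum(a ** k / factorial(k) for k in range(servers))
        return busy_term / (idle_terms + busy_term)

    def response_multiplier(servers, utilization):
        # Mean response time as a multiple of the service time.
        return 1 + erlang_c(servers, utilization) / (servers * (1 - utilization))

    for u in (0.50, 0.70, 0.90, 0.95, 0.99):
        print(f"8 CPUs at {u:.0%} busy -> {response_multiplier(8, u):5.2f}x service time")
    # Stays near 1x until ~90% busy, then the curve turns sharply upward.

The exact numbers depend on the arrival and service assumptions, but the shape – flat, flat, flat, cliff – is the point.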

For More Information:

To explore the mathematical underpinnings of this post you can begin here http://en.wikipedia.org/wiki/Little’s_law and here http://en.wikipedia.org/wiki/Queueing_theory, but it is a long way down those rabbit holes.

There are more easy performance-related mathematical insights in my book:
The Every Computer Performance Book

A short, occasionally funny, book on how to solve and avoid application and/or computer performance problems

Why Read This Blog

I’m not making this up.

I was in the big meeting before they let me on the live system that was at the very core of the second largest stock exchange in the United States. Everyone was there including the CIO. The meeting went smoothly and was very professional. When the meeting ended, the room cleared except for me and a powerfully built young man who was the lead system administrator. He got right in my face and in a clearly ominous tone quietly said, “Don’t fuck up the computer!”

On another day, at another business, the CEO asked me into his office and quietly told me: “If you do not have this problem fixed by the end of the week, I will have to lay everyone off and sell the building.” He was as serious as the grave.

On another day, on a different continent, I discovered the root of a huge problem a credit card company was having. A trivial change in the source code made a key transaction run approximately 200 times faster. The ensuing celebration was epic.

Performance work can make you a hero, it can save the company, and it can get you threatened, as well. However, most of the time it is remarkably ordinary. You gather data. You work to understand what it’s telling you. You present your conclusions. If you do your work right, most of the time, there is no drama at all.

Any average person can do basic performance work. You can read the obvious meters and write nice little reports. If you have an inquisitive mind and the willingness to dig for the hidden truth, then you can go beyond the obvious meters and do great work – the kind of work that saves the company.

I am at the end of my career, but before I walk off the stage I hope in this blog to give back the hints, tricks, knowledge, and wisdom so many have generously given to me.

The stuff I will write about works on any collection of computers, running any application on any operating system. It is the essence of what is true, and what works, in any performance related situation regardless of the technology involved.

I hope that you find this useful.