Three Tools You Should Build

Even though it is a good idea to keep an eye on performance all the time, lots of companies only allow you to pay periodic attention to it. They focus on performance when there is a problem, or before the annual peak, but the rest of the year they give you other tasks to work on.

This is a lot like my old job in Professional Services – a customer has a problem, I fly in, find the trouble, and then don’t see them until the next problem crops up.

To do that job I relied on three tools that I created for myself – tools you might start building to help with your own periodic performance problems.

Three Tools

List All – The first tool would dig through the system and list everything that could be known about it: config options, OS release, I/O, network, number of processes, what files were open, etc. The output was useful by itself, as now I had looked in every corner of the system and knew what I was working on. Several times it saved me days of work because the customer had initially logged me into the wrong system. It always made my work easier because I had all the data I needed in one place, conveniently organized, and in a familiar order.
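
If you’re starting from scratch, a sketch like this shows the shape of such a tool. It’s Python, it assumes a Unix-like system, and every command in the list is just an example to be extended:

    #!/usr/bin/env python3
    # list_all.py - dump everything knowable about this system as plain text.
    import datetime, platform, socket, subprocess

    COMMANDS = [                       # (label, shell command) pairs; grow this list
        ("OS release",     "uname -a"),
        ("Uptime/load",    "uptime"),
        ("Disk space",     "df -h"),
        ("Network config", "ifconfig -a"),   # or "ip addr" on modern Linux
        ("Process list",   "ps aux"),
    ]

    def section(title, body):
        print(f"===== {title} =====\n{body.rstrip()}\n")

    section("Identity", f"{socket.gethostname()} at {datetime.datetime.now()}")
    section("Platform", platform.platform())
    for label, cmd in COMMANDS:
        try:
            out = subprocess.run(cmd, shell=True, capture_output=True,
                                 text=True, timeout=30).stdout
        except Exception as e:
            out = f"(failed: {e})"
        section(label, out)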

Changes – If I’d been to this customer before, this tool allowed me to compare the state of the system with its previous state. It just read through the output of the List All I’d just done and compared it with the data I collected on my last visit. Boy, was this useful, as I could quickly check the customer’s assurance that “nothing had changed since my last visit.” I remember the shocked look on the customer’s face when I asked: “Why did you downgrade memory?”
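
The comparison can be as simple as diffing two saved List All outputs. A minimal sketch, assuming each run was saved to a dated text file:

    #!/usr/bin/env python3
    # changes.py - show what changed between two List All snapshots.
    import difflib, sys

    old, new = sys.argv[1], sys.argv[2]    # e.g. snap-january.txt snap-june.txt
    with open(old) as f:
        before = f.readlines()
    with open(new) as f:
        after = f.readlines()

    # unified_diff emits only the lines that differ, with a little context
    sys.stdout.writelines(difflib.unified_diff(before, after,
                                               fromfile=old, tofile=new))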

Odd Things – Most performance-limiting, or availability-threatening, behavior is easy to spot. But for any OS, and any application, there are some things that can really hurt performance that you have to dig for in odd places with obscure meters. These are a pain to look for and are rare, so nobody looks for them. Through the years, as I discovered each odd thing, I would write a little tool to help me detect the problem, and then I’d add that tool to the end of my odd-things tool. I’d run this tool on every customer system I looked at and, on occasion, I would find something that surprised everyone (“You haven’t backed up this system in over a year.”) or solved a performance problem by noticing a foolish, less-than-optimal configuration choice.
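
Structurally, an odd-things tool can be nothing more than a list of check functions that grows over the years. A sketch, with two hypothetical checks:

    #!/usr/bin/env python3
    # odd_things.py - run every obscure check collected over the years.
    import os, time

    def stale_backup():
        """Warn if the backup log (hypothetical path) is over a year old."""
        path = "/var/log/backup.log"         # adjust for your site
        if not os.path.exists(path):
            return f"no backup log at {path}"
        days = (time.time() - os.path.getmtime(path)) / 86400
        if days > 365:
            return f"last backup activity was {days:.0f} days ago"

    def leftover_cores():
        """Warn about core dumps quietly eating disk in /tmp."""
        cores = [f for f in os.listdir("/tmp") if f.startswith("core")]
        if cores:
            return f"{len(cores)} core file(s) in /tmp"

    CHECKS = [stale_backup, leftover_cores]  # append each new discovery here

    for check in CHECKS:
        warning = check()
        if warning:
            print("ODD:", warning)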

With most everything happening on servers somewhere in the net/cloud these days, knowing exactly where you are and what you’ve got to work with is important. Being able to gather that data in a matter of minutes allows you to focus on the problem at hand, confident that you’ve done a thorough job.

All three of these tools were built slowly over time. Get started with a few simple things. The output of all three is just text – no fancy GUI or pretty plots required. When you have time, write the code to gather the next most useful bit of information and add that to your tool.

Just like the old folk story of stone soup, your tools will get built over time with the contributions of others. Remember to thank each contributor for the gifts they give you and share what you have freely with others.


Other useful hints can be found in The Every Computer Performance Book, which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

 

The Five Minute Rule

I used to travel to companies to work on their performance problems. Once I arrived, and we had gone through the initial pleasantries, I would ask them a simple question:

How busy is your system right now?

If the person I asked had a ballpark estimate that they quickly confirmed with a meter, I’d know that whatever problem they called me to solve would be an obscure one that would take some digging to find. If they had no idea, or only knew the answer to a resolution of an entire day, then I was pretty sure that there would be plenty of performance-related ugliness to discover. This little question became the basis for a rule of thumb that I used my entire career.

The less a company knows about the work their system did in the last five minutes, the more deeply screwed up they are.

Your job is likely different from mine, but there is a general truth here. If the staff can easily access data that tells them how much work flowed through the system in the last few minutes (transactions, page views, etc.) and how the system is reacting to that work (key utilization numbers, queue depths, error counts, etc.), then they have the data to isolate and solve most common performance problems. Awareness is curative. The company will solve more of its own performance problems, see future performance problems coming sooner, and spend less money on outside performance experts.
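
As a sketch of what “knowing the last five minutes” might look like, here’s a small script that summarizes recent transactions from a log. The file name and line format are assumptions you’d adapt to your shop:

    #!/usr/bin/env python3
    # last5.py - how busy has the system been in the last five minutes?
    # Assumes one transaction per line: "<ISO timestamp> <URL> <response seconds>"
    from datetime import datetime, timedelta

    cutoff = datetime.now() - timedelta(minutes=5)
    count, total_rt = 0, 0.0
    with open("transactions.log") as f:       # hypothetical log file
        for line in f:
            stamp, _, resp = line.split()
            if datetime.fromisoformat(stamp) >= cutoff:
                count += 1
                total_rt += float(resp)

    avg = total_rt / count if count else 0.0
    print(f"{count} transactions in 5 minutes ({count / 300:.1f}/sec), "
          f"avg response {avg:.3f}s")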

 


Other rules of thumb can be found in The Every Computer Performance Book, which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

Deconstructing Response Time

The overall response time is what most people care about. It is the average amount of time it takes for a job (a.k.a. request, transaction, etc.) to get processed. The two big contributors to response time (ignoring transmission time for the moment) are the service time – the time to do the work – and the wait time – the time you waited for your turn to be serviced. Here is the formula:

ResponseTime = WaitTime + ServiceTime


If you know the wait time, you can show how much faster things will flow if your company spends the money to fix the problem(s) you’ve discovered. If you know the service time, then you know the maximum throughput, because:

MaxThroughput ≤ 1 / AverageServiceTime

For example: A key process in your transaction path with an average service time of 0.1 seconds has a maximum throughput of: 1 / 0.1 = 10 per second.

Sadly, response time is the only number that most meters are likely to give you. So how do you find the wait and the service time if there are no meters for them? The service time can be determined by metering the response time under a very light load when there are plenty of resources available. Specifically, when:

  • Transactions are coming in slowly with no overlap
  • There have been a few minutes of warm-up transactions
  • The machines are almost idle

Under these conditions, the response time will equal the service time, as the wait time is approximately zero.

ServiceTime + WaitTime = ResponseTime
ServiceTime + 0 = ResponseTime
ServiceTime = ResponseTime          

The wait time can be calculated under any load by simply subtracting the average service time from the average response time.

WaitTime = ResponseTime – ServiceTime

Performance work is all about time and money. When you’ve found a problem, a question like “How much better will things be when you fix this?” is a very reasonable thing for managers to ask. These simple calculations can help you answer it.
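
Here is that arithmetic as a few lines of Python, with invented numbers standing in for your meters:

    # response_time.py - split response time into wait and service components.
    # The numbers below are invented for illustration; plug in your own meters.

    service_time = 0.1     # seconds, measured under a nearly idle system
    response_time = 0.35   # seconds, average under the current load

    wait_time = response_time - service_time   # WaitTime = ResponseTime - ServiceTime
    max_throughput = 1 / service_time          # MaxThroughput <= 1 / AvgServiceTime

    print(f"Wait time:      {wait_time:.2f}s of every {response_time:.2f}s response")
    print(f"Max throughput: {max_throughput:.0f} transactions/second")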


Other helpful hints can be found in The Every Computer Performance Book, which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


How To Become A Performance Guru

Performance work is a great career, as everything changes over time and with each change comes new performance challenges. There are always things to do and things to learn. Good performance work can save the company and put your kids through college. Yay!

Bad News… The Path Is Not Easy

This is a hard skill to learn as the knowledge required is diffused throughout many different sources. Let me explain…

First, performance books… Some are built on very difficult math that most people can’t do and most problems don’t require; they unnecessarily discourage many people. Many books focus on a specific product version, but you don’t have that version in your computing world. And often there is no performance book at all for a key part of your transaction path.

Turning to manuals… Almost all manuals focus on a specific version of a technology and were written under tremendous time pressure at about the same time the engineering was being completed; thus the engineers had little time to talk to the writers. The manuals ship with the product. The result is that these books document, but they don’t illuminate. They explain the what, but not the why. They cover the surface, but don’t show the deep connections.

Should you accept the bad news, stop reading here and give up?  I don’t think so. There is hope. Hear me out…

First Of All, Don’t Worry About The Math

For 99% of the performance work out there you don’t need to use complex performance math equations.  The most complex formula I used in 25+ years of performance work is the one that approximately predicts how the response time will change as the utilization of a resource increases:
                    R = S / (1 – U)
If you can replace S with the number 2 and U with the number 0.5 and calculate that R is equal to (spoiler alert) 4, then you have all the math you need for a long career in performance.
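
If you want to see why that formula earns its keep, a few lines of Python (using the S = 2 from the example above) tabulate how R explodes as U climbs:

    # rtime.py - response time R = S / (1 - U) as utilization U grows.
    S = 2.0                                   # service time, seconds
    for U in (0.0, 0.25, 0.50, 0.75, 0.90, 0.95):
        R = S / (1 - U)
        print(f"U = {U:4.0%}   R = {R:6.1f}s")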

Mine Low Grade Information for Gold…


The UltraBogus 3000 features fully-puffed marketing literature, a backwards-compatible front door, and a Stooge-enabled, three-idiot architecture that processes transactions with a minimum of efficiency. Its two-bit bus runs conveniently between your office and the repair facility every Tuesday. The steering wheel was added because the marketing VP thought it needed more chrome.

RTFM (Read The ‘Fine’ Manual)

If your company just bought an UltraBogus 3000 (see the marketing blurb above) to handle your peak load, then read the manuals cover-to-cover. You’ll be surprised what you find.

Sometimes what you find is a limit that is better discovered now than when you blindly hit it at the seasonal peak. Sometimes it is a question you never thought to ask. Sometimes it is a way to make your job vastly easier – even the worst product has some good features. You have to mine a ton of ore to find an ounce of gold.

You’ll (hopefully) be doing this job for years, so take 15 minutes a day and chew your way through the manuals.

Lastly, reading the manuals teaches you the vocabulary you need to use when you call tech support. If you want to talk to their wizards, first you need to convince the people who initially take the call that you’re not an idiot.

Read Performance Books

I wrote one that I think is generally useful and there are many others that will illuminate particular problems and show different ways of solving them.

They all have their strengths and weaknesses, but there is good stuff to be found there. Especially if the company is buying, try reading the ones with scary looking equations. Push yourself into unfamiliar territory. Even if you can’t understand it, having it on your bookshelf will intimidate your enemies. 😉

If you want to be a performance guru then be all you can be. Read.

Search and Connect

Search engines are your friend. If you have a problem with X-technology, then it is highly likely that someone else has too. Ask simple questions and see what comes up. A lot of it is low-grade information, but sometimes you find just the hint you need.

LinkedIn has groups that are focused on every conceivable technology. Join a few and see if you can find a rich vein of information. There are also CMG and performance-focused websites like PracticalPerformanceAnalyst or PerfBytes to explore.

Now comes the tough part…

After you explore the sources above, there are still many things of great importance you won’t know. Performance work is in many ways a skill you teach yourself with the help of others. You have to dive in, like an explorer on a new planet, and try to make sense of the computing world you stand upon.

I’m often asked: Where do I begin? My answer is to pick a small performance-related thing that interests you and deeply explore it. As you explore, you’ll find other mysteries. Don’t worry about them; just put them on the list. Once you master the first thing, go for the next thing on the list.

Over time you’ll have more and more helpful things to contribute, and your job will mutate into a performance job. Most performance people start as something else (like a programmer or a sys admin) and slowly move into the performance field. You don’t have to know it all on day one. Actually, you never know it all, and that is what makes the work interesting.


This blog is based on: The Every Computer Performance Book, which is available at Amazon, Powell’s Books, and on iTunes.


 

 

When You Care Enough To Do Less

It’s the oldest joke in the book…

      Patient: When I do this it hurts.

      Doctor: Well don’t do that.

Sometimes performance work is not about adding hardware or tuning applications; it’s about doing less and doing it smarter. Send a kilobyte, not a megabyte; don’t lock all the records when you don’t need to; etc.

For example, what you put in the files served by your website has a huge impact on performance that no amount of server-side hardware can overcome, because you don’t control all the computers and networks between you and the end user. Many times the only way to fix website response time problems is to send less stuff in a smarter way.

I recently ran across Zoompf.com, which has a nice tool to analyze your website and make helpful recommendations to speed it up. They do a good job of explaining why the changes they recommend are important, and they provide helpful references to more information about each recommendation.

To avoid mistakes you haven’t made yet, you might also want to read a wonderful little book called High Performance Web Sites by Steve Souders. It points out a lot of small changes that can make a big difference in website performance.

Mostly, companies prefer to throw hardware at performance problems, rather than adjust applications, algorithms, or outputs, because hardware is seen as the low-risk path. Sometimes that works, but sometimes the right thing to recommend is: “don’t do that.”


After you read Steve’s book, try mine: The Every Computer Performance Book at Amazon, Powell’s Books, and on iTunes.


Standards

The brilliant comic XKCD perfectly describes my feelings about standards in performance work in its famous “Standards” cartoon.

There are many paths to the top of the performance mountain. If whatever you believe is the right way to do performance work gets you to the top, then go with that.


But, while climbing the mountain, keep your eyes open and notice what the other climbers are doing. If you see some cool climbing trick, learn it. Whenever possible, share what you know and help those who are struggling. If you are struggling, that is often nature’s way of telling you to open your mind to other options, opinions, pathways, and climbing techniques.

It is not about the dogma; it’s about getting the work done and keeping the users happy.

 

Correlation Is Not Causality


When gathering performance data, it is important to note that correlation does not prove causality. A strong correlation shows that the compared values move up and down together, but it does not prove (or disprove) a cause-and-effect relationship. If two performance meters (X and Y) are correlated, it could be that X causes Y, that Y causes X, or that changes in both X and Y are caused by some external force.

A classic example: the divorce rate in Maine correlates remarkably well with the per capita consumption of margarine in the US. If you can figure out, and clearly show, how these two variables are causally linked, then a Nobel might be in your future.


Noticing correlation between the activity you see in various meters is an interesting clue to many performance questions. It’s a good place to start wondering, but before you make up your mind about a causal link between X and Y, gather more data, and do it under differing circumstances. Figure out the mechanism for how X causes Y. Look for evidence that supports (and/or destroys) your X-causes-Y hypothesis.
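
Here’s a small Python sketch of the trap: X and Y never touch each other, yet they correlate almost perfectly because a hidden third factor drives both:

    # correlation.py - X and Y both follow a hidden trend; neither causes the other.
    import random
    random.seed(1)

    trend = list(range(100))                              # the hidden common cause
    x = [t * 2.0 + random.gauss(0, 5) for t in trend]     # meter X tracks the trend
    y = [t * 0.5 + random.gauss(0, 3) for t in trend]     # meter Y tracks it too

    def pearson(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
        va = sum((ai - ma) ** 2 for ai in a) ** 0.5
        vb = sum((bi - mb) ** 2 for bi in b) ** 0.5
        return cov / (va * vb)

    print(f"correlation(X, Y) = {pearson(x, y):.2f}")     # high, yet no causal link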

 

How To Have Confidence In A Small Sample

It is often the case that we just sample the response times of a few transactions rather than metering all of them. When sampling, how do you know you’ve sampled enough to get an average response time that is representative of all the transactions?

If you make some change to the system, and the average response time falls from 10 seconds to 0.2 seconds, it doesn’t take a rocket scientist to know that is a real improvement. However, if the before and after numbers are reasonably close, it’s not as clear that the change was an improvement. We could have just gotten lucky in our sampling. So, how can we know anything without all the data? Think about a bowl of jellybeans for a minute.


Imagine you blindly and randomly select and eat two jellybeans from that bowl. You find one is orange and one is strawberry. You could at this point state that the bowl contains 50% orange and 50% strawberry jellybeans, but you wouldn’t be too confident about it. If the next ten randomly selected jellybeans confirmed the 50/50 ratio then your confidence would grow. However, to be absolutely certain of this ratio, you’d have to eat all the jellybeans in the bowl.

The same is true for any sampled data. The more sampled transactions you have, the more confident you can be in your result. To be absolutely sure, you have to measure every transaction. But how many samples are enough so you can be reasonably sure? For that we are going to have to use statistics. Please don’t panic. We are going to use a couple of simple Excel functions to do the math. Let’s work through an example.

Suppose you are comparing 10 samples of response time data before and 10 samples after an upgrade to see if things are better or worse. Before the upgrade the average response time of 10 transactions was 4.5 seconds and after it was 4.1 seconds. To be sure a small difference is a real difference, you need to calculate the confidence interval. This is a four-step process:

  1. Download/copy the individual samples into a column of an Excel spreadsheet. For this example there are ten of them, starting at cell A1 and going through A10.
  2. Use the AVERAGE function to find the average value (arithmetic mean) of all the samples. This function takes one argument, which is a range of cells containing the response times. For this example AVERAGE(A1:A10) equals 4.5.
  3. Use the STDEV function to find the standard deviation of all of the samples. This function takes one argument, which is a range of cells containing the response times. For this example STDEV(A1:A10) works out to about 0.82.
  4. Use the CONFIDENCE.NORM function to find the confidence interval. This function takes three arguments:
    • Alpha – This is a number between zero and one that tells the function how confident we want to be. The confidence level equals one minus the Alpha. In other words, an Alpha of 0.05 asks for a 95 percent confidence level, which is what we want here.
    • StandardDeviation – The value returned by the STDEV function in step 3.
    • Size – This is the count of individual test results in our sample. In this example the count is 10.

The CONFIDENCE.NORM function returns a number: 0.51. This tells us that we can be 95% confident that the average response time of all transactions during the studied interval before the upgrade (not just the ones we sampled) is 4.50 seconds ± 0.51 seconds.  In other words, we are 95% confident the average pre-upgrade response time is between 3.99 and 5.01 seconds.
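
If you’d rather script the calculation than use Excel, the same math is a few lines of Python’s standard library. The sample values here are invented so they land on the example’s 4.50 ± 0.51:

    # confidence.py - 95% confidence interval for a small sample of response times.
    from statistics import NormalDist, mean, stdev

    samples = [3.2, 5.6, 4.1, 5.3, 3.9, 4.8, 5.5, 3.6, 4.7, 4.3]  # invented data
    alpha = 0.05                              # 1 - alpha = 95% confidence

    z = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95%
    half_width = z * stdev(samples) / len(samples) ** 0.5  # = CONFIDENCE.NORM(...)

    print(f"mean {mean(samples):.2f}s ± {half_width:.2f}s "
          f"({100 * (1 - alpha):.0f}% confident)")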

Now, let’s say we calculated the confidence interval for the after-the-upgrade data, and the calculations showed we are 95% confident that the actual average response time of all transactions during the studied interval (not just the ones we sampled) is 4.10 seconds ± 0.49 seconds.

So what does this all mean? If the confidence intervals overlap, there is no statistically significant improvement. Here they clearly overlap (3.99 to 5.01 seconds before vs. 3.61 to 4.59 seconds after) and, even though the after-the-upgrade response time numbers look better, statistics can offer no guarantee of any real improvement. The upgrade might have helped, but you can’t prove it with the data you have to the level of confidence (95%) you want.


This is the same calculation pollsters do when they randomly call ~1000 people and, from that small sample, predict how the nation will vote. When these polls are talked about, they rarely quote the alpha or the confidence interval. If they did, the lead story of some future newscast might be:

The latest polls are 95% confident that candidate X is polling at 53% and candidate Y is at 48%. The margin of error is ± 5 points so there is no statistically measurable difference and thus we really have no idea who is winning.

Now you might want to be absolutely 100% sure you are seeing an improvement. Statistics can’t help you here because, to be 100% confident, you need to have response time data from ALL the transactions, not just a sample of them. If you have 100% of the data, you don’t need statistics because you have 100% of the data. For most cases, a confidence level of 95% or 98% will do nicely.

 

A short, occasionally funny, book on how to solve and avoid application and/or computer performance problems

This is all I know about statistics, but you’ll find a lot more about doing computer performance work in my book, which you can find on Amazon and iTunes. There are chapters in the book on:
  – Useful Laws & Things I’ve Found To Be True
  – Performance Monitoring
  – Capacity Planning
  – Load Testing
  – Modeling
  – Presenting Your Results

Confirm What You Are Metering

The only constant in this universe is change. Applications, operating systems, hardware, networks, and configurations can and do change on a regular basis.

It’s easy to start the right meters on the wrong system. It’s easy to miss an upgrade or a configuration change. Before any meter-gathering program settles down into its main metering loop, it should gather some basic data about where it is and what else is there. Gather things like:

  • System name and network address
  • System hardware: CPU, memory, disk, etc.
  • Operating System release and configuration info
  • List of processes running
  • Application configuration info

Most of the time this data is ignored, but when weird things happen, or results suddenly stop making sense, this data can be a valuable set of clues as to what changed.
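
As a sketch, the preamble might record a fingerprint like this before the main loop starts. This is Python using only the standard library; the ps call assumes a Unix-like host:

    #!/usr/bin/env python3
    # whereami.py - record where the meters are running, before metering begins.
    import datetime, json, platform, socket, subprocess

    def fingerprint():
        ps = subprocess.run("ps -e -o comm=", shell=True,
                            capture_output=True, text=True)
        return {
            "timestamp": datetime.datetime.now().isoformat(),
            "hostname":  socket.gethostname(),
            "os":        platform.platform(),
            "machine":   platform.machine(),
            "processes": sorted(set(ps.stdout.split())),
        }

    # Save it next to the metering data so "what changed?" is answerable later.
    with open("fingerprint.json", "w") as f:
        json.dump(fingerprint(), f, indent=2)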