Three Tools You Should Build

Even though it is a good idea to keep an eye on performance all the time, lots of companies only allow you to pay periodic attention to it. They focus on it when there is a problem, or before the annual peak, but for the rest of the year they give you other tasks to work on.

This is a lot like my old job in Professional Services – a customer has a problem, I fly in, find the trouble, and then don’t see them until the next problem crops up.

To do that job I relied on three tools that I created for myself and that you might start building to help you work on periodic performance problems.

Three Tools

List All – The first tool would dig through the system and list all the things that could be known about the system: config options, OS release, IO, network, number of processes, what files were open, etc. The output was useful by itself as now I had looked in every corner of the system and knew what I was working on. Several times it saved me days of work as the customer had initially logged me into the wrong system. It always made my work easier as I had all the data I needed, in one place, conveniently organized, and in a familiar order.
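As a sketch of how a List All tool starts life, here is a minimal version that runs a fixed set of commands and writes their output to one file in one familiar order. The command list is just an illustrative starting point for a Linux-ish system; add whatever matters on the systems you work on.

#!/usr/bin/env python3
"""A minimal sketch of a "List All" tool: run a fixed set of commands and
dump their output in one place, in one familiar order."""
import datetime
import subprocess

COMMANDS = [                 # (label, command) pairs, run in a fixed, familiar order
    ("Hostname",       "hostname"),
    ("OS release",     "uname -a"),
    ("Uptime / load",  "uptime"),
    ("Memory",         "free -m"),
    ("Disk space",     "df -h"),
    ("Network config", "ip addr"),
    ("Process count",  "ps -e | wc -l"),
]

def list_all(outfile="listall-%s.txt" % datetime.date.today()):
    with open(outfile, "w") as out:
        for label, cmd in COMMANDS:
            out.write("===== %s (%s) =====\n" % (label, cmd))
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            out.write(result.stdout or result.stderr)
            out.write("\n")
    return outfile

if __name__ == "__main__":
    print("Wrote", list_all())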

Changes – If I’d been to this customer before, this tool allowed me to compare the state of the system with the previous state. It just read through the output of the List All I’d just done and compared it with the data I collected on my last visit. Boy, was this useful as I could quickly check the customer’s assurance that “Nothing had changed since my last visit.” I remember the shocked look on the customer’s face when I asked: “Why did you downgrade memory?”
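The Changes tool can be little more than a diff of two List All outputs. Here is a minimal sketch; the file names are hypothetical.

#!/usr/bin/env python3
"""A minimal sketch of a "Changes" tool: diff the List All output from this
visit against the one saved from the last visit."""
import difflib
import sys

def show_changes(old_file, new_file):
    with open(old_file) as f:
        old = f.readlines()
    with open(new_file) as f:
        new = f.readlines()
    # unified_diff shows only what changed, with a little context around each change
    for line in difflib.unified_diff(old, new, fromfile=old_file, tofile=new_file, n=1):
        sys.stdout.write(line)

if __name__ == "__main__":
    show_changes("listall-previous-visit.txt", "listall-this-visit.txt")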

Odd Things – Most performance-limiting, or availability-threatening, behavior is easy to spot. But for any OS, and any application, there are some things that can really hurt performance that you have to dig for in odd places with obscure meters. These are a pain to look for and are rare, so nobody looks for them. Through the years, as I discovered each odd thing, I would write a little tool to help me detect the problem and then add that tool to the end of my odd things tool. I’d run this tool on every customer system I looked at and, on occasion, I would find something that surprised everyone (“You haven’t backed up this system in over a year.”) or solved a performance problem by noticing a less-than-optimal configuration choice.
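The Odd Things tool is just a growing list of small checks, each one looking for something rare that once burned you. Here is a minimal sketch with two illustrative checks (a stale backup directory and a nearly full root filesystem); the paths and thresholds are assumptions, not a recommended list.

#!/usr/bin/env python3
"""A minimal sketch of an "Odd Things" tool: a growing list of small checks,
each looking for a rare problem. Append a new check every time an odd thing
surprises you."""
import os
import shutil
import time

def check_old_backup(path="/var/backups"):
    """Warn if nothing under the (assumed) backup directory has changed in 30 days."""
    if not os.path.isdir(path):
        return None
    newest = max((os.path.getmtime(os.path.join(path, f)) for f in os.listdir(path)),
                 default=0)
    if time.time() - newest > 30 * 86400:
        return "Newest file in %s is more than 30 days old" % path

def check_root_disk_full(threshold=0.90):
    """Warn if the root filesystem is more than 90% full."""
    usage = shutil.disk_usage("/")
    if usage.used / usage.total > threshold:
        return "Root filesystem is %.0f%% full" % (100 * usage.used / usage.total)

CHECKS = [check_old_backup, check_root_disk_full]   # append one function per odd thing

if __name__ == "__main__":
    for check in CHECKS:
        warning = check()
        if warning:
            print("ODD THING:", warning)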

With most everything happening on servers somewhere in the net/cloud these days, knowing exactly where you are and what you’ve got to work with is important. Being able to gather that data in a matter of minutes allows you to focus on the problem at hand, confident that you’ve done a thorough job.

All three of these tools were built slowly over time. Get started with a few simple things. The output of all three is just text – no fancy GUI or pretty plots are required. When you have time, write the code to gather the next most useful bit of information and add that to your tool.

Just like the old folk story of stone soup, your tools will get built over time with the contributions of others. Remember to thank each contributor for the gifts they give you and share what you have freely with others.


Other useful hints can be found in: The Every Computer Performance Book which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

 

The Five Minute Rule

I used to travel to companies to work on their performance problems. Once I arrived, and we had gone through the initial pleasantries, I would ask them a simple question:

How busy is your system right now?

If the person I asked had a ballpark estimate that they quickly confirmed with a meter, I’d know that whatever problem they called me to solve would be an obscure one that would take some digging to find. If they had no idea, or only knew the answer to a resolution of an entire day, then I was pretty sure that there would be plenty of performance-related ugliness to discover. This little question became the basis for a rule of thumb that I used my entire career.

The less a company knows about the work their system did in the last five minutes, the more deeply screwed up they are.

Your job is likely different from mine, but there is a general truth here. If the staff can easily access data that tells them how much work flowed through the system in the last few minutes (transactions, page views, etc.) and how the system is reacting to that work (key utilization numbers, queue depths, error counts, etc.), then they have the data to isolate and solve most common performance problems. Awareness is curative. The company will solve more of its own performance problems, see future performance problems coming sooner, and spend less money on outside performance experts.
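As a sketch of what that looks like in practice, here is a minimal rolling five-minute view. It assumes the third-party psutil package for CPU utilization and a hypothetical get_transaction_count() hook that you would wire to whatever workload counter your application actually exposes.

#!/usr/bin/env python3
"""A minimal sketch of five-minute awareness: sample the workload and a key
utilization once a minute and keep a rolling five-minute window."""
import collections
import time
import psutil   # third-party package: pip install psutil

def get_transaction_count():
    """Placeholder: return your application's cumulative transaction count.
    Returns a constant here so the sketch runs; replace with the real counter."""
    return 0

def watch(minutes=5):
    window = collections.deque(maxlen=minutes)        # keeps only the last five samples
    last_tx = get_transaction_count()
    while True:
        time.sleep(60)
        tx = get_transaction_count()
        cpu = psutil.cpu_percent(interval=1)          # % busy, measured over one second
        window.append((time.strftime("%H:%M"), tx - last_tx, cpu))
        last_tx = tx
        print("last %d minutes:" % len(window))
        for stamp, tx_per_min, cpu_pct in window:
            print("  %s  %6d tx/min  CPU %5.1f%%" % (stamp, tx_per_min, cpu_pct))

if __name__ == "__main__":
    watch()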

 


Other rules of thumb can be found in: The Every Computer Performance Book which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

When Does The Warmup End?

Unless you are focused on optimizing the time to restart an application, the meters you might collect at application start are pretty useless as things are warming up. The first few transactions will find the disks nearly idle, queues empty, locks unlocked, nothing useful in cache, etc. As the work flows in, queues will build, buffers will fill, and the application will settle down after a short period.

Many math geniuses have spent years trying to find a mathematical way to know when the warm-up period is over and you can start believing the data. To date there is no good mathematical answer. You just have to eyeball it. Fire up the application and let it run. Your eye will clearly note if, and when, the response times and utilizations stabilize. Ignore the data during the warm-up period.
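If it helps the eyeballing, here is a minimal sketch that plots average response time per minute from application start so the flattening is easy to see. It assumes matplotlib is available and that you have a response_times.csv file laid out as (timestamp, response_ms); adapt the parsing to whatever your logs actually contain.

#!/usr/bin/env python3
"""A minimal sketch for eyeballing the warm-up period: plot average response
time per minute and look for where the curve levels off."""
import csv
from collections import defaultdict
import matplotlib.pyplot as plt

def plot_warmup(logfile="response_times.csv"):
    per_minute = defaultdict(list)                   # minute number -> list of response times
    with open(logfile) as f:
        for timestamp, response_ms in csv.reader(f):
            per_minute[int(float(timestamp) // 60)].append(float(response_ms))
    minutes = sorted(per_minute)
    averages = [sum(per_minute[m]) / len(per_minute[m]) for m in minutes]
    plt.plot([m - minutes[0] for m in minutes], averages)
    plt.xlabel("Minutes since application start")
    plt.ylabel("Average response time (ms)")
    plt.title("Eyeball where this curve levels off; ignore the data before that")
    plt.show()

if __name__ == "__main__":
    plot_warmup()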


Other helpful hints can be found in: The Every Computer Performance Book which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

The Sample Length of The Meter

Any meter that gives you an averaged value has to average the results over a period of time. If you don’t precisely understand that averaging, then you can get into a lot of trouble.

The two charts show exactly the same data; the only difference is the sample length of the meter. In the first chart the data was averaged every minute. Notice the very impressive spike in utilization in the middle of the graph. During this spike this resource had little left to give.

In the second chart the same data was averaged every 10 minutes. Notice that the spike almost disappears because the samples were taken at such times that parts of the spike were averaged into different samples. Adjusting the sample length can dramatically change the story.
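You can see this effect for yourself with made-up numbers. This little sketch simulates two hours of per-second utilization with a three-minute spike in the middle, then averages the same data over one-minute and ten-minute windows.

#!/usr/bin/env python3
"""A small demonstration (with simulated data) of how the sample length of a
meter can hide a spike."""
import random

random.seed(1)
# Two hours of per-second utilization: ~30% background with a
# three-minute spike to ~95% starting one hour in.
seconds = 2 * 60 * 60
utilization = [random.gauss(30, 3) for _ in range(seconds)]
for s in range(3600, 3600 + 180):
    utilization[s] = random.gauss(95, 2)

def window_average(data, window_seconds):
    return [sum(data[i:i + window_seconds]) / window_seconds
            for i in range(0, len(data), window_seconds)]

print("Peak 1-minute average:  %5.1f%%" % max(window_average(utilization, 60)))
print("Peak 10-minute average: %5.1f%%" % max(window_average(utilization, 600)))
# The one-minute samples clearly show the ~95% spike; the ten-minute samples
# blend it with quieter seconds and report a far less alarming peak.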

Some meters just report a count, and you’ve got to know when that count gets reset to zero or rolls over because the value is too big for the variable to hold. Some values start incrementing at system boot, some at process birth.
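When you turn such a raw count into a rate yourself, the reset and rollover cases are where the mistakes hide. Here is a minimal sketch, assuming a 32-bit counter that wraps to zero; check what size your meter actually uses.

"""A minimal sketch of turning a cumulative counter into a rate, allowing for
the counter being reset to zero or rolling over."""

COUNTER_MAX = 2 ** 32   # assumed: a 32-bit counter that wraps back to zero

def rate_per_second(prev_count, curr_count, interval_seconds):
    if curr_count >= prev_count:
        delta = curr_count - prev_count
    else:
        # Counter went backwards: either it wrapped past its maximum value
        # or it was reset at boot/process restart. Assume a single wrap;
        # if the system actually rebooted, discard this sample instead.
        delta = (COUNTER_MAX - prev_count) + curr_count
    return delta / interval_seconds

# Example: a packet counter read 60 seconds apart, wrapping in between.
print(rate_per_second(4294967000, 704, 60))   # ~16.7 packets/sec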

Some meters calculate the average periodically on their own schedule, and you just sample the current results when you ask for the data. For example, a key utilization meter is calculated once every 60 seconds and, no matter what is going on, the system reports exactly the same utilization figure for the entire 60 seconds. This may sound like a picky detail to you now, but when you need to understand what’s happening in the first 30 seconds of market open, these little details matter.

There can be a big difference in the data you collect depending on how you collect and average it. With a one-second average you are buried in data. With a one-minute average you can miss a significant and sustained peak entirely, depending on when the samples land. A 10-minute average will also look reassuringly low because it averages the peaks with the valleys.

Take the time, when you have the time, to understand exactly when the meters are collected and what period they are averaged over. The best way to do that is to meter a mostly idle system and then use a little program to bring a load onto the system for a very precise amount of time and see what the meters report. The better you understand your tools, the more precisely and powerfully you can use them.
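Here is a minimal sketch of that calibration trick: a tiny program that keeps one CPU busy for a precise amount of time and prints the exact start and stop times, so you can see which samples the load lands in and how it is averaged.

#!/usr/bin/env python3
"""A minimal sketch of the "known load" trick: burn CPU on an otherwise idle
system for a precise amount of time, then compare against what the meters report."""
import time

def burn_cpu(seconds):
    start = time.time()
    print("Load started at", time.strftime("%H:%M:%S"))
    x = 0
    while time.time() - start < seconds:
        x += 1              # pointless arithmetic, just to keep one CPU busy
    print("Load stopped at", time.strftime("%H:%M:%S"))

if __name__ == "__main__":
    burn_cpu(90)   # e.g., 90 seconds of load to see which samples it lands in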


This hint and many others are in: The Every Computer Performance Book which is available at Amazon, Powell’s Books, and on iTunes.


 

Metering Deeply To Answer Questions About The Future

You build a performance model to answer a question that capacity planning or load testing can’t answer. Since you can’t test the future, you have to meter what you’ve got in clever ways and then mathematically project the future reality.

Performance Model Questions

The UltraBogus 3000 is a mythical state-of-the-art computer of yesteryear whose total downtime has never been exceeded by any competitor. It features fully-puffed marketing literature, a backwards-compatible front door, and a Stooge-enabled, three-idiot architecture that processes transactions with a minimum of efficiency. Its two-bit bus runs conveniently between your office and the repair facility every Tuesday.

Let’s start with a couple of performance model questions: “Will upgrading to the new UltraBogus 3000 computer (described above) solve our performance problem?” or “If we move just the X-type transactions to the XYZ system, will it be able to handle the load?” You can’t meter or load test a computer you don’t have, and the performance meters typically show the net result of all transactions, not just X-type transactions.

To answer these kinds of questions, you have to know the picky details of how work flows through your system and what resources are consumed when doing a specific type of transaction. Here are a couple of ways I’ve done this.

Finding Specific Transaction Resource Requirements

You can use load testing to explore the resource consumption and transaction path of a given transaction by gently loading the system with that kind of transaction during a relatively quiet time. Send enough transactions so the load is clear in the meters, but not so many that the response times start degrading. Any process or resource that shows a dramatic jump in activity is part of the transaction path. Imagine you got this data on a key process before, during, and after a five-minute load test on just transaction X:

[Chart: reads/sec, writes/sec, and CPU consumption of the key process before, during, and after the five-minute load test]

First notice whether the before and after consumption of resources is about the same. If so, it is a good sign that the user-supplied load did not change much during the test. Now do a bit of simple math to figure out the resource consumption of a given X transaction. The reads per second increased by about 100/sec during the test, so it’s reasonable to estimate that for every X transaction this key process does one read. The writes didn’t really change, so no writes are done for the X transaction. I’d recommend you repeat this test a couple of times; if you get about the same results each time, your confidence will grow and the possibility of error is reduced.

For CPU, the consumption during the load test jumped by about 600 milliseconds per second and therefore each X transaction requires 600 / 100 = 6ms of CPU.  Overall, this test tells us that the X transaction passes through this key process, does one read, does no writes, and burns six milliseconds of CPU.
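The same arithmetic, written out as code. The before and during numbers here are illustrative, chosen to match the deltas described above (reads up by about 100/sec, CPU up by about 600 milliseconds per second, with 100 X transactions per second injected).

"""The per-transaction arithmetic from the load test, with illustrative numbers."""

baseline_reads_per_sec  = 300    # reads/sec before the test (example value)
test_reads_per_sec      = 400    # reads/sec during the test (up by ~100/sec)
baseline_cpu_ms_per_sec = 100    # CPU ms consumed per second before (example value)
test_cpu_ms_per_sec     = 700    # CPU ms consumed per second during (up by ~600 ms/sec)
added_tx_per_sec        = 100    # rate of X transactions injected by the test

reads_per_tx  = (test_reads_per_sec - baseline_reads_per_sec) / added_tx_per_sec
cpu_ms_per_tx = (test_cpu_ms_per_sec - baseline_cpu_ms_per_sec) / added_tx_per_sec

print("Each X transaction does about %.1f read(s)" % reads_per_tx)          # ~1.0
print("and burns about %.0f ms of CPU in this process" % cpu_ms_per_tx)     # ~6 ms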

Please note, the math will never be precise due to sampling errors and random variations. This methodology can give you a value that is about right. If you need a value that is exactly right, then you need to run this test on an otherwise idle system, perhaps at the end of a scheduled downtime.

Finding The Transaction Path

You can use a network sniffer to examine the back-and-forth network traffic and record the exact nature of the interprocess communication. For political, technical, and often legal reasons, this is usually only possible on a test system. Send one transaction in at a time, separated by several seconds or by a ping, so there is no overlap and each unique interaction is clear. Study multiple examples of each transaction type.

The following timeline was created from the network sniffer data that showed the time, origin, and destination of each packet as these processes talked to each other. During this time there were two instances of the X transaction, with a ping added so the transaction boundaries were completely clear. We can learn many things from this timeline.

[Timeline: packets flowing between processes A, B, and C during two X transactions, separated by a ping]

First we know the transaction path for the X transaction (A to B to C to B to A), which can be very useful. Please note, a sniffer will only show you communication over the network. It won’t show all the dependencies on other processes and system resources that happen through other interprocess communication facilities. This gives you a good place to start, not the whole map of every dependency.

We can also see exactly how long each process works on each thing it receives. Process C worked on each X transaction for about 200 milliseconds before responding. Little’s Law (and common sense) shows that if you know the service time, you can find the maximum throughput, because:

MaxThroughput  ≤  1 / AverageServiceTime

Since this was an idle system, the wait time was zero, so the response time was the service time. Doing the calculation, 1 / 0.2 = 5, so Process C can handle at most five X transactions per second.

If the question for the model was: “If we don’t reengineer Process C, can we get to 500 X transactions/sec?” the answer is no. If the question was: “How many copies of Process C do we need to handle 500 X transactions a second?” the answer is at least 500 / 5 = 100, plus more because you never want to plan for anything to be at, or near, 100% busy at peak.
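Here is that arithmetic as a small sketch; the 30% headroom figure is an illustrative assumption, so pick whatever planning margin your shop uses.

"""The Little's Law sizing arithmetic for Process C, with an assumed headroom margin."""
import math

service_time_sec = 0.200                           # Process C works ~200 ms per X transaction
max_tx_per_sec_per_copy = 1 / service_time_sec     # <= 5 X transactions/sec per copy

target_tx_per_sec = 500
copies_at_100_percent_busy = target_tx_per_sec / max_tx_per_sec_per_copy   # 100 copies

headroom = 0.30                                    # never plan to run at or near 100% busy
copies_needed = math.ceil(copies_at_100_percent_busy / (1 - headroom))

print("Max throughput per copy: %.0f X transactions/sec" % max_tx_per_sec_per_copy)
print("Copies needed at 100%% busy: %.0f" % copies_at_100_percent_busy)
print("Copies needed with %.0f%% headroom: %d" % (headroom * 100, copies_needed))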

When it’s time to build a model, often you’ll need to know things that are difficult to discover. Share the problem widely, as others might have a clue, a meter, or a spark of inspiration that will help you get the data. Work to understand exactly what question the model is trying to answer, as that will give you insight into what alternative data or alternative approach might work.

While you’re exploring, keep an eye on the big picture as well. Any model of your computing world has to deal with the odd things that happen once in a while. Note things like backups, end of day/week/month processing, fail-over and disaster recovery plans.

This post was built on an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.

 

Hints On Metering For Capacity Planning

Capacity Planning is projecting your current situation into a somewhat busier future. To do this work you need to know what is being consumed under a relatively stable load, and then you scale that number to the projected load. So if some component of your computing world is 60% busy under the current load, it projects to 120% busy (an impossibility, and thus a serious bottleneck) under 2X the current load.
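The projection itself is simple enough to write down. Here is a minimal sketch where the component names, utilizations, and the 80% warning threshold are all illustrative assumptions.

"""A minimal sketch of the basic capacity-planning projection: scale each
component's measured utilization by the load growth factor and flag trouble."""

current_utilization = {        # % busy under the current, stable load (example values)
    "CPU":        45,
    "Disk array": 60,
    "Network":    25,
}
growth_factor = 2.0            # e.g., "can we handle twice the daily peak?"
alarm_threshold = 80           # flag anything projected above this as uncomfortably busy

for component, busy in sorted(current_utilization.items()):
    projected = busy * growth_factor
    flag = "  <-- bottleneck" if projected >= 100 else \
           "  <-- uncomfortably busy" if projected >= alarm_threshold else ""
    print("%-12s %5.0f%% now -> %5.0f%% at %.1fx load%s"
          % (component, busy, projected, growth_factor, flag))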


To meter for capacity planning you should collect data from every part of your computing world that shows:

  • The current workload (transaction rate)
  • How busy it is (utilization)
  • How full it is (size, configuration or capacity limits)

Essentially you are looking to see how close you are to hitting any limit to growth, no matter how insignificant that limit may seem.

If there are well understood and closely watched workload meters for your system (e.g., “Currently we are handling 5,500 transactions per minute”) then collect them. You can do capacity planning without a workload meter since, in most cases I’ve worked on, the boss asked for a general scaling of the current load, like: “Can we handle twice our daily peak load?”

It is best to meter when the system is under a stable load, one where the load is not changing rapidly, because then you can get several samples to be sure you are not seeing some odd things that are not connected to the overall load. Below we see some samples where the load is stable over time, but there are things to notice here.

[Table: metering samples over time showing TX/min, disk busy, and CPU under a stable load]

Ignoring the sample at 12:15 for the moment, notice that the overall load is stable. It will never be perfectly stable, with each sample giving you exactly the same numbers; some variation will always happen. Values plus or minus 10% are fine.

If the overall load is never stable, then pick some metered value, like X TX/min, find all the samples you collected that show X TX/min (±10%), and see if the other numbers you are tracking, like disk busy and CPU, are stable in relation to it.
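Here is a minimal sketch of that filtering, with made-up sample data: keep only the samples within ±10% of the chosen TX/min value and then look at how much the other metrics vary across them.

"""A minimal sketch of picking comparable samples out of an unstable load."""

samples = [   # (time, tx_per_min, disk_busy_pct, cpu_pct) -- made-up numbers
    ("12:00", 1000, 30, 40),
    ("12:05", 1450, 45, 61),
    ("12:10",  990, 31, 38),
    ("12:15", 1020, 30, 79),
    ("12:20",  700, 22, 27),
    ("12:25", 1010, 29, 41),
]

target_tx = 1000
matching = [s for s in samples if abs(s[1] - target_tx) <= 0.10 * target_tx]

for stamp, tx, disk, cpu in matching:
    print("%s  %5d tx/min  disk %2d%%  cpu %2d%%" % (stamp, tx, disk, cpu))

cpus = [s[3] for s in matching]
print("CPU at ~%d tx/min ranges from %d%% to %d%%" % (target_tx, min(cpus), max(cpus)))
# The 12:15 sample stands out: same load, same disk busy, but CPU nearly doubles.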

In the 12:15 sample the CPU usage essentially doubles even though the disk busy and the TX/min number are stable. This is either some oddball thing that happened and is not normal, or perhaps at 12:15 every day this happens, or perhaps this happens 10 times a day at somewhat random times. Either set yourself to solving this mystery, or if you are sure this is a non-repeating event, ignore the 12:15 data.

This is why you don’t just meter for 20 minutes on one day when doing capacity planning. You’ll look at lots of data, over many days, and try to come up with a reasonable estimate of how busy everything is under a given load. If there are days when unique demands, not related to load, are placed on your computing world, be sure to meter through those times, too. The whole reason for capacity planning is typically to ensure you have enough resources to handle a peak load, even if it arrives on a day when the system has other things to do, too.

A common mistake people make is to look at the CPU consumption of a process and use that to estimate how much more work it could handle. They might notice that a process is only consuming six seconds of CPU per minute and thus deduce that the process is only 10% busy (6/60 = 0.10), but that’s usually wrong. Processes do other things besides burn CPU. They wait for disk IO, they wait for other processes to return an answer they need, they wait for resource locks, etc. A process can be doing all the work it can and still burn only a small fraction of the CPU it could if it were just crunching numbers. Gather the per-process CPU consumption, but don’t mistake it for a straightforward way to gauge how busy a process is.

When metering for capacity planning, gather any data you can for limits that can be hit. Notice things that kill performance, or sometimes whole applications, when they get full, empty, or reach some magic number. Any limit you hit will hurt.


Some examples of limits your computing world might hit: disk space, available memory, paging space, licensing limits on the application, number of server processes, maximum queue depth, networking port limits, database limits.

Also consider the limits that per-user or per-directory security and quotas bring to the party. Look for limits everywhere and record how close you are to them.

This is an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.

How To Meter a Short Duration Problem

Some performance problems come and go in a minute or two. Depending on the industry, the company goals, and the expectations of the users, these problems are either a big deal or ignored with a yawn.

For short duration performance problems where you know when they will start (market open, 10pm backup, etc.) here are some tips for setting up special metering to catch them:

  • Start your meters well before the problem happens. Have them run a few times to be sure they are working as expected and have them just sleep until 15 minutes before the problem starts.
  • Sample at an interval that is no more than ¼ of the expected duration of the event – this gives you multiple samples during the event.
  • Let the meters run for 15 minutes after the problem is usually over.
  • Now you have meters collected before, during, and after the event. Compare and contrast them looking for what changed and what stayed the same during the event.

It is quite common for people to be suspicious that the new metering you are running is making the problem much worse. That’s why it is a very good idea to have it running well before the anticipated start of the problem. There is some cause for this suspicion as a small typo can turn a “once per minute” metering macro into a “fast as you can” metering macro that burns CPU, fills disks, and locks data structures at a highly disruptive rate. Like a physician, your primary goal should be: First, do no harm. It is always a good idea to test your meters at a non-critical time and (if possible) to meter the meters so you can show the resource usage of your meters.
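Here is a minimal sketch of a metering loop built with that in mind: it sleeps between samples instead of spinning, appends to a log, and records how much CPU the metering process itself has used so you can show its cost. It assumes a Unix-ish system, and the once-a-minute uptime snapshot is just an illustrative stand-in for your real meters.

#!/usr/bin/env python3
"""A minimal sketch of a "do no harm" metering loop: fixed schedule, sleeps
between samples, and meters its own CPU cost."""
import resource
import subprocess
import time

def meter(interval_sec=60, samples=60, outfile="meter.log"):
    with open(outfile, "a") as out:
        for _ in range(samples):
            next_wakeup = time.time() + interval_sec      # fixed schedule, never busy-wait
            snapshot = subprocess.run("uptime", shell=True,
                                      capture_output=True, text=True).stdout.strip()
            self_cpu = (resource.getrusage(resource.RUSAGE_SELF).ru_utime
                        + resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime)
            out.write("%s | %s | CPU used by metering so far: %.2fs\n"
                      % (time.strftime("%H:%M:%S"), snapshot, self_cpu))
            out.flush()
            time.sleep(max(0, next_wakeup - time.time()))

if __name__ == "__main__":
    meter()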

If the problem happens without warning, then, if possible, identify something or some event that usually precedes the problem that you can “trigger on” to start the intensive metering. A trigger might be when the queue has over X things in it, or when something fails, or when something restarts, etc. Finding the trigger can sometimes be frustrating, as correlation does not always mean causality. Keep searching the logs and any other meters you have.
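Here is a minimal sketch of that idea: poll a cheap indicator until the trigger condition is seen, then switch on the intensive metering. The queue_depth() hook, the threshold, and the use of top as the intensive meter are all illustrative assumptions.

#!/usr/bin/env python3
"""A minimal sketch of trigger-based metering: a cheap polling loop that
starts the expensive metering only when the trigger condition is seen."""
import subprocess
import time

def queue_depth():
    """Placeholder: return the current depth of whatever queue precedes the problem.
    Returns 0 here so the sketch runs; wire it to your real meter."""
    return 0

def wait_for_trigger(threshold=50, poll_sec=5):
    while queue_depth() < threshold:          # cheap check, repeated as long as it takes
        time.sleep(poll_sec)
    print("Trigger fired at", time.strftime("%H:%M:%S"))

def intensive_metering(minutes=15):
    # Once triggered, capture a detailed process snapshot every 10 seconds.
    for _ in range(minutes * 6):
        subprocess.run("top -b -n 1 >> intensive.log", shell=True)
        time.sleep(10)

if __name__ == "__main__":
    wait_for_trigger()
    intensive_metering()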

Sometimes, all you have to go on is that it “happens in the morning” or “mostly on Mondays.” Work with what you’ve got and meter during those times.

If the problem has no known trigger and seems to happen randomly, you’ll have to intensively meter for it until it happens again. This will burn some system resources and give you a mountain of data to wade through. If this is a serious problem, then buckle up and do the work.