Three Tools You Should Build

Even though it is a good idea to keep an eye on performance all the time, lots of companies only let you pay periodic attention to it. They focus on it when there is a problem, or before the annual peak, but the rest of the year they give you other tasks to work on.

This is a lot like my old job in Professional Services – a customer has a problem, I fly in, find the trouble, and then don’t see them until the next problem crops up.

To do that job I relied on three tools that I created for myself and that you might start building to help you work on periodic performance problems.

Three Tools

List All – The first tool would dig through the system and list all the things that could be known about the system: config options, OS release, IO, network, number of processes, what files were open, etc. The output was useful by itself as now I had looked in every corner of the system and knew what I was working on. Several times it saved me days of work as the customer had initially logged me into the wrong system. It always made my work easier as I had all the data I needed, in one place, conveniently organized, and in a familiar order.
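If you want to start building a List All tool of your own, here is a minimal sketch in Python. The commands it shells out to are Linux examples and assumptions on my part; swap in whatever applies to your systems.

```python
#!/usr/bin/env python3
"""A minimal 'List All' sketch: dump everything knowable about this system
into one plain-text report. The commands below are Linux examples --
substitute whatever applies to your own systems."""
import platform
import subprocess
from datetime import datetime

# Each entry is a label and a shell command whose output we want captured.
COMMANDS = [
    ("OS release",         "uname -a"),
    ("Uptime and load",    "uptime"),
    ("Memory",             "free -m"),
    ("Disk space",         "df -h"),
    ("Network interfaces", "ip addr"),
    ("Process list",       "ps -ef"),
    ("Kernel settings",    "sysctl -a"),
]

def run(cmd):
    """Run a shell command and return its output, or the error text."""
    try:
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=30).stdout
    except Exception as exc:
        return f"<< failed: {exc} >>"

def main():
    print(f"List All report for {platform.node()} at {datetime.now():%Y-%m-%d %H:%M}")
    for label, cmd in COMMANDS:
        print(f"\n===== {label} ({cmd}) =====")
        print(run(cmd).rstrip())

if __name__ == "__main__":
    main()
```

Redirect the output to a dated text file each time you run it, so the next tool has something to compare against.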

Changes – If I’d been to this customer before, this tool allowed me to compare the state of the system with the previous state. It simply read through the output of the List All I’d just done and compared it with the data I had collected on my last visit. Boy, was this useful, as I could quickly check the customer’s assurance that “Nothing had changed since my last visit.” I remember the shocked look on the customer’s face when I asked: “Why did you downgrade memory?”
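A Changes tool can be as simple as a diff of two List All reports. Here is a minimal sketch that assumes the reports are the plain-text files produced by the tool above:

```python
#!/usr/bin/env python3
"""A minimal 'Changes' sketch: compare today's List All report with the one
saved on the previous visit and show only what is different."""
import sys
import difflib

def main(old_path, new_path):
    with open(old_path) as f:
        old = f.readlines()
    with open(new_path) as f:
        new = f.readlines()
    # unified_diff shows only the lines that changed, with a little context.
    changes = list(difflib.unified_diff(old, new, fromfile=old_path, tofile=new_path))
    if changes:
        sys.stdout.writelines(changes)
    else:
        print("No changes detected -- which is worth confirming with the customer.")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```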

Odd Things – Most performance-limiting, or availability-threatening, behavior is easy to spot. But for any OS, and any application, there are some things that can really hurt performance that you have to dig for in odd places with obscure meters. These are a pain to look for and are rare, so nobody looks for them. Through the years, as I discovered each odd thing, I would write a little tool to help me detect the problem and then add it to the end of my Odd Things tool. I’d run this tool on every customer system I looked at and, on occasion, I would find something that surprised everyone (“You haven’t backed up this system in over a year.”) or solve a performance problem by noticing a foolish, less-than-optimal configuration choice.
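Here is a minimal sketch of how an Odd Things tool can be structured. The two checks shown (stale backups, swap in use), their paths, and their thresholds are illustrative placeholders, not a definitive list; the point is the pattern of adding one small check per odd thing.

```python
#!/usr/bin/env python3
"""A minimal 'Odd Things' sketch: a list of small checks that each look for
one rare-but-painful condition. Add a new check every time something odd
surprises you. Paths and thresholds here are illustrative only."""
import os
import time

def check_stale_backups(backup_dir="/var/backups", max_age_days=7):
    """Warn if the newest file in the backup directory is suspiciously old."""
    try:
        newest = max(os.path.getmtime(os.path.join(backup_dir, f))
                     for f in os.listdir(backup_dir))
    except (OSError, ValueError):
        return f"Could not read {backup_dir} -- is anything being backed up?"
    age_days = (time.time() - newest) / 86400
    if age_days > max_age_days:
        return f"Newest backup in {backup_dir} is {age_days:.0f} days old."
    return None

def check_swap_in_use(max_swap_mb=100):
    """Warn if the system is leaning on swap (Linux /proc/meminfo example)."""
    try:
        info = dict(line.split(":") for line in open("/proc/meminfo"))
        swap_used_mb = (int(info["SwapTotal"].split()[0]) -
                        int(info["SwapFree"].split()[0])) / 1024
    except (OSError, KeyError, ValueError):
        return None
    if swap_used_mb > max_swap_mb:
        return f"{swap_used_mb:.0f} MB of swap in use."
    return None

CHECKS = [check_stale_backups, check_swap_in_use]

if __name__ == "__main__":
    for check in CHECKS:
        warning = check()
        if warning:
            print(f"ODD THING: {warning}")
```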

With most everything happening on servers somewhere in the net/cloud these days, knowing exactly where you are and what you’ve got to work with is important. Being able to gather that data in a matter of minutes lets you focus on the problem at hand, confident that you’ve done a thorough job.

All three of these tools were built slowly over time. Get started with a few simple things. The output of all three is just text – no fancy GUI or pretty plots are required. When you have time, write the code to gather the next most useful bit of information and add that to your tool.

Just like the old folk story of stone soup, your tools will get built over time with the contributions of others. Remember to thank each contributor for the gifts they give you and share what you have freely with others.


Other useful hints can be found in: The Every Computer Performance Book which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

 

The Five Minute Rule

I used to travel to companies to work on their performance problems. Once I arrived, and we had gone through the initial pleasantries, I would ask them a simple question:

How busy is your system right now?

If the person I asked had a ballpark estimate that they quickly confirmed with a meter, I’d know that whatever problem they called me to solve would be an obscure one that would take some digging to find. If they had no idea, or only knew the answer to a resolution of an entire day, then I was pretty sure there would be plenty of performance-related ugliness to discover. This little question became the basis for a rule of thumb that I used for my entire career.

The less a company knows about the work their system did in the last five minutes, the more deeply screwed up they are.

Your job is likely different than mine was, but there is a general truth here. If the staff can easily access data that tells them how much work flowed through the system in the last few minutes (transactions, page views, etc.) and how the system is reacting to that work (key utilization numbers, queue depths, error counts, etc.), then they have the data to isolate and solve most common performance problems. Awareness is curative. The company will solve more of its own performance problems, see future performance problems coming sooner, and spend less money on outside performance experts.

 


Other rules of thumb can be found in: The Every Computer Performance Book which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

When Does The Warmup End?

Unless you are focused on optimizing the time to restart an application, the meters you might collect at application start are pretty useless as things are warming up. The first few transactions will find the disks nearly idle, queues empty, locks unlocked, nothing useful in cache, etc. As the work flows in, queues will build, buffers will fill, and the application will settle down after a short period.

Many math geniuses have spent years trying to find a mathematical way to know when the warm-up period is over and you can start believing the data. To date there is no good mathematical answer. You just have to eyeball it. Fire up the application and let it run. Your eye will clearly note if, and when, the response times and utilizations stabilize. Ignore the data collected during the warm-up period.


Other helpful hints can be found in: The Every Computer Performance Book which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

The Sample Length of The Meter

Any meter that gives you an averaged value has to average the results over a period of time. If you don’t precisely understand that averaging, then you can get into a lot of trouble.

The two charts below show exactly the same data; the only difference is the sample length of the meter. In the first chart the data was averaged every minute. Notice the very impressive spike in utilization in the middle of the graph. During this spike this resource had little left to give.

[Chart: utilization averaged over one-minute samples, showing a pronounced spike]

In the second chart the same data was averaged every 10 minutes. Notice that the spike almost disappears, because the samples were taken at such times that parts of the spike were averaged into different samples. Adjusting the sample length can dramatically change the story.

[Chart: the same data averaged over 10-minute samples, with the spike largely smoothed away]

Some meters just report a count, and you’ve got to know when that count gets reset to zero or rolls over because the value is too big for the variable to hold. Some values start incrementing at system boot, some at process birth.

Some meters calculate the average periodically on their own schedule, and you just sample the current results when you ask for the data. For example, a key utilization meter is calculated once every 60 seconds and, no matter what is going on, the system reports exactly the same utilization figure for the entire 60 seconds. This may sound like a picky detail to you now, but when you need to understand what’s happening in the first 30 seconds of market open, these little details matter.

Below you will see a big difference in the data you collect depending on how you collect and average it. In the one-second average (red line) you are buried in data. In the one-minute average (sampled in the yellow area) you missed a significant and sustained peak because of when you sampled. The 10-minute average (sampled in the green area) will also look reassuringly low because it averages the peaks and the valleys.

[Chart: one-second, one-minute, and 10-minute averages of the same utilization data]

Take the time, when you have the time, to understand exactly when the meters are collected and what period they are averaged over. The best way to do that is to meter a mostly idle system and then use a little program to bring a load onto the system for a very precise amount of time and see what the meters report. The better you understand your tools, the more precisely and powerfully you can use them.
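One way to do that calibration is with a little program that burns CPU for a precise amount of time while you watch what the meters report. A minimal sketch, assuming a mostly idle test system:

```python
#!/usr/bin/env python3
"""Burn CPU on one core for a precise number of seconds, so you can see how
(and when) the utilization meters report that known load."""
import sys
import time

def burn_cpu(seconds):
    """Spin in a tight loop until the requested wall-clock time has passed."""
    start = time.time()
    x = 0.0001
    while time.time() - start < seconds:
        x = (x * 1.0000001) % 1e6   # pointless arithmetic to keep the CPU busy

if __name__ == "__main__":
    duration = float(sys.argv[1]) if len(sys.argv) > 1 else 30.0
    print(f"Burning one CPU for {duration} seconds starting at {time.strftime('%H:%M:%S')}")
    burn_cpu(duration)
    print(f"Done at {time.strftime('%H:%M:%S')} -- now compare with what the meters claim.")
```

If the meters only update every 60 seconds, a precisely timed 30-second burn will show up late, averaged down, or split across two samples, and that tells you exactly how to read them.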


This hint and many others are in The Every Computer Performance Book, which is available at Amazon, Powell’s Books, and on iTunes.


 

Metering Deeply To Answer Questions About The Future

You build a performance model to answer a question that capacity planning or load testing can’t answer. Since you can’t test the future, you have to meter what you’ve got in clever ways and then mathematically project the future reality.

Performance Model Questions

The UltraBogus 3000 is a mythical state-of-the-art computer of yesteryear whose total downtime has never been exceeded by any competitor. It features fully-puffed marketing literature, a backwards-compatible front door, and a Stooge-enabled, three-idiot architecture that processes transactions with a minimum of efficiency. Its two-bit bus runs conveniently between your office and the repair facility every Tuesday.

Let’s start with a couple of performance model questions: “Will upgrading to the new UltraBogus 3000 computer solve our performance problem?” or “If we move just the X-type transactions to the XYZ system, will it be able to handle the load?” You can’t meter or load test a computer you don’t have, and the performance meters typically show the net result of all transactions, not just X-type transactions.

To answer these kinds of questions, you have to know the picky details of how work flows through your system and what resources are consumed when doing a specific type of transaction. Here are a couple of ways I’ve done this.

Finding Specific Transaction Resource Requirements

You can use load testing to explore the resource consumption and transaction path of a given transaction by gently loading the system with that kind of transaction during a relatively quiet time. Send enough transactions that the load is clear in the meters, but not so many that the response times start degrading. Any process or resource that shows a dramatic jump in activity is part of the transaction path. Imagine you got this data on a key process before, during, and after a five-minute load test on just transaction X:

[Figure: reads per second, writes per second, and CPU milliseconds per second for the key process before, during, and after the load test]

Now do a bit of simple math to figure out the resource consumption of a single X transaction. First check that the before and after consumption of resources is about the same; if so, it is a good sign that there was no big change in the user-supplied load during the test. The test load here was about 100 X transactions per second. Reads increased by about 100/sec during the test, so it’s reasonable to estimate that this key process does one read for every X transaction. The writes didn’t really change, so no writes are done for the X transaction. I’d recommend you repeat this test a couple of times; if you get about the same results each time, your confidence will grow and the possibility of error shrinks.

For CPU, the consumption during the load test jumped by about 600 milliseconds per second, and therefore each X transaction requires 600 / 100 = 6 ms of CPU. Overall, this test tells us that the X transaction passes through this key process, does one read, does no writes, and burns six milliseconds of CPU.
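The arithmetic is simple enough to wrap in a small helper so you can rerun it for every process and resource you metered. A sketch using numbers consistent with this example; the before/during/after values are made up for illustration, and the test load is the 100 X transactions per second used above:

```python
"""Estimate per-transaction resource cost from before/during/after samples.
The sample numbers are illustrative: reads up by ~100/sec and CPU up by
~600 ms/sec during a test load of 100 X transactions per second."""

def per_transaction_cost(before, during, after, test_tx_per_sec):
    # Sanity check: before and after should be similar, or the background
    # load shifted during the test and the estimate is suspect.
    baseline = (before + after) / 2.0
    if baseline and abs(before - after) / baseline > 0.2:
        print("Warning: background load changed during the test")
    return (during - baseline) / test_tx_per_sec

# Reads went from ~50/sec to ~150/sec: about one read per X transaction.
print(per_transaction_cost(before=50, during=150, after=50, test_tx_per_sec=100))
# CPU went from ~100 ms/sec to ~700 ms/sec: about 6 ms of CPU per X transaction.
print(per_transaction_cost(before=100, during=700, after=100, test_tx_per_sec=100))
```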

Please note, the math will never be precise due to sampling errors and random variations. This methodology can give you a value that is about right. If you need a value that is exactly right, then you need to run this test on an otherwise idle system, perhaps at the end of a scheduled downtime.

Finding The Transaction Path

You can use a network sniffer to examine the back-and-forth network traffic and record the exact nature of the interprocess communication. For political, technical, and often legal reasons, this is usually only possible on a test system. Send in one transaction at a time, separated by several seconds or by a ping, so there is no overlap and each unique interaction is clear. Study multiple examples of each transaction type.

The following timeline was created from the network sniffer data that showed the time, origin, and destination of each packet as these processes talked to each other. During this time there were two instances of the X transaction, with a ping added so the transaction boundaries were completely clear. We can learn many things from this timeline.

[Figure: packet timeline between processes A, B, and C for two X transactions, with a ping marking the transaction boundaries]

First, we know the transaction path for the X transaction (A to B to C to B to A), which can be very useful. Please note, a sniffer will only show you communication over the network. It won’t show the dependencies on other processes and system resources that happen through other interprocess communication facilities. This gives you a good place to start, not a complete map of every dependency.

We can also see exactly how long each process works on each thing it receives. Process C worked on each X transaction for about 200 milliseconds before responding. Little’s law (and common sense) shows that if you know the service time, you can find the maximum throughput because:

MaxThroughput  ≤  1 / AverageServiceTime

Since this was an idle system, the wait time was zero, so the response time was the service time. Doing the calculation, 1 / 0.2 = 5, so Process C can only handle five X transactions per second.

If the question for the model was “If we don’t reengineer Process C, can we get to 500 X transactions/sec?” the answer is no. If the question was “How many copies of Process C do we need to handle 500 X transactions per second?” the answer is at least 500 / 5 = 100, plus more, because you never want to plan for anything to be at, or near, 100% busy at peak.
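That back-of-the-envelope math is easy to codify. A minimal sketch; the 70% planned peak utilization is an assumption you would pick for yourself:

```python
"""Max throughput from service time (Little's law) and the number of copies
of a process needed to handle a target load at a sane peak utilization."""
import math

def max_throughput(avg_service_time_sec):
    # A serial resource can do at most 1/ServiceTime units of work per second.
    return 1.0 / avg_service_time_sec

def copies_needed(target_tx_per_sec, avg_service_time_sec, peak_utilization=0.7):
    # peak_utilization is a planning choice: never plan to run at 100% busy.
    per_copy = max_throughput(avg_service_time_sec) * peak_utilization
    return math.ceil(target_tx_per_sec / per_copy)

print(max_throughput(0.2))      # Process C tops out at 5 X transactions/sec
print(copies_needed(500, 0.2))  # 143 copies at a 70% planned peak utilization
```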

When it’s time to build a model, often you’ll need to know things that can be difficult to discover. Share the problem widely as others might have a clue, a meter, or a spark of inspiration that will help you get the data. Work to understand exactly what question the model is trying to solve as that will give you insight as to what alternative data or alternative approach might work.

While you’re exploring, keep an eye on the big picture as well. Any model of your computing world has to deal with the odd things that happen once in a while. Note things like backups, end-of-day/week/month processing, fail-over, and disaster recovery plans.

This post was built on an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.

 

Hints On Metering For Capacity Planning

Capacity planning is projecting your current situation into a somewhat busier future. To do this work you need to know what is being consumed under a relatively stable load, and then you scale that number to the projected load. So if some component of your computing world is 60% busy under the current load, the projection puts it at 120% busy under 2X the current load – an impossibility, and thus a serious bottleneck.
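The core projection is nothing more than scaling current utilization by the growth factor and flagging anything that crosses a planning threshold. A minimal sketch with made-up component names and numbers:

```python
"""Project current utilization to a busier future and flag bottlenecks.
Component names, utilizations, and the 80% planning limit are illustrative."""

def project(utilizations, load_factor, planning_limit=0.8):
    for name, busy in utilizations.items():
        projected = busy * load_factor
        flag = "  <-- bottleneck" if projected >= planning_limit else ""
        print(f"{name:10s} {busy:5.0%} now -> {projected:5.0%} at {load_factor}X{flag}")

# CPU at 60% now projects to 120% at 2X the load -- a serious bottleneck.
project({"CPU": 0.60, "disk03": 0.35, "net": 0.20}, load_factor=2.0)
```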


To meter for capacity planning you should collect data from every part of your computing world that shows:

  • The current workload (transaction rate)
  • How busy it is (utilization)
  • How full it is (size, configuration or capacity limits)

Essentially you are looking to see how close you are to hitting any limit to growth, no matter how insignificant.

If there are well understood and closely watched workload meters for your system (e.g., “Currently we are handling 5,500 transactions per minute”), then collect them. You can do capacity planning without a workload meter because, in most cases I’ve worked on, the boss asked for a general scaling of the current load, such as: “Can we handle twice our daily peak load?”

It is best to meter when the system is under a stable load, one where the load is not changing rapidly, because then you can get several samples to be sure you are not seeing some odd things that are not connected to the overall load. Below we see some samples where the load is stable over time, but there are things to notice here.

[Figure: a series of samples showing TX/min, disk busy, and CPU utilization under a stable load]

Ignoring the sample at 12:15 for the moment, notice that the overall load is stable. It will never be perfectly stable, where each sample gives you exactly the same numbers; some variation will always happen. Values plus or minus 10% are fine.

If the overall load is never stable, then pick some metered value, like X TX/min, find all the samples you collect that show X TX/min (± 10%), and see if the other numbers you are tracking, like disk busy and CPU, are stable in relation to it.

In the 12:15 sample the CPU usage essentially doubles even though the disk busy and the TX/min number are stable. This is either some oddball thing that happened and is not normal, or perhaps at 12:15 every day this happens, or perhaps this happens 10 times a day at somewhat random times. Either set yourself to solving this mystery, or if you are sure this is a non-repeating event, ignore the 12:15 data.

This is why you don’t just meter for 20 minutes on one day when doing capacity planning. You’ll look at lots of data, over many days, and try to come up with a reasonable estimate of how busy everything is under a given load. If there are days when unique demands, not related to load, are placed on your computing world, be sure to meter through those times, too. The whole reason for capacity planning is typically to ensure you have enough resources to handle a peak load, even if it arrives on a day when the system has other things to do.

A common mistake people make is to look at the CPU consumption of a process and use that to estimate how much more work it could handle. They might notice that a process is consuming only six seconds of CPU per minute and deduce that the process is only 10% busy (6/60 = 0.10), but that’s usually wrong. Processes do other things besides burn CPU: they wait for disk IO, they wait for other processes to return an answer they need, they wait for resource locks, and so on. A process can be doing all the work it can and still burn only a small fraction of the CPU it would if it were just crunching numbers. Gather the per-process CPU consumption, but don’t mistake it for a straightforward gauge of how busy a process is.

When metering for capacity planning, gather any data you can for limits that can be hit. Notice things that kill performance, or sometimes whole applications, when they get full, empty, or reach some magic number. Any limit you hit will hurt.


Some examples of limits your computing world might hit: disk space, available memory, paging space, licensing limits on the application, number of server processes, maximum queue depth, networking port limits, database limits.

Also consider the limits that per-user or per-directory security and quotas bring to the party. Look for limits everywhere and record how close you are to them.

This is an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.

How To Meter a Short Duration Problem

Some performance problems come and go in a minute or two. Depending on the industry, the company goals, and the expectations of the users, these problems are either a big deal or ignored with a yawn.

For short duration performance problems where you know when they will start (market open, 10pm backup, etc.) here are some tips for setting up special metering to catch them:

  • Start your meters well before the problem happens. Have them run a few times to be sure they are working as expected and have them just sleep until 15 minutes before the problem starts.
  • Sample at an interval no longer than ¼ of the expected duration of the event – this gives you multiple samples during the event (see the metering-loop sketch after this list).
  • Let the meters run for 15 minutes after the problem is usually over.
  • Now you have meters collected before, during, and after the event. Compare and contrast them looking for what changed and what stayed the same during the event.
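Here is a minimal sketch of such a metering loop. The 9:30 start time and the uptime command are stand-ins for your real event and your real meters:

```python
#!/usr/bin/env python3
"""A sketch of special metering for a short problem with a known start time:
prove the meters work, sleep until shortly before the event, then sample at
a fixed interval until well after it is over. The 9:30 event time and the
uptime command are placeholders -- use your own event and meters."""
import subprocess
import time
from datetime import datetime, timedelta

EVENT_START  = datetime.now().replace(hour=9, minute=30, second=0, microsecond=0)
LEAD_IN      = timedelta(minutes=15)   # start metering 15 minutes early
EVENT_LENGTH = timedelta(minutes=2)    # expected duration of the problem
TAIL         = timedelta(minutes=15)   # keep metering 15 minutes after
INTERVAL_SEC = EVENT_LENGTH.total_seconds() / 4   # at least 4 samples in the event

def meter_once():
    """Placeholder: capture whatever meters you care about, with a timestamp."""
    stamp = datetime.now().isoformat(timespec="seconds")
    out = subprocess.run("uptime", shell=True, capture_output=True, text=True).stdout
    print(f"{stamp} {out.strip()}", flush=True)

# Prove the meters work, then sleep until the lead-in window opens.
meter_once()
wait = (EVENT_START - LEAD_IN - datetime.now()).total_seconds()
if wait > 0:
    time.sleep(wait)

stop_at = EVENT_START + EVENT_LENGTH + TAIL
while datetime.now() < stop_at:
    meter_once()
    time.sleep(INTERVAL_SEC)
```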

It is quite common for people to be suspicious that the new metering you are running is making the problem much worse. That’s why it is a very good idea to have it running well before the anticipated start of the problem. There is some cause for this suspicion as a small typo can turn a “once per minute” metering macro into a “fast as you can” metering macro that burns CPU, fills disks, and locks data structures at a highly disruptive rate. Like a physician, your primary goal should be: First, do no harm. It is always a good idea to test your meters at a non-critical time and (if possible) to meter the meters so you can show the resource usage of your meters.

If the problem happens without warning, then, if possible, identify something or some event that usually precedes the problem that you can “trigger on” to start the intensive metering. A trigger might be when the queue has over X things in it, or when something fails, or when something restarts, etc. Finding the trigger can sometimes be frustrating, as correlation does not always mean causality. Keep searching the logs and any other meters you have.

Sometimes, all you have to go on is that it “happens in the morning” or “mostly on Mondays.” Work with what you’ve got and meter during those times.

If the problem has no known trigger and seems to happen randomly, you’ll have to intensively meter for it until it happens again. This will burn some system resources and give you a mountain of data to wade through. If this is a serious problem, then buckle up and do the work.

Errors Are Really Interesting Too

Errors, even little ones, can be performance killers. Collect every meter you can that tracks errors and other unfortunate events. Over time investigate what they are telling you.

  • What is the nature of this problem?
  • What causes this problem?
  • Why is this problem happening?
  • How does the system work around (or suffer because of) this problem?

The errors that most affect response time and throughput tend to be “timeout” errors, where something waited and waited and finally gave up. Big problems with timeout errors tend to show up as suspiciously low utilization: there is work waiting, but key resources are less busy than normal.

Some errors are unavoidable. You will always see a few of them in the data. The key is to know what’s normal. When monitoring errors, notice when there are a lot more errors than usual for a given transaction rate. Investigate that.

[Figure: transactions per minute and errors per minute over time; the error rate climbs sharply after 11:15]

In the graph above the transaction rate is fairly steady, but just after 11:15 the error rate takes off. Don’t panic; keep a sense of scale here. At 12:00 there are about 3500 transactions per minute and a little less than 80 errors per minute, so we are seeing approximately one error for every 43 transactions. You should still investigate, especially if the response time increased at the same time. Given the low number of errors per transaction, this error is unlikely to be the cause of an overall response time problem, but it might be an interesting clue as to what’s going on.
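Keeping that sense of scale is easier if you track errors per transaction rather than raw error counts. A minimal sketch; the baseline ratio and the multiplier are assumptions you would tune from your own history:

```python
"""Flag when the error rate is out of line for the current transaction rate.
The baseline ratio and multiplier are assumptions to tune from your own
historical data."""

def errors_look_abnormal(tx_per_min, errors_per_min,
                         baseline_errors_per_tx=0.005, multiplier=3.0):
    if tx_per_min == 0:
        return errors_per_min > 0
    ratio = errors_per_min / tx_per_min
    return ratio > baseline_errors_per_tx * multiplier

# The 12:00 sample from the graph: ~3500 TX/min and ~80 errors/min,
# about one error for every 43 transactions.
print(errors_look_abnormal(3500, 80))   # True with this baseline -- worth a look
```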

Meter Response Time From The Inside Out

The whole reason companies build applications is to handle the work with a reasonable response time. To do that well you need to monitor both internal and external response time for the transactions you care about.


There are no response time meters in the systems, or other technology, that your computing world is built out of, because the builders of that equipment don’t know how you define a transaction. You are going to have to select the transactions you care about (e.g., the SHOP transaction is 98% of your computing workload, the BUY transaction brings in 100% of the money) and find response time meters for yourself. To do that, you need to meter both internal and external response times.

Meter Internal Response Time 

Usually, you only have control over some of the computers that the transactions you process flow through. In the example below, system B is your responsibility and you are very interested in how responsive it is.

[Figure: work flows from system A to system B; system B is the system you are responsible for]

Understanding the internal response time (when A gives you work, how rapidly do you deliver the response) helps you in three ways by giving you an alibi, an insight, and a head start.

  1. Alibi: When users are having response time problems, if you can show the response times are fine within your world, then the problem lies elsewhere.
  2. Insight: If you see response times increasing internally, but there is nothing very interesting showing up in any of the performance meters you currently collect, then you have an undiscovered problem and you need to do some more exploring.
  3. Head start: Sometimes response time problems take a while (hours, days, or even weeks) to be noticed and reported back to you or your boss. If you are paying attention to your internal response time meters, and the problem is located in your world, then you can be already working to find and fix the root cause of the problem when the boss knocks on your door.

Meter End User Response Time

It is also a good thing to test the response time as close to the user as possible. In this day and age that usually means testing outside your company across the Internet.

[Figure: the same transaction path, now with the end user reaching it across the Internet]

When testing across the Internet, the first thing you need to realize is that distance matters. If your users are spread out geographically, then those farthest away will have the worst response times. The speed of light is not infinite, and the more distance you put between you and the customer, the more delay they will experience.

You might reasonably point out that the few extra milliseconds don’t matter on a human scale, but you need to remember that, at many levels, there are back and forth conversations going on like so:

   Can I have GrandCanyon.jpg? >—>
                    <—< Here is part #1, let me know when you get it.
   Thanks, I got part #1. >—>
                     <—< Here is part #2, let me know when you get it.
   Thanks, I got part #2. >—>

Each back and forth pays the price of geographically induced delays. There was an interesting experiment the ACM did in 2009 where they repeatedly copied four gigabytes of data across the Internet, and the only thing they varied was the distance between the source and the destination computers. Below are their results that clearly show distance matters.

[Table: time to copy four gigabytes across the Internet at increasing distances between source and destination]

In testing response time close to the user, you also need to pay attention to using last mile connections that resemble what the users are using. There is a big difference in throughput, network latency, and error rates among dial-up, DSL, satellite, cable modems, fiber optic, and of course mobile cellular connections. You should look at your current user base and test your response time using those networks.

A small change in the amount of data you are sending to the user can have a big impact on response time if their network has a restricted throughput. For example, a user with a 40 kilobit/second dial-up connection will see an additional second of response time when the amount of data sent increases by just 5000 bytes. Depending on the current conditions, mobile connections can also have surprisingly bad throughput and high error rates, which further slows things down.
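The arithmetic behind that dial-up example is worth keeping handy. A minimal sketch:

```python
"""Extra response time caused by sending more bytes over a slow last mile."""

def added_seconds(extra_bytes, link_bits_per_sec):
    return extra_bytes * 8 / link_bits_per_sec

print(added_seconds(5000, 40_000))      # dial-up at 40 kbit/s: ~1.0 extra second
print(added_seconds(5000, 10_000_000))  # a 10 Mbit/s link: ~0.004 seconds
```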

Sometimes a big increase in the size of what you are delivering to the users happens to make some internal group happy, and no one notices because they all download the bits over the wicked-fast, low-latency corporate net. Then the change is rolled out to the general public and suffering ensues.

The Internet is a network of networks owned by different companies that sometimes act like petulant little children who refuse to play nice with each other. Most of that trouble is intermittent, of short duration, and completely outside of your control, but you do select a network when you choose your company’s ISP.

It can be a good thing to test response time using multiple ISPs from a given key geography. Imagine your customer service department starts getting complaints, but your internal meters all look good. Then you notice your response time tests that connect to the Internet via the Level3 network are all having troubles, but the tests connecting through ISPs using the AT&T and MCI networks are all doing fine. Clearly, this is not your problem to fix, but it is wildly useful to know what is happening.

At this point you are probably thinking that all this testing sounds impossible to set up. Fear not: there are companies whose business is exactly this. They have vast arrays of test machines all over the world, run the tests, and provide detailed results to you. When selecting a company to do this testing, be sure that they can test:

  1. From where your users are located
  2. The types of last mile connections your users use
  3. The major ISPs your users use
  4. From inside your company (extra credit)

If you are thinking “I can’t do all this,” please stop worrying. Nobody does all of this. As Teddy Roosevelt once said: “Do what you can, with what you have, where you are.”


There is more to learn about performance work in this blog and in my book
The Every Computer Performance Book, which you can find on Amazon and iTunes.

Confirm What You Are Metering

The only constant in this universe is change. Applications, operating systems, hardware, networks, and configurations can and do change on a regular basis.

It’s easy to start the right meters on the wrong system. It’s easy to miss an upgrade or a configuration change. Before any meter-gathering program settles down into its main metering loop, it should gather some basic data about where it is and what else is there. Gather things like:

  • System name and network address
  • System hardware CPU, memory, disk, etc.
  • Operating System release and configuration info
  • List of processes running
  • Application configuration info

Most of the time this data is ignored, but when weird things happen, or results suddenly stop making sense, this data can be a valuable set of clues as to what changed.