Capacity Models For New Machines

Your computing world is changing. To handle the next projected peak, the sales team has suggested you upgrade your computer to the new model, which they claim is three times faster than your old computer. How do you model a machine you have no experience with?

Regardless of the sincerity of the sales team and their dedication to truth, the claim that the new machine is three times faster is wrong in the same fundamental way as assuming your SAT scores predict your ability to write a best-selling novel. Every time you switch hardware, some parts are faster, some hardly change at all, and occasionally some parts run more slowly. How your unique application uses those parts determines how much more work the new system will handle.

So where to begin?

Start with the simple things. Do all the calculations that you can do simply and easily first. If they work out, then move on to the more detailed and complex work. If they don’t work out, then you have to rethink your answer, and you’ve just saved yourself the time you would have wasted on detailed analysis.

For example, that three times faster number they gave you is usually heavily weighted toward the CPU performance.  So capacity plan your current system for your peak load and check to see if it will “fit” into your new system.

Let’s say the next projected seasonal peak is 5X busier than a moderately busy day you metered recently. On that day the system was about 30% busy. Do the math (5 * 30% = 150%) to see that your old system would be 150% CPU busy at peak. The new machine is 3X as fast as your old machine, and it only has to be 1.5X faster to (barely) handle the load. Chances are you are good to go CPU-wise.
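
If you like to double-check that kind of arithmetic in code, here is a minimal Python sketch using the numbers from the example above (30% busy, a 5X peak, and the claimed 3X speedup). It is a back-of-the-envelope check, not a model.

    # Back-of-the-envelope CPU headroom check (numbers from the example above).
    measured_cpu_busy = 0.30   # old system on the metered day
    peak_scaling      = 5.0    # projected peak is 5X the metered day
    claimed_speedup   = 3.0    # the vendor's "three times faster" claim

    old_cpu_at_peak = measured_cpu_busy * peak_scaling    # 1.50 -> 150% busy
    new_cpu_at_peak = old_cpu_at_peak / claimed_speedup   # 0.50 -> 50% busy

    print(f"Old machine at peak: {old_cpu_at_peak:.0%} CPU busy")
    print(f"New machine at peak: {new_cpu_at_peak:.0%} CPU busy (if the 3X claim holds)")

At a projected 50% busy, the new machine clears the CPU hurdle, which matches the hand math above.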

If the numbers had been uncomfortably close (e.g., the new machine was 1.7X faster than the old one), then more testing and checking would be in order. Remember, the closer you are to the edge of a performance cliff, the more precisely you have to know your position to stay safe. If it looks like the device in question is going to be over 50% busy, consult this post on queuing theory effects to get a rough estimate of the response time penalty you will pay.

Now, dig through each part of the machine to make sure this upgrade will do the job. Do one thing at a time. Take good notes. Write your capacity report as you go.

The Hard Truth About Scaling

For any computer, application, or process you’re ever likely to encounter, the following describes the transaction path:

  1. Bits go in.
  2. Bits are transformed by the CPU.
  3. You may have to wait as bits are sent to, or requested from, local storage or some other computer.
  4. Bits go out.

It is in step three that your dreams of magical performance increases and simple scaling go to die. See Amdahl’s Law and Liebig’s Law Of The Minimum. Compared to pushing bits around in memory, waiting for data requested from local storage and other computers is tremendously slow. Also, when you upgrade a system, the time to fetch bits from local storage, or another computer, rarely keeps up with the overall speed increase the sales team promised you.
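
To make Amdahl’s point concrete, here is a tiny Amdahl-style sketch in Python. The 5ms of CPU and 15ms of waiting per transaction are invented for illustration; the point is that only the CPU slice gets the claimed 3X boost.

    # Amdahl's Law sketch: only the CPU portion of a transaction gets faster.
    # The 5ms CPU / 15ms wait split below is illustrative, not measured.
    cpu_ms      = 5.0    # time spent computing per transaction
    wait_ms     = 15.0   # time waiting on disk / other computers per transaction
    cpu_speedup = 3.0    # vendor-claimed CPU speedup

    old_tx_ms = cpu_ms + wait_ms
    new_tx_ms = cpu_ms / cpu_speedup + wait_ms

    print(f"Old transaction time: {old_tx_ms:.1f} ms")
    print(f"New transaction time: {new_tx_ms:.1f} ms")
    print(f"Overall speedup: {old_tx_ms / new_tx_ms:.2f}X, not {cpu_speedup:.0f}X")

With those made-up numbers, a 3X faster CPU buys you only about a 1.2X faster transaction.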

For example, if a process needs to read one record from disk for every transaction, then that IO may well be the biggest throughput limit. Even when you upgrade to a faster CPU, the disk runs at about the same speed, and so the transaction duration does not scale well, as you’ll see below.

So, to handle 5X the load with your new machine you may need to add more processes. Any given process can only do so many transactions per second, and that number may not scale up to match the overall speed increase the salesperson claimed for the reasons outlined above. Let’s work through an example.

A Trick For Estimating Process Throughput

Many of the applications I’ve worked with had the ability to dynamically add new processes if the incoming workload required it. A trick I’ve used to find the maximum throughput of a process is to start fewer than are normally required and then wait for the user workload to build as the day progresses. I’d watch closely for signs that the transactions were backing up and, when I felt I’d hit the maximum throughput, I’d start the regular number of processes. Lastly, I’d do a bit of simple math to calculate the throughput of a process (there’s a code sketch of this math after the bullets below).

  • With 2 processes I hit max throughput at 100 TX/sec. That gives me: 100 / 2 = 50 TX/sec per process and I know that each transaction takes about 20ms total time as 1000ms / 50 = 20ms
  • During testing each process used ~150ms/sec of CPU. That gives me: 150ms/50TX= 3ms CPU/transaction
  • The CPU on the new machine is three times as fast. That gives me: 3ms / 3 = 1ms of CPU estimated per transaction on the new machine
  • Each transaction on the new machine will spend 2ms less time computing so the average transaction time will be 20ms – 2ms = 18ms
  • So that works out to a max throughput of ~55 transactions per second as 1000ms / 18ms = 55.5555
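
Here is that same arithmetic as a small Python sketch, using the numbers from the bullets above. The 3X CPU speedup is the vendor’s claim, so treat the result as an estimate, not a promise.

    # Estimate per-process throughput on the new machine (numbers from the bullets above).
    max_throughput_tx_sec = 100.0   # observed max throughput with 2 processes
    process_count         = 2
    cpu_ms_per_sec        = 150.0   # CPU each process burned per second during the test
    cpu_speedup           = 3.0     # vendor's claim for the new machine

    tx_per_process     = max_throughput_tx_sec / process_count              # 50 TX/sec
    tx_time_ms         = 1000.0 / tx_per_process                            # 20 ms per transaction
    cpu_ms_per_tx      = cpu_ms_per_sec / tx_per_process                    # 3 ms CPU per transaction
    new_cpu_ms_per_tx  = cpu_ms_per_tx / cpu_speedup                        # 1 ms on the new machine
    new_tx_time_ms     = tx_time_ms - (cpu_ms_per_tx - new_cpu_ms_per_tx)   # 18 ms
    new_tx_per_process = 1000.0 / new_tx_time_ms                            # ~55 TX/sec

    print(f"Old machine: {tx_per_process:.0f} TX/sec per process")
    print(f"New machine: {new_tx_per_process:.1f} TX/sec per process (estimate)")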

It can be tempting to display lots of decimal places in the numbers you come up with as that gives the illusion of precision. However, the numbers you started with are typically not all that precise. Furthermore, if the future of your company hangs on a tenth of a transaction per second then you are cutting it way too close for anyone’s comfort.

So, on the old machine, each process could handle 50TX/sec, and on the new machine each process can theoretically handle 55TX/sec. Now you see why you’ll need more processes to handle the load even though the machine is much faster.

Communications

Just like waiting for bits from local storage, waiting for bits from another computer can take up a big chunk of the overall transaction response time.

You can do the same basic trick that we just did with local storage to find the max throughput of a given key process. When doing this work, make sure to look out for comm errors. You can’t eliminate all comm errors, especially if the Internet is involved, but keep an eye on them as you gather your data. A significant increase in comm errors while you are gathering your data can have a big effect on throughput.

Look at the communications capacity to see if it can handle the projected peak load, which is 5X the traffic that your old system handled on a moderately busy day. Also be sure there is room for this increased traffic on whatever parts of the corporate network these packets flow through.

Local Storage

At the time of this writing, local storage, typically rotating magnetic disks, is the slowest part of any system and the most likely thing to bottleneck.

When upgrading, think of each disk not only as storage space, but as a device that can only give you a finite number of IOs per second. A new 2TB disk that can perform 200 IOs/sec is not the same as four older 500GB disks that can each perform 150 IOs/sec, because together the four older disks can perform 600 IOs/sec.

A single process waiting for disk IO will notice a speed improvement when it is using the faster 2TB disk, but all the processes doing IO will overwhelm the 2TB disk long before they overwhelm the four 500GB disks.
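
A minimal sketch of that IO math, using the disk numbers from above; the 250 TX/sec projected peak is made up for illustration.

    # Aggregate IO throughput: one new big disk vs. four older small ones (numbers from above).
    new_disk_iops  = 200      # the single 2TB disk
    old_disk_iops  = 150      # each older 500GB disk in the example above
    old_disk_count = 4

    print(f"One new 2TB disk: {new_disk_iops} IO/sec")
    print(f"Four older disks: {old_disk_count * old_disk_iops} IO/sec")

    # If every transaction does one disk read, the disk IO/sec ceiling caps throughput.
    # The 250 TX/sec projected peak below is illustrative, not from the text.
    peak_tx_per_sec = 250
    for label, iops in (("new 2TB disk", new_disk_iops),
                        ("four older disks", old_disk_count * old_disk_iops)):
        verdict = "can" if iops >= peak_tx_per_sec else "can NOT"
        print(f"The {label} {verdict} sustain {peak_tx_per_sec} TX/sec of one-read transactions")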

When moving files to the disks of the new system, remember that the size of the file tells you nothing about how many IOs per second the application does to it. If your operating system gives you per-file IO data, then use it to balance the IO load among your disks. If there are no per-file meters, then you need to have a chat with the programmers and take your best guess as to how much IO is going to each file. Once you decide on a plan, move only a small number of files at a time and see how each move goes.

The other thing to consider when balancing disk IO is the IO load of periodically scheduled jobs like backups, overnight processing, end-of-month reports, etc. I’ve seen systems that were rebalanced nicely for production that became a nightmare when these background jobs were run. These background jobs don’t care about response time, but they do have to finish in a certain window of time. Bad things happen when these jobs linger into the daytime and mess up the live user response times. I’ve never seen a perfect solution to this problem, so favor the live users and balance the best you can.

Memory

Check to see that the memory that comes with the new system is at least as big as the old system. In general, if your old computer had enough memory for a moderately busy day, then it will be fine at peak load. However, I’ve seen a couple of cases where the memory usage scaled up and down with the load, so look at the memory usage over time on your old system and see how it changes over the day, and from day to day. Memory upgrades are sold in such huge chunks that you don’t have to be very precise in estimating memory needs. If you need more, the next step up will be a huge improvement.


For more info on performance modeling see: The Every Computer Performance Book at  Amazon, Powell’s Books, and on iTunes.



Don’t Get Carried Away

No matter how good you get at capacity planning, don’t forget that it is just a very detailed guess about the future, so try not to overpromise like the gentlemen below.

Source: “Fantastic Four Giant” #5. Invented by Stan Lee and Jack Kirby


Capacity Planning: It’s Not JUST About The Peak

Even though capacity planners naturally tend to focus on the peak minute of the peak hour of the peak day, remember that most complex commercial systems have other stresses in their lives. There are additional things to factor in. For example:

  • Off-peak downtimes where some fraction of the hardware is offline and the total load is carried by what remains
  • Backups
  • Nightly reports and other batch jobs
  • Calendar-driven activity, like any special end-of-week, end-of-month, end-of-quarter, or end-of-year work that has to be done
  • How the available computing power is degraded during site upgrades
  • What happens when a part of your computing world briefly goes offline and then the backed up work surges in like a tsunami

Man vs. Machine

It’s usually easy to spot machine-generated loads in performance data.

Above you see the classic performance signature of machine driven work, a nearly instant-on load with no normal change in intensity as the day progresses. Also, the work runs through the system at a relentless pace because there is no think time.

Sometimes a well thought out solution that works during a peak day kills you at night, kills you on the weekend, or kills you at the end of the month or quarter. Don’t just meter the peaks. Meter all year, and notice when these timed or special events happen. The boss may decide to include these special events in your capacity planning, or they may just decide to take the chance that any special event will happen during a conveniently low demand time. Different businesses are willing to make different tradeoffs between money spent and risks mitigated.

The Human Response

When you present your capacity plan be prepared for at least one round of adjustments. Even with the safety margin and max utilization values built into the calculations, some managers will still be uncomfortable with your results.

Sometimes you are too close to a limit, and the boss will have you add resources to the plan. Sometimes a resource is seen as “too idle”, and that will bother them. Sometimes your results do not support their empire building goals or cost savings targets, and you’ll be asked to change the plan.

Adjustments can be made, but do what you can to make sure the plan sticks to the truth. If your boss tries to force you to lie or profoundly fudge the numbers or “reframe the truth”, do your best to resist.

This post started as an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.

When To Do Capacity Planning

Every organization is different. Some are nimble, some ponderous and slow, some have money to burn, and some won’t see new hardware for years. You need to factor these realities into your plans as capacity planning is only helpful when it is done with enough lead time to fix the discovered future bottlenecks. Some things to consider:

  • There are times when money is available in your budget and usually a time by which you have to spend it or lose it. Any purchases you plan for have to hit this window.
  • How long does it typically take your company to complete a major purchase once the decision has been made to buy? A new computer that arrives two weeks after the seasonal peak doesn’t do you any good.
  • How fast can your vendor build, deliver and install the thing you need? Some things you can have tomorrow; some things take months to build and deliver.
  • Do you want to wait for the newest/fastest machine just on the horizon or go with last year’s model now?
  • Are there any company-wide spending edicts in place?
  • When is the end of the quarter for key vendors? Buying hardware at the end of the quarter (especially the fourth quarter) puts you in a better bargaining position.

With time all things are possible, so the best advice is to start as soon as you can.

Capacity Planning For Tough Times

There are two situations where you create capacity plans that assume you have fewer computing resources than you have now: disaster recovery and budget cutting. In both cases, the first thing that needs to be done is to make some hard business decisions about suffering and money.

Suffering

Is your company willing to let the users feel a little response time pain if a key system or device fails in order to save some money? Every company I’ve ever worked for had a different answer, and that answer changed depending on the market they lived in and their current financial situation.

As a capacity planner, you are looking for a clear statement about response time and throughput goals such as:

  • If X fails, our customers shouldn’t feel a thing even if it happens on the busiest day of the year.
  • If X fails, our customers shouldn’t feel a thing on an average day. We are willing to gamble that this will not happen on the busiest day of the year.

In the above bullets X is the thing you are considering doing without and could be as small as a comm line or as large as a whole datacenter.

Money

At the end of the day, it is all about money. If you are asked to capacity plan for less hardware, and the chief consideration is money, then you need to know:

  • How much money is the company trying to save?
  • Where are they looking for savings: hardware cost, software licensing, facilities, etc.?
  • What are the savings per unit of hardware, software, floor space, etc.?

In many ways this is like playing the game Monopoly™ when you are low on cash. Suddenly you need to pay rent, so you have to return your buildings to the bank for the cash you need. You will select those buildings based on what they are worth and what they could make in rent as the game goes forward. The same is true in capacity planning. If the boss says he needs to save $200,000 in costs, then you need to know what eliminating a given machine will save you.

Dealing With Disaster

When planning for disaster, decide what part(s) of your world will be suddenly unavailable, and then mathematically shift the load to the still-working parts of your world.

For example, your overall projected peak load is 150 transactions per second (TPS), and you have three front end machines that each take 1/3 of that load. To capacity plan for handling that peak load with one of the front end machines down, just divide the peak load by two machines rather than three. So at peak the two remaining machines will each need to handle 150TPS / 2 = 75TPS.

On a normal, non-peak day, each front end machine handles about 25TPS during the busiest part of the day, and the measured utilization of system resources is what you are going to use to scale up the load. To get the scaling factor you need for your capacity plan, divide your per-machine projected peak load by the measured load.

If at peak everything is working normally, each front end machine will be handling 150TPS / 3 = 50TPS, and thus it will be doing twice the work (50TPS / 25TPS = 2X) of your metered day.

If at peak one front end machine emits a puff of greasy black smoke and dies, then the remaining two machines will each be handling 150TPS / 2 = 75TPS, and thus they will each be doing three times the work (75TPS / 25TPS = 3X) of your metered day.
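
If you want to run this failover math in code as you try different scenarios, here is a minimal sketch using the numbers from this example:

    # Failover load shift (numbers from the example above).
    projected_peak_tps  = 150.0
    front_end_machines  = 3
    metered_tps_per_box = 25.0    # per-machine load on the normal, non-peak metered day

    normal_peak_per_box = projected_peak_tps / front_end_machines         # 50 TPS
    failover_per_box    = projected_peak_tps / (front_end_machines - 1)   # 75 TPS

    print(f"All machines up:  {normal_peak_per_box:.0f} TPS each "
          f"({normal_peak_per_box / metered_tps_per_box:.0f}X the metered day)")
    print(f"One machine down: {failover_per_box:.0f} TPS each "
          f"({failover_per_box / metered_tps_per_box:.0f}X the metered day)")

Swap in your own TPS numbers and machine counts to work through the scenarios your boss cares about.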

When doing this work, you can also scale the disaster up to the datacenter level (What if a hurricane takes out our North Carolina datacenter?) or down to a single communications line. You just have to figure out, if X breaks, what fraction of the pre-failure load the remaining hardware will have to pick up. Then add that to the projected peak for which you are capacity planning and do the math.

There are as many disaster scenarios as there are things that can break. You might get overwhelmed thinking of all the possible combinations, but remember, you don’t have to plan for all scenarios, just the worst ones. If you need to plan for various troubles that will leave System X carrying 110%, or 250%, or 300% of the metered load, do the math for the worst case first. If the system can handle 300% of the metered load, it can clearly handle 250% or 110% easily.

Budget Cuts

When looking to reduce equipment costs, you have to figure out how to do more with less. It is almost never the case that there is some easily applied magic fix. Instead, typically, workload has to be moved to fewer systems, which entails some unforeseen consequences and some risk to the stability and availability of your computing world. It can be done, but there are some things to which you need to pay attention:

  • Computers are not interchangeable homes for processes as they have different hardware that can support different versions of operating systems. Clearly applications designed for the X operating system won’t run on the Y operating system. That’s obvious. What is often missed is that there can be compatibility problems between version 10.1 and 10.2 of operating system X. Sometimes the hardware you plan to keep can only run 10.2, but the third-party code you depend on has a nasty bug in their 10.2 release and so you need to hold at 10.1.
  • Computers need to connect to the world, and some of them may require specific hardware that is only available on specific machines and versions of operating systems.
  • There are always the issues posed by who “owns” each machine and how that machine is accounted for in the budget. Even if you save the company $50,000, if that money is not in the right part of the budget, it doesn’t solve your problem.
  • There may be legal limits preventing you from having unencrypted data on certain networks or security constraints that prevent certain people with admin-level privileges from access to certain systems.
  • Whatever plan you come up with can’t screw up your disaster recovery plans.
  • If you’re planning to turn off some piece of hardware, then you need to account for everything going through that machine. Frequently, there is more interconnectivity and dependency than you see at first.
  • Processes have all sorts of connections, communication paths, and shared resources. Some of these run much slower when accessed across the net vs. locally. Some can only work if the processes are on the same machine.
  • Whatever files or databases you move will need storage space and a sensible way to back them up.
  • Speaking of files, it is amazing how many files a process needs, with some of them only accessed on special occasions. When you change the file structure it is not at all unusual to find critical files that have been completely forgotten by the local experts. Also, changes to directory and file access permissions (read, write, execute) can cause trouble. Plan to spend considerable time hunting files, finding connections, and debugging the results of the move.
  • When moving a process between machines with different CPU speeds, if the two machines are made by the same vendor, they can usually give you a reasonable number to scale the CPU utilization. If moving to a different vendor’s machine, you might just have to do your own testing.

Doing the work of moving workload, processes, files, and networks is a deeply detail-oriented, complex task. Start by creating your plan with what you know, then spend time checking all the things on this list and whatever else you think of. Most of your plan will work, some of it won’t. Adjust and recheck all your assumptions. Repeat this plan/recheck process until it all seems to work while keeping your boss in the loop to be sure the money saved is sufficient and the politics are working out as well.

This post was built on an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.

Capacity Planning Guarantees You Can’t Make

Capacity planning should never be sold as a guarantee that all will be well at the next peak. No matter how good a performance person you are, you can’t offer that guarantee. Why? You cannot prove a negative. For example, you can’t prove there are no monsters waiting to get you while you sleep because no matter how carefully you check, you might overlook some spot (like the closet) where they are hiding.

Liebig’s Law clearly shows that even a small and obscure part of the transaction path can become a major bottleneck if given enough work to do.

Capacity planning is more like a pre-trip checklist to ensure you have what you need, and all systems on this list are good-to-go. Invariably, you will go on that trip, and somewhere along the way you’ll discover you forgot X, don’t have enough Y, and for the first time ever you need Z. That’s all bad news, but remember that your capacity planning effort found bottlenecks that would have limited your throughput even more.

Even if you use load testing to add to your confidence, no load test is perfect, so you still can’t honestly guarantee a trouble-free peak.

So, do capacity planning to the best of your ability with the things you do know about, load test to the best of your abilities, but make no absolute promises. If you get caught short on some resource, take the time before the next big peak to learn about that resource and to do a more complete plan next time. Unless this is the last peak before you retire, you need to think long-term.

The Every Computer Performance Book

This short, occasionally funny, book covers Performance Monitoring, Capacity Planning, Load Testing, Performance Modeling and gives advice on how to get help and present your results effectively.

It works for any application running on any collection of computers you have. It teaches you how to discover more about your meters than the documentation reveals. It only requires the simplest math on your part, yet it allows you to easily use fairly advanced techniques. It is relentlessly practical, buzzword free, and written in a conversational style.

Most of the entries in this blog begin with what I put in the book. The book is available from Amazon in paperback and from Apple in iBook format. Both are priced at ~$9 USD. Why so cheap? Because I retired early (mostly due to my computer performance work) and so I wanted to give back what I learned in the hopes that the next generation can do the same.


Hints On Metering For Capacity Planning

Capacity Planning is projecting your current situation into a somewhat busier future. To do this work you need to know what is being consumed under a relatively stable load, and then you scale that number to the projected load. So if some component of your computing world is 60% busy under the current load, it most likely will be 120% busy (and thus a serious bottleneck) under 2X the current load.


To meter for capacity planning you should collect data from every part of your computing world that shows:

  • The current workload (transaction rate)
  • How busy it is (utilization)
  • How full it is (size, configuration or capacity limits)

Essentially you are looking to see how close you are to hitting any limit to growth, no matter how insignificant.

If there are well understood and closely watched workload meters for your system (e.g., “Currently we are handling 5,500 transactions per minute”) then collect them. You can do capacity planning without a workload meter as, in most cases I’ve worked on, the boss asked for a general scaling of the current load like: “Can we handle twice our daily peak load?”

It is best to meter when the system is under a stable load, one where the load is not changing rapidly, because then you can get several samples to be sure you are not seeing some odd things that are not connected to the overall load. Below we see some samples where the load is stable over time, but there are things to notice here.

[Sample data: transaction rate, CPU busy, and disk busy over a series of samples]

Ignoring the sample at 12:15 for the moment, notice that the overall load is stable. It will never be perfectly stable, where each sample gives you exactly the same numbers. Some variation will always happen. Values plus or minus 10% are fine.

If the overall load is never stable, then pick some metered value, like X TX/min, and look at all your samples for any sample you collect that shows X TX/min (± 10%) and see if the other numbers you are tracking, like disk busy and CPU, are stable in relation to it.

In the 12:15 sample the CPU usage essentially doubles even though the disk busy and the TX/min number are stable. This is either some oddball thing that happened and is not normal, or perhaps at 12:15 every day this happens, or perhaps this happens 10 times a day at somewhat random times. Either set yourself to solving this mystery, or if you are sure this is a non-repeating event, ignore the 12:15 data.
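
Here is a minimal sketch of that sample-filtering idea in Python. The sample data below is made up, but shaped like the table discussed above, including the odd 12:15 reading.

    # Pick samples near a target load (within 10%) and check the other meters for stability.
    # The sample data below is illustrative, not from the text.
    samples = [
        # (time,    tx_per_min, cpu_busy, disk_busy)
        ("12:00",   5500,       0.31,     0.42),
        ("12:05",   5450,       0.30,     0.41),
        ("12:10",   5600,       0.32,     0.43),
        ("12:15",   5525,       0.61,     0.42),   # CPU doubles at the same load -- investigate
        ("12:20",   5480,       0.30,     0.40),
    ]

    target_tx   = 5500
    near_target = [s for s in samples if abs(s[1] - target_tx) <= 0.10 * target_tx]

    cpu_values  = [s[2] for s in near_target]
    typical_cpu = sorted(cpu_values)[len(cpu_values) // 2]   # median is robust to one outlier

    for time, tx, cpu, disk in near_target:
        flag = "  <-- investigate or exclude" if abs(cpu - typical_cpu) > 0.10 else ""
        print(f"{time}  {tx} TX/min  CPU {cpu:.0%}  disk {disk:.0%}{flag}")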

This is why you don’t just meter for 20 minutes on one day when doing capacity planning. You’ll look at lots of data, over many days, and try to come up with a reasonable estimate of how busy everything is under a given load. If there are days when unique demands, not related to load, are placed on your computing world, be sure to meter through those times, too. The whole reason for capacity planning is typically to ensure you have enough resources to handle a peak load, even if it arrives on a day when the system has other things to do, too.

A common mistake people make is to look at the CPU consumption of a process and use that to estimate how much more work it could handle. They might notice that a process is only consuming six seconds of CPU per minute and thus deduce that the process is only 10% busy (6 / 60 = 0.10), but that’s usually wrong. Processes do other things besides burn CPU. They wait for disk IO, they wait for other processes to return an answer they need, they wait for resource locks, etc. A process can be doing all the work it can and still burn only a small fraction of the CPU it could if it were just crunching numbers. Gather the per-process CPU consumption, but don’t mistake it for a straightforward way to gauge how busy a process is.

When metering for capacity planning, gather any data you can for limits that can be hit. Notice things that kill performance, or sometimes whole applications, when they get full, empty, or reach some magic number. Any limit you hit will hurt.


Some examples of limits your computing world might hit: disk space, available memory, paging space, licensing limits on the application, number of server processes, maximum queue depth, networking port limits, database limits.

Also consider the limits that per-user or per-directory security and quotas bring to the party. Look for limits everywhere and record how close you are to them.

This is an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.

The Four Numbers of Capacity Planning

Capacity planning is projecting your computing world into a somewhat busier future to see if you have enough of everything to get the job done. Here I share a practical approach for doing capacity planning that boils all the performance data you have for any resource down to just four numbers and a simple calculation.

Doing capacity planning for any computing resource that gives you a utilization is essentially multiplying three numbers together (Utilization * Scaling factor * Safety Margin) and then comparing the result with your maximum utilization for the given resource. If the calculated utilization is greater than the maximum utilization, then you’ve found a future bottleneck. The number of times you will repeat this calculation and comparison depends on the size and complexity of your computing environment. What follows are a few hints on how to do this right.

Utilization

Step one of capacity planning is to find a time that you want to base your capacity plan on. Pick a time when the users are sending your computing world a moderate, stable load and the users are happy with your overall response time and throughput. Your daily peak load is often a good place to start. Typically that time is decided by watching the system behavior over several days, or weeks, and then selecting some reasonably busy time. Usually, you start by looking at a key system’s CPU consumption as it is the one resource that all transactions consume. Let’s look at some data and pick a number.

[Graph: CPU busy on System X, sampled every 5 minutes over Oct 24th and 25th]

In the above graph you see the CPU busy data from System X averaged over one minute and sampled every 5 minutes for most of Oct 24th and 25th. Hint: For capacity planning you don’t want a long average (ex: data averaged over fifteen minutes) as that can hide small but significant surges in demand.

All transactions pass through this computer. The first thing that jumps out at you is the sharp 90%+ busy peak in the middle of the night. A couple of good questions to ask are:

  • Is that normal processing or did something go haywire?
  • If normal, how often does this happen?
  • Does this ever happen during the daily peak user load?
  • Are we going to tolerate the response time increases this late-night job creates for about 30 minutes, or do we need to capacity plan for this too?

If this is a background job that runs in the middle of the night when nobody cares about response time, then you can ignore it. If the boss cares about response time 24×7, then you have to plan for this, too.

If this peak is the result of some event that suddenly and rapidly dumps work into the system (e.g. a computer coming back online after a communications disruption) then you need to plan for this. You might still go ahead and build your plan with the normal transaction load, but you’ll need to mention the possibility of this spike happening at the worst possible moment in your written capacity plan.

For the moment, let’s ignore the early morning peak, and change the scale on the Y-axis so we can see the data we really care about more clearly.

[Graph: the same two days of CPU busy data, with the Y-axis capped at 50%]

Now that we’ve trimmed the Y-axis to a max of 50% busy, the next question is which day’s peak to use. The daily peaks on the 25th are a bit higher and somewhat more consistent than the daily peaks on the 24th, so let’s focus on those peaks and redraw the chart to only show that one day.

Below we can see a sustained 45 minute long peak of 27% busy and a maximum recorded value of 29% busy.

[Graph: Oct 25th CPU busy data, showing the sustained 27% peak and the 29% maximum]

The period where we see the sustained peak, holding steady through multiple samples, is a good place to start. It’s good to have multiple adjacent samples in agreement because it gives you confidence that the overall system was at a steady state.

The highest value recorded that day was 29% busy, but you’ll notice it is not a sustained peak. Someone may look at that chart and say you have to capacity plan based on the busiest moment. Here it makes little difference as 27% and 29% are not that far apart. However, sometimes there are bigger differences, and sometimes the person insisting on using the highest values is your boss. If your boss is adamant, then go with it. Why fight over a small difference? Pick your battles.

So now you’ve got two things: a time range for a steady state peak and a value for CPU busy (27%) on this system.

Before you declare victory and go out for a long lunch, look at the metering data from the other systems and devices in your computing world and see if they are showing a peak at about this time with about the same shape. It is entirely possible that the other systems in your computing world did not see this sustained peak because it was caused by a little program someone ran unbeknownst to you on this system.

If you want to build a capacity plan based on this observed peak, it should also show up throughout your computing world during the same time and with the same magnitude. The meters should make sense to you at all times, not just when it is convenient.

The Scaling Factor

The scaling factor is a number that represents how much busier the future is anticipated to be when compared with the time you sampled. Sometimes it is based in fact (e.g. you just bought a competitor and you know how much business they will bring), and sometimes it is a guess pulled right out of thin air. To get the scaling factor:

  • Collect performance data from your computing world
  • Pick a time when the load is moderate and level
  • Show your graphs to the key players
  • Have them give you the scaling factor

The scaling factor they pick will typically be an SRN (Suspiciously Round Number) and, in my experience, it is never exactly right, but it is often close. Humans are an amazing species. If you are tasked with making this guess, I recommend using the Delphi Technique as that can be quite helpful in getting an unbiased group opinion.

Capacity planning assumes that the resource demands of most applications scale linearly because, within normal boundaries, they do. To be precise, 99% of the computer programs I’ve seen increase their resource consumption in direct proportion to their throughput. If you push twice the work through an application, it will consume close to twice the computing resources such as: CPU cycles, disk IOs, packets sent/received. The exceptions to this rule are:

  • When the system or application is starting up. At startup, files have to be opened, programs paged in, caches filled, and initializations performed. Unless you are studying ways to recover faster after a restart/crash, ignore the meters during this time.
  • When the application is hopelessly bottlenecked. Algorithms designed to manage about ten things in a queue don’t work well when they suddenly have a billion things in the queue.
  • When errors are happening. They cause retries and retransmissions, poorly tested and inefficient error handling routines to run, processes to crash and restart, and general suffering.

When capacity planning, none of the above apply. You don’t capacity plan a system reboot or an application restart, you don’t plan to have the peak load experience a bottleneck, and you can’t plan for all possible errors. However, you can create a capacity plan that takes into account the load being shifted to the remaining systems when a system fails. If three devices are carrying the user load and each device is 30% busy, then when one device fails, the other two will each pick up half of the failed device’s load and so should each be about 45% busy.

Convert any scaling factor they give you into a multiple of one. So, for example, “Plan for a 30% increase” becomes 1.3, “Twice the load you metered” becomes 2.0.

The Safety Margin

Every company has a level of corporate courage and a certain aversion to pain. These attributes are shaped by their people, their culture, how much money they have to spend, and by their recent disasters. It is the rare company that likes to hang by a fingernail on the edge of disaster. Only a fool will insist upon planning for a peak load where every resource is at its maximum utilization as there is no margin for error. Most people feel better, and make better decisions, when they have a margin of safety. Besides, capacity planning is not a perfect science:

  • The workload mix and intensity are always changing and are somewhat difficult to precisely predict. They are often influenced by such unpredictable forces as the weather, the economy, and your competitors.
  • Strange things happen in big companies, and sometimes the demands of other parts of the business on shared resources can change without warning and not in your favor.
  • Software upgrades, network changes, equipment swap-outs, and configuration adjustments that happen between the plan and the actual peak can alter performance.
  • Capacity planning looks only at the question of “enough.” It can give you no hint about the response time changes you will see at the projected peak load. The closer you are to the limit for a given resource, the uglier the response time consequences will be if even a little more work shows up than you planned on.

Do not let the person that gave you the scaling factor tell you that it includes the safety margin. You need to keep these two values separate. The scaling factor is your estimate of the future load; the safety margin is how sure you are about that estimate.

The safety margin inflates your projected utilizations by the percentage that you choose. Most companies I’ve worked with have chosen a safety margin value between 10% and 50%, with the most common values in the 20% range. When using this in capacity planning, convert it into a multiple of one. A 20% safety margin becomes 1.2.

Max Utilization

For resources that provide a service there is a utilization beyond which the delays caused by queuing effects become too painful. Your company may have pre-defined corporate-wide guidelines for max utilization for given resources. If so, use them. If not, then you’ll be choosing a number between 50% and 80-90%. Here is some guidance to help you choose.

How Busy Is Too Busy?

According to queueing theory, any device (service center) that is 50% busy will have an average response time of twice the service time as there is on average one job to be processed before you get your turn to use the device. The slower the device is, relative to the other things in your computing world, the more painful this math becomes. In the late 20th and early 21st centuries computers used spinning magnetic disks for long-term storage. Disks were the slowest part of any computer system by several orders of magnitude and thus, the rule of thumb was to keep the utilization of a disk below 50%. If we switch to solid state disks, their vastly lower service time, which translates into a lower wait time and thus a lower response time for a similarly busy device, would justify picking a higher max utilization number.
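
If you want to see how sharply that curve bends, here is a rough sketch using the classic single-queue, single-server approximation (response time = service time / (1 - utilization)). The 5ms service time is made up; the shape of the curve is the point.

    # Rough M/M/1 response-time inflation: R = S / (1 - U).
    # The 5 ms service time is illustrative; real devices will differ.
    service_time_ms = 5.0
    for utilization in (0.10, 0.50, 0.80, 0.90, 0.95):
        response_ms = service_time_ms / (1.0 - utilization)
        print(f"{utilization:.0%} busy -> {response_ms:5.1f} ms "
              f"({response_ms / service_time_ms:.1f}X the service time)")

At 50% busy the response time is twice the service time, just as described above, and it climbs steeply from there.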

When picking the max utilization for a device, there is always the temptation to shove money into the discussion with comments like: Those disks were very expensive, and now you are telling me I can only use X% of their capacity. Your reply to incredulous comments like that should point out, in a gentle way, that the cost of running some device at 100% busy is remarkably bad response times for the customers – see queuing theory.

Device exclusivity plays a big part in the number you pick for max utilization as well. When a process needs CPU, any one of the multiple CPUs in the system will do. If you can go to many places to get serviced, then the odds are good that one of them will be free, and that will hold down response time as the utilization climbs toward 100%.

On the other hand, if the data you need is on one device (e.g. reading a specific record on a specific disk), then you can only go to that device, and the response time becomes ugly at a much lower average utilization.

To pick a max utilization number for a given device, start by doing your homework; read the manuals, search the web, and then talk to the vendor. If you are laughing at my suggestion to start by reading rather than calling, remember that when you call, you will end up talking either to people who don’t know (so they bluff and bluster) or to people who do know. If you start as an informed person, you can quickly identify and disregard the people who don’t know. If you luck out and get to talk to someone who does know, they are more likely to give you a rich and complete answer because it is clear that you’ve done your homework.

When calling, I’ve had the best luck starting with the technical sales people who are assigned to your account, and then people in the professional services or customer service group. Also remember that questions of max utilization have direct impact on how many of these devices your company will buy. People may be very cautious in their responses because nobody working for a vendor wants to screw up a potential sale.

You can also experimentally select a max utilization value of some resource through testing by adding load to the resource until you see the response time start to grow unacceptably. The number you get should be between 50% (for really slow resources) and 80-90% for the speediest resources where the incoming work also has many service centers to choose from.

How Busy Is Too Busy For A Process

A process executes code until it has to wait for something like a reply from another process, a lock, or an IO to complete. CPU consumption for a process does not tell the whole story of how busy it is. Here are some guidelines to help you figure that out:

  • With rare exception, a process that is 100% CPU busy (burns one second of CPU per second) can’t do anything more for you and has most likely hit a bug and is doing nothing useful.
  • A common performance analysis mistake is to assume that a process consuming only a small amount of CPU couldn’t possibly be the bottleneck. Processes wait for things, and when they wait, they consume no CPU.
  • If the software (and the hardware it runs on) has remained unchanged since the last peak load, you can take the per process CPU utilization data from that peak and feel pretty sure that process can consume at least that much CPU at the next peak.
  • Some applications have a dynamically tunable number of processes doing a given task. Often they have the same name (e.g. FE01, FE02, FE03…) with a number appended to it. If the load is spread evenly, you only need to study one example of each group. You can create an artificial peak load for those processes by reducing their number and letting the incoming workload overwhelm them for a minute, or two, while you gather some metering data. Then start additional processes to return things to normal. Note: This is not a perfect test, but it is better than nothing.
  • If you can see the queue of incoming requests for a given process and that queue is never empty, it’s clear that the process is working about as fast as it can, regardless of how little CPU it consumes. However, that process may not be the root cause of the bottleneck. In the example below, for every transaction Process X works on, it has to ask the somewhat slower Process Y for a reply before proceeding. Process Y is on a different machine, and that is why you can see Process X bottleneck and backup even though there are plenty of resources on System X.

[Diagram: Process X on System X must wait for replies from the slower Process Y on System Y]

  • Use your common sense. If you are planning for a peak that is ten times the load you are currently measuring, and the process in question is already consuming 0.2 seconds of CPU/second, then this process would need 10 * 0.2 = 2 seconds of CPU/second, which is impossible (see the sketch after this list).
  • Some processes are involved in the main transaction path, and some come into play less often. Focus your efforts on the main transaction path processes. How do you identify them? The processes that consume most of the resources and whose consumption rises and falls as the load does are the processes you want to study. They are typically a small subset of all the processes running.
  • If none of the above suggestions work for you, then you either have to do some load testing or make a good faith estimate. When estimating, include others in the process and use the Delphi Technique.
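
Here is a quick sanity-check sketch for that common-sense bullet, using the 10X and 0.2 seconds of CPU/second figures from it. The “at least” matters because processes also spend time waiting, so the real number of processes you need is probably higher.

    # Minimum process count sanity check. A single process can burn at most about
    # one second of CPU per second, so scale up the measured CPU and divide.
    import math

    measured_cpu_sec_per_sec = 0.2    # what one process burns today
    load_scaling             = 10.0   # peak is 10X the measured load

    projected_cpu_sec_per_sec = measured_cpu_sec_per_sec * load_scaling   # 2.0 -- impossible for one process
    min_processes = math.ceil(projected_cpu_sec_per_sec)                  # at least 2, likely more

    print(f"One process would need {projected_cpu_sec_per_sec:.1f} CPU sec/sec at peak")
    print(f"So you need at least {min_processes} processes (more, since processes also wait)")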

When you have a max utilization (expressed as a number from zero to one) for all your resources then you are ready to do some capacity planning.

Doing The Math of Capacity Planning

Now you are ready to scale up to the projected peak load. The formula is straightforward:

Utilization * Scaling Factor * Safety Margin = Projected Peak

Imagine you have a resource that is 40% busy (utilization is 0.4), and you need to plan for a peak load that is 50% larger than what you are seeing now (scaling factor is 1.5), and you are reasonably sure of your projected peak load within about 10% (safety margin is 1.1).

Doing the math you get 0.4 * 1.5 * 1.1 = 0.66 and now you know this resource will be 66% busy at the projected peak load. You’ve determined the max utilization for this resource is 75%, and so you feel reasonably sure that this resource will not be a bottleneck at your projected peak. Now do that calculation for all your other resources.
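
This calculation is easy to wrap in a few lines of code so you can run it across every resource in your plan. A minimal sketch follows; the System X CPU numbers are from the example above, and the disk line is invented just to show what a future bottleneck looks like.

    # Four numbers of capacity planning: utilization * scaling factor * safety margin,
    # compared against the max utilization for each resource. The resource list is illustrative.
    resources = [
        # (name,         utilization, scaling_factor, safety_margin, max_utilization)
        ("System X CPU",  0.40,        1.5,            1.1,           0.75),
        ("Disk group A",  0.35,        1.5,            1.1,           0.50),
    ]

    for name, util, scale, safety, max_util in resources:
        projected = util * scale * safety
        verdict = "OK" if projected <= max_util else "FUTURE BOTTLENECK"
        print(f"{name:15s} projected {projected:.0%} vs max {max_util:.0%} -> {verdict}")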

When presenting capacity planning results, you can quickly overwhelm the audience with numbers and, despite their keen interest in your results, they will stop listening, reading, and caring after just a few sets of this data. To help them absorb the results, show the data graphically and use the same colors, line size, and wording in each chart for the measured, projected, and maximum utilizations.

[Chart: measured utilization, projected peak utilization, and max utilization for a resource]

That way a quick glance is all that is needed to know if there is a problem at the projected peak and how close to the max utilization this thing is. Also take into consideration that you’ll be projecting your results in color, but your audience will likely be holding a black & white photocopy in their hand. Strive for clarity.

Lastly

Capacity planning can fail when a resource not included in the plan bottlenecks under the peak load and ruins your post-peak Tiki bar celebration. Capacity planners (i.e., you) should learn from previous failures. If you run out of something, figure out how to meter that thing, and add it to the next capacity plan. Don’t fall in the same hole twice.

For More Information:

There is a lot more on capacity planning, and many other useful performance insights, in my book: The Every Computer Performance Book

A short, occasionally funny, book on how to solve and avoid application and/or computer performance problems