Capacity Planning For Tough Times

There are two situations where you create capacity plans that assume you have fewer computing resources than you have now: disaster recovery and budget cutting. In both cases, the first step is to make some hard business decisions about suffering and money.

Suffering

To save some money, is your company willing to let the users feel a little response time pain if a key system or device fails? Every company I’ve ever worked for had a different answer, and that answer changed depending on the market they were in and their current financial situation.

As a capacity planner, you are looking for a clear statement about response time and throughput goals such as:

If X fails, our customers shouldn’t feel a thing even if it happens on the busiest day of the year.
If X fails, our customers shouldn’t feel a thing on an average day. We are willing to gamble that this will not happen on the busiest day of the year.

In the above bullets, X is the thing you are considering doing without. It could be as small as a comm line or as large as a whole datacenter.

Money

At the end of the day, it is all about money. If you are asked to capacity plan for less hardware, and the chief consideration is money, then you need to know:

How much money is the company trying to save?
Where are they looking for savings: hardware cost, software licensing, facilities, etc.?
What are the savings per unit of hardware, software, floor space, etc.?

In many ways this is like playing the game Monopoly™ when you are low on cash. Suddenly you need to pay rent, so you have to return your buildings to the bank for the cash you need. You will select those buildings based on what they are worth and what they could make in rent as the game goes forward. The same is true in capacity planning. If the boss says he needs to save $200,000 in costs, then you need to know what eliminating a given machine will save you.

Dealing With Disaster

When planning for disaster, decide what part(s) of your world will be suddenly unavailable, and then mathematically shift the load to the still-working parts of your world.

For example, your overall projected peak load is 150 transactions per second (TPS), and you have three front end machines that each take 1/3 of that load. To capacity plan for handling that peak load with one of the front end machines down, just divide the peak load by two machines rather than three. So at peak the two remaining machines will each need to handle 150TPS / 2 = 75TPS.

On a normal, non-peak day, each front end machine handles about 25TPS during the busiest part of the day, and the measured utilization of system resources is what you are going to use to scale up the load. To get the scaling factor you need for your capacity plan, divide your per-machine projected peak load by the measured load.

If everything is working normally at peak, each front end machine will be handling 150TPS / 3 = 50TPS, and thus it will be doing twice the work of your metered day (50TPS / 25TPS = 2X).

If at peak one front end machine emits a puff of greasy black smoke and dies, then the remaining two machines will each be handling 150TPS / 2 = 75TPS, and thus they will each be doing three times the work of your metered day (75TPS / 25TPS = 3X).
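The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the numbers from the example; the function name `scaling_factor` and its parameters are made up for this sketch and are not from any tool.

```python
def scaling_factor(peak_tps, machines_up, measured_tps_per_machine):
    """Return how many times the metered per-machine load each
    surviving machine must carry at the projected peak."""
    if machines_up <= 0:
        raise ValueError("at least one machine must remain in service")
    if measured_tps_per_machine <= 0:
        raise ValueError("measured load must be positive")
    # Spread the projected peak evenly over the machines still running.
    per_machine_peak = peak_tps / machines_up
    # Compare that share with what each machine handled on the metered day.
    return per_machine_peak / measured_tps_per_machine


PEAK_TPS = 150      # projected peak load from the example
MEASURED_TPS = 25   # per-machine load on the normal, metered day

normal = scaling_factor(PEAK_TPS, machines_up=3, measured_tps_per_machine=MEASURED_TPS)
one_down = scaling_factor(PEAK_TPS, machines_up=2, measured_tps_per_machine=MEASURED_TPS)

print(f"all three up: {normal:.1f}X, one down: {one_down:.1f}X")
# prints: all three up: 2.0X, one down: 3.0X
```

The sketch assumes the load is spread evenly across the surviving machines, as it is in the example.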

When doing this work, you can also scale the disaster up to the datacenter level (What if a hurricane takes out our North Carolina datacenter?) or down to a single communications line. You just have to figure out what fraction of the pre-failure load the remaining hardware will have to pick up if X breaks. Then add that to each survivor's share of the projected peak for which you are capacity planning and do the math.

There are as many disaster scenarios as there are things that can break. You might get overwhelmed thinking of all the possible combinations, but remember: you don’t have to plan for all scenarios, just the worst ones. If you need to plan for various troubles that will leave System X carrying 110%, 250%, or 300% of the metered load, do the math for the worst case first. If the system can handle 300% of the metered load, it can clearly handle 250% or 110%.
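As a rough sketch of the "worst case first" idea, the Python below picks the scenario with the largest load multiplier and projects one resource's utilization under it. The scenario names and the 20% measured CPU utilization are hypothetical, invented only for illustration; the multipliers are the ones mentioned above. It also assumes utilization scales linearly with load, which you should verify for your own system.

```python
# Load each scenario leaves System X carrying, as a multiple of the
# metered load. Scenario names are hypothetical placeholders.
scenarios = {
    "scenario A": 1.10,
    "scenario B": 2.50,
    "scenario C": 3.00,
}


def worst_case(load_multipliers):
    """Return (name, multiplier) for the scenario with the heaviest load."""
    name = max(load_multipliers, key=load_multipliers.get)
    return name, load_multipliers[name]


def projected_utilization(measured_util, multiplier):
    """Scale a measured utilization (0.0 to 1.0) by a load multiplier.

    Assumes utilization grows linearly with load."""
    return measured_util * multiplier


MEASURED_CPU_UTIL = 0.20  # hypothetical: 20% busy on the metered day

name, multiplier = worst_case(scenarios)
projected = projected_utilization(MEASURED_CPU_UTIL, multiplier)

print(f"worst case: {name} at {multiplier:.2f}x the metered load")
print(f"projected CPU utilization: {projected:.0%}")
if projected >= 1.0:
    print("this resource cannot carry the worst case")
```

If the projected utilization for the worst case stays under whatever ceiling you are comfortable with, the milder scenarios need no separate calculation for that resource.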

Budget Cuts

When looking to reduce equipment costs, you have to figure out how to do more with less. There is almost never an easily applied magic fix. Instead, workload typically has to be moved to fewer systems, which entails some unforeseen consequences and some risk to the stability and availability of your computing world. It can be done, but there are some things to which you need to pay attention:

Computers are not interchangeable homes for processes, as they have different hardware that can support different versions of operating systems. Clearly, applications designed for the X operating system won’t run on the Y operating system. That’s obvious. What is often missed is that there can be compatibility problems between versions 10.1 and 10.2 of operating system X. Sometimes the hardware you plan to keep can only run 10.2, but the third-party code you depend on has a nasty bug in its 10.2 release, so you need to hold at 10.1.
Computers need to connect to the world, and some of them may require specific hardware that is only available on specific machines and versions of operating systems.
There are always the issues posed by who “owns” each machine and how that machine is accounted for in the budget. Even if you save the company $50,000, if that money is not in the right part of the budget, it doesn’t solve your problem.
There may be legal limits preventing you from having unencrypted data on certain networks or security constraints that prevent certain people with admin-level privileges from access to certain systems.
Whatever plan you come up with can’t screw up your disaster recovery plans.
If you’re planning to turn off some piece of hardware, then you need to account for everything going through that machine. Frequently, there is more interconnectivity and dependency than you see at first.
Processes have all sorts of connections, communication paths, and shared resources. Some of these run much slower when accessed across the net vs. locally. Some can only work if the processes are on the same machine.
Whatever files or databases you move will need storage space and a sensible way to back them up.
Speaking of files, it is amazing how many files a process needs, with some of them only accessed on special occasions. When you change the file structure it is not at all unusual to find critical files that have been completely forgotten by the local experts. Also, changes to directory and file access permissions (read, write, execute) can cause trouble. Plan to spend considerable time hunting files, finding connections, and debugging the results of the move.
When moving a process between machines with different CPU speeds, if the two machines are made by the same vendor, the vendor can usually give you a reasonable number to scale the CPU utilization. If moving to a different vendor’s machine, you might just have to do your own testing.
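For the last item in the list, here is a minimal sketch of how a vendor-supplied scaling number might be applied. The relative capacity ratings (100 and 160) and the 40% measured utilization are hypothetical, and the sketch assumes CPU utilization scales inversely with the rating, which is exactly the kind of assumption you should confirm with the vendor or by testing.

```python
def scaled_cpu_utilization(measured_util, source_rating, target_rating):
    """Estimate a process's CPU utilization after a move.

    measured_util: utilization on the source machine, 0.0 to 1.0
    source_rating, target_rating: relative capacity numbers for the two
    machines (hypothetical vendor ratings, larger means faster)."""
    if not 0.0 <= measured_util <= 1.0:
        raise ValueError("measured_util must be between 0.0 and 1.0")
    if source_rating <= 0 or target_rating <= 0:
        raise ValueError("ratings must be positive")
    # A machine with a higher rating needs a smaller share of its CPU
    # to do the same amount of work.
    return measured_util * source_rating / target_rating


# Hypothetical example: a process using 40% of a machine rated 100
# moves to a machine rated 160.
estimate = scaled_cpu_utilization(0.40, source_rating=100, target_rating=160)
print(f"estimated utilization on the new machine: {estimate:.0%}")
```

Add the estimate to whatever the target machine is already carrying before you decide the move fits.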

Doing the work of moving workload, processes, files, and networks is a deeply detail-oriented, complex task. Start by creating your plan with what you know, then spend time checking all the things on this list and whatever else you think of. Most of your plan will work; some of it won’t. Adjust and recheck all your assumptions. Repeat this plan/recheck process until it all seems to work, while keeping your boss in the loop to be sure the money saved is sufficient and the politics are working out as well.

This post was built on an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.

The Every Computer Performance Blog

What works in performance, capacity planning, load testing, modeling, and presenting results
