Capacity Planning is projecting your current situation into a somewhat busier future. To do this work you need to know what is being consumed under a relatively stable load, and then you scale that number to the projected load. So if some component of your computing world is 60% busy under the current load, it most likely will be 120% busy (and thus a serious bottleneck) under 2X the current load.
To meter for capacity planning you should collect data from every part of your computing world that shows:
- The current workload (transaction rate)
- How busy it is (utilization)
- How full it is (size, configuration or capacity limits)
Essentially you are looking to see how close you are to hitting any limit to growth, no matter how insignificant.
If there are well understood and closely watched workload meters for your system (e.g., “Currently we are handling 5,500 transactions per minute”) then collect them. You can do capacity planning without a workload meter as, in most cases I’ve worked on, the boss asked for a general scaling of the current load like: “Can we can handle twice our daily peak load?”
It is best to meter when the system is under a stable load, one where the load is not changing rapidly, because then you can get several samples to be sure you are not seeing some odd things that are not connected to the overall load. Below we see some samples where the load is stable over time, but there are things to notice here.
Ignoring the sample at 12:15 for the moment, notice that the overall the load is stable. It will never be perfectly stable where each sample gives you exactly the same numbers. Some variation will always happen. Values plus or minus 10% are fine.
If the overall load is never stable, then pick some metered value, like X TX/min, and look at all your samples for any sample you collect that shows X TX/min (± 10%) and see if the other numbers you are tracking, like disk busy and CPU, are stable in relation to it.
In the 12:15 sample the CPU usage essentially doubles even though the disk busy and the TX/min number are stable. This is either some oddball thing that happened and is not normal, or perhaps at 12:15 every day this happens, or perhaps this happens 10 times a day at somewhat random times. Either set yourself to solving this mystery, or if you are sure this is a non-repeating event, ignore the 12:15 data.
This is why you just don’t meter for 20 minutes on one day when doing capacity planning. You’ll look at lots of data, over many days, and try to come up with a reasonable estimate of how busy everything is under a given load. If there are days where unique demands, not related to load, are placed on your computing world, be sure to meter through those times, too. The whole reason for capacity planning is typically to ensure you have enough resources to handle a peak load, even if it arrives on a day when the system has other things to do, too.
A common mistake people make is to look at the CPU consumption of a process and use that to estimate how much more work it could handle. They might notice that a process is only consuming six seconds of CPU per minute and thus deduce that the process is only 10% busy (60/6 =0.10) but that’s usually wrong. Processes do other things besides burn CPU. They wait for disk IO, they wait for other processes to return an answer they need, they wait for resource locks, etc. A process can be doing all the work it can and still burn only a small fraction of the CPU it could if it were just crunching numbers. Gather the per process CPU consumption, but don’t mistake it as a straight-forward way to gauge how busy a process is.
When metering for capacity planning, gather any data you can for limits that can be hit. Notice things that kill performance, or sometimes whole applications, when they get full, empty, or reach some magic number. Any limit you hit, will hurt.
Some examples of limits your computing world might hit: disk space, available memory, paging space, licensing limits on the application, number of server processes, maximum queue depth, networking port limits, database limits.
Also consider the limits that per-user or per-directory security and quotas bring to the party. Look for limits everywhere and record how close you are to them.
This is an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.