Your computing world is changing. To handle the next projected peak, your sales team has suggested you upgrade your computer to the new model, which they claim is three times faster than your old computer. How do you model a machine you have no experience with?
Regardless of the sincerity of the sales team and their dedication to truth, the claim that the new machine is three times faster is wrong in the same fundamental way as assuming your SAT scores predict your ability to write a best-selling novel. Every time you switch hardware, some parts are faster, some hardly change at all, and occasionally, some parts run more slowly. It's how your unique application uses those parts that determines how much more work the new system will handle.
So where to begin?
Start with the simple things. Do all the calculations that you can do simply and easily first. If they work out, then move on to the more detailed and complex work. If they don’t work out, then you have to rethink your answer, and you’ve just saved yourself the time you would have wasted on detailed analysis.
For example, that three times faster number they gave you is usually heavily weighted toward the CPU performance. So capacity plan your current system for your peak load and check to see if it will “fit” into your new system.
Let's say the next projected seasonal peak is 5X busier than a moderately busy day you metered recently. On that day the system was about 30% busy. Do the math (5 * 30% = 150%) to see that your old system would be 150% CPU busy at peak. The new machine is 3X as fast as your old machine, and it only has to be 1.5X faster to (barely) handle the load. Chances are you are good to go, CPU-wise.
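This sanity check is easy to wrap in a few lines. Here is a minimal sketch in Python using the numbers from the example above (the function name is mine, not from any standard tool):

```python
def cpu_headroom(current_busy, load_multiplier, speedup):
    """Project CPU utilization on the new machine.

    current_busy:    fraction busy on the metered day (0.30 = 30% busy)
    load_multiplier: how much busier the projected peak is (5X)
    speedup:         claimed speed ratio of new machine vs. old (3X)
    """
    old_peak_busy = current_busy * load_multiplier  # 1.50 = 150% busy on the old box
    return old_peak_busy / speedup                  # spread over the faster CPU

# Old system 30% busy, peak is 5X, new machine claimed to be 3X faster:
print(cpu_headroom(0.30, 5, 3))  # 0.5 -> new machine ~50% CPU busy at peak
```

If the result comes back well under 1.0 you probably fit; close to or over 1.0 means more testing before you commit.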
If the numbers had been uncomfortably close (e.g., the new machine was 1.7X faster than the old one), then more testing and checking would be in order. Remember, the closer you are to the edge of a performance cliff, the more precisely you have to know your position to stay safe. If it looks like the device in question is going to be over 50% busy, consult this post on queuing theory effects to get a rough estimate of the response time penalty you will pay.
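If you want a rough number right away, the classic single-server queuing approximation says response time grows as service time divided by idle capacity: R = S / (1 - U). This is a sketch of that rule of thumb under a simple M/M/1-style assumption, not a formula taken from the post referenced above:

```python
def estimated_response_time(service_time_ms, utilization):
    """Rough M/M/1 rule of thumb: R = S / (1 - U).

    service_time_ms: time to serve one request with no queuing delay
    utilization:     fraction busy, 0.0 up to (but not including) 1.0
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)

# A 10ms service time roughly doubles at 50% busy and grows 10x at 90% busy:
print(estimated_response_time(10, 0.50))  # 20.0
print(estimated_response_time(10, 0.90))  # ~100.0
```

Notice how steeply the penalty climbs past 50% busy; that is the performance cliff the paragraph above is warning you about.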
Now, dig through each part of the machine to make sure this upgrade will do the job. Do one thing at a time. Take good notes. Write your capacity report as you go.
The Hard Truth About Scaling
For any computer, application, or process you’re ever likely to encounter, the following describes the transaction path:
- Bits go in.
- Bits are transformed by the CPU.
- You may have to wait as bits are sent to, or requested from, local storage or some other computer.
- Bits go out.
It is in step three that your dreams of magical performance increases and simple scaling go to die. See Amdahl’s Law and Liebig’s Law Of The Minimum. Compared to pushing bits around in memory, waiting for data requested from local storage and other computers is tremendously slow. Also, when you upgrade a system, the time to fetch bits from local storage, or another computer, rarely keeps up with the overall speed increase the sales team promised you.
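Amdahl's Law makes this concrete: if only part of a transaction gets faster, the overall speedup is capped by the part that doesn't. A quick sketch, where the 3X CPU figure comes from the sales claim and the 80/20 split between waiting and computing is a made-up illustration:

```python
def amdahl_speedup(fraction_improved, component_speedup):
    """Overall speedup when only part of the work gets faster (Amdahl's Law)."""
    return 1 / ((1 - fraction_improved) + fraction_improved / component_speedup)

# If 20% of transaction time is CPU (sped up 3X) and 80% is waiting on disk:
print(amdahl_speedup(0.20, 3))  # ~1.15X overall, nowhere near 3X
```

The waiting dominates, so the "three times faster" machine barely moves the needle on this transaction.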
For example, if a process needs to read one record from disk for every transaction, then that IO may be the biggest throughput limit. Even when you upgrade to a faster CPU, the disk runs at about the same speed, so the transaction duration does not scale with the CPU speedup.
So, to handle 5X the load with your new machine you may need to add more processes. Any given process can only do so many transactions per second, and that number may not scale up to match the overall speed increase the salesperson claimed for the reasons outlined above. Let’s work through an example.
A Trick For Estimating Process Throughput
Many of the applications I've worked with had the ability to dynamically add new processes if the incoming workload required it. A trick I've used to find the maximum throughput of a process is to start fewer than are normally required and then wait for the user workload to build as the day progresses. I'd watch closely for signs that the transactions were backing up and, when I felt I'd hit the maximum throughput, I'd start the regular number of processes. Lastly, I'd do a bit of simple math to calculate the throughput of a process.
- With 2 processes I hit max throughput at 100 TX/sec. That gives me: 100 / 2 = 50 TX/sec per process and I know that each transaction takes about 20ms total time as 1000ms / 50 = 20ms
- During testing each process used ~150ms/sec of CPU. That gives me: 150ms / 50TX = 3ms CPU/transaction
- The CPU on the new machine is three times as fast. That gives me: 3ms / 3 = 1ms of CPU estimated per transaction on the new machine
- Each transaction on the new machine will spend 2ms less time computing so the average transaction time will be 20ms – 2ms = 18ms
- So that works out to a max throughput of ~55 transactions per second as 1000ms / 18ms = 55.5555
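The steps above can be sketched as a small function. The inputs are the measured values from the list; the function name and structure are mine:

```python
def new_machine_throughput(old_tx_per_sec, old_cpu_ms_per_sec, cpu_speedup):
    """Estimate per-process max throughput after a CPU upgrade.

    Assumes only the CPU portion of each transaction gets faster;
    the rest (disk and network waits) stays the same.
    """
    old_tx_time_ms = 1000 / old_tx_per_sec               # 20ms per transaction
    cpu_ms_per_tx = old_cpu_ms_per_sec / old_tx_per_sec  # 3ms CPU per transaction
    new_cpu_ms_per_tx = cpu_ms_per_tx / cpu_speedup      # 1ms on the faster CPU
    cpu_time_saved = cpu_ms_per_tx - new_cpu_ms_per_tx   # 2ms saved
    new_tx_time_ms = old_tx_time_ms - cpu_time_saved     # 18ms per transaction
    return 1000 / new_tx_time_ms

# 50 TX/sec per process, 150ms CPU/sec, CPU claimed to be 3X faster:
print(new_machine_throughput(50, 150, 3))  # ~55.6 TX/sec, up from 50
```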
It can be tempting to display lots of decimal places in the numbers you come up with as that gives the illusion of precision. However, the numbers you started with are typically not all that precise. Furthermore, if the future of your company hangs on a tenth of a transaction per second then you are cutting it way too close for anyone’s comfort.
So, on the old machine, each process could handle 50TX/sec, and on the new machine each process can theoretically handle 55TX/sec. Now you see why you’ll need more processes to handle the load even though the machine is much faster.
Just like waiting for bits from local storage, waiting for bits from another computer can take up a big chunk of the overall transaction response time.
You can do the same basic trick that we just did with local storage to find the max throughput of a given key process. When doing this work, make sure to look out for comm errors. You can't eliminate all comm errors, especially if the Internet is involved, but keep an eye on them as you are gathering your data. A significant increase in comm errors while you are gathering your data can have a big effect on throughput.
Look at the communications capacity to see if it can handle the projected peak load, which is 5X the traffic that your old system handled on a moderately busy day. Also be sure there is room for this increased traffic on whatever parts of the corporate network these packets flow through.
At the time of this writing, local storage, typically rotating magnetic disks, is the slowest part of any system and the most likely thing to bottleneck.
When upgrading, think of each disk not only as storage space, but as a device that can only give you a finite number of IOs per second. A new 2TB disk that can perform 200 IOs/sec is not the same as four older 500GB disks that can each perform 150 IOs/sec; together the four older disks can perform 600 IOs/sec.
A single process waiting for disk IO will notice a speed improvement when it is using the faster 2TB disk, but all the processes doing IO will overwhelm the 2TB disk long before they overwhelm the four 500GB disks.
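A quick sketch of that IOs-per-second comparison, using the per-disk numbers from the example above:

```python
def total_iops(disk_count, iops_per_disk):
    """Aggregate IO-per-second capacity of a set of identical disks."""
    return disk_count * iops_per_disk

# One big new disk vs. four smaller old disks:
print(total_iops(1, 200))  # 200 IOs/sec from the single new disk
print(total_iops(4, 150))  # 600 IOs/sec from the four old disks combined
```

Same total storage, but one third of the aggregate IO capacity: that is the trap to watch for when consolidating onto fewer, bigger disks.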
When moving files to the disks of the new system, remember that the size of the file tells you nothing about how many IOs per second the application does to it. If your operating system gives you per-file IO data, then use it to balance the IO load among your disks. If there are no per-file meters, then you need to have a chat with the programmers and take your best guess as to how much IO is going to each file. Once you decide on a plan, move only a small number of files at a time and see how each move goes.
The other thing to consider when balancing disk IO is the IO load of periodically scheduled jobs like backups, overnight processing, end-of-month reports, etc. I’ve seen systems that were rebalanced nicely for production that became a nightmare when these background jobs were run. These background jobs don’t care about response time, but they do have to finish in a certain window of time. Bad things happen when these jobs linger into the daytime and mess up the live user response times. I’ve never seen a perfect solution to this problem, so favor the live users and balance the best you can.
Check to see that the memory that comes with the new system is at least as big as the old system. In general, if your old computer had enough memory for a moderately busy day, then it will be fine at peak load. However, I’ve seen a couple of cases where the memory usage scaled up and down with the load, so look at the memory usage over time on your old system and see how it changes over the day, and from day to day. Memory upgrades are sold in such huge chunks that you don’t have to be very precise in estimating memory needs. If you need more, the next step up will be a huge improvement.
For more info on performance modeling see: The Every Computer Performance Book at Amazon, Powell’s Books, and on iTunes.