Capacity Models For New Machines

Your computing world is changing. To handle the next projected peak, your sales team has suggested you upgrade your computer to the new model, which they claim is three times faster than your old computer. How do you model a machine you have no experience with?

Regardless of the sincerity of the sales team and their dedication to truth, the claim that the new machine is three times faster is wrong in the same fundamental way as assuming your SAT scores predict your ability to write a best-selling novel. Every time you switch hardware, some parts are faster, some hardly change at all, and occasionally, some parts run more slowly. How your unique application uses those parts will determine how much more work the new system can handle.

So where to begin?

Start with the simple things. Do all the calculations that you can do simply and easily first. If they work out, then move on to the more detailed and complex work. If they don’t work out, then you have to rethink your answer, and you’ve just saved yourself the time you would have wasted on detailed analysis.

For example, that three times faster number they gave you is usually heavily weighted toward the CPU performance.  So capacity plan your current system for your peak load and check to see if it will “fit” into your new system.

Let’s say the next projected seasonal peak is 5X busier than a moderately busy day you metered recently. On that day the system was about 30% busy. Do the math (5 × 30% = 150%) to see that your old system would be 150% CPU busy at peak. The new machine is 3X as fast as your old machine, and it only has to be 1.5X faster to (barely) handle the load. Chances are you are good to go, CPU-wise.
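That back-of-the-envelope check is simple enough to script. Here is a minimal sketch, using the 30% utilization, 5X peak, and 3X speedup figures from the example above:

```python
def cpu_headroom_check(current_utilization, load_multiplier, claimed_speedup):
    """Scale today's CPU utilization up to the projected peak, then
    divide by the claimed speedup to estimate the new machine's load."""
    projected_on_old = current_utilization * load_multiplier  # 0.30 * 5 = 1.50
    return projected_on_old / claimed_speedup                 # 1.50 / 3 = 0.50

# 30% busy on a moderately busy day, 5X peak load, claimed 3X speedup
print(cpu_headroom_check(0.30, 5, 3))  # 0.5 -> new machine ~50% busy at peak
```

Anything that comes back well under 1.0 suggests the new machine fits, CPU-wise; anything near or over 1.0 means trouble.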

If the numbers had been uncomfortably close (e.g., the new machine was 1.7X faster than the old one), then more testing and checking would be in order. Remember, the closer you are to the edge of a performance cliff, the more precisely you have to know your position to stay safe. If it looks like the device in question is going to be over 50% busy, consult this post on queuing theory effects to get a rough estimate of the response time penalty you will pay.
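As a rough sketch of that queuing effect: for a single device with random arrivals, the classic M/M/1 estimate puts response time at service time divided by (1 − utilization), so the penalty grows sharply past 50% busy. This is an approximation, not a substitute for the linked post:

```python
def response_time_multiplier(utilization):
    """Rough M/M/1 estimate of response time as a multiple of service
    time: R = S / (1 - U). At 50% busy you pay 2X; at 90% busy, 10X."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)

for u in (0.5, 0.7, 0.9):
    print(f"{u:.0%} busy -> ~{response_time_multiplier(u):.1f}x service time")
```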

Now, dig through each part of the machine to make sure this upgrade will do the job. Do one thing at a time. Take good notes. Write your capacity report as you go.

The Hard Truth About Scaling

For any computer, application, or process you’re ever likely to encounter, the following describes the transaction path:

  1. Bits go in.
  2. Bits are transformed by the CPU.
  3. You may have to wait as bits are sent to, or requested from, local storage or some other computer.
  4. Bits go out.

It is in step three that your dreams of magical performance increases and simple scaling go to die. See Amdahl’s Law and Liebig’s Law Of The Minimum. Compared to pushing bits around in memory, waiting for data requested from local storage and other computers is tremendously slow. Also, when you upgrade a system, the time to fetch bits from local storage, or another computer, rarely keeps up with the overall speed increase the sales team promised you.

For example, if a process needs to read one record from disk for every transaction, then that IO may be the biggest throughput limit. Even when you upgrade to a faster CPU, the disk runs at about the same speed, and so the transaction duration does not scale well, as you see below.

So, to handle 5X the load with your new machine you may need to add more processes. Any given process can only do so many transactions per second, and that number may not scale up to match the overall speed increase the salesperson claimed for the reasons outlined above. Let’s work through an example.

A Trick For Estimating Process Throughput

Many of the applications I’ve worked with had the ability to dynamically add new processes if the incoming workload required it. A trick I’ve used to find the maximum throughput of a process is to start fewer than are normally required and then wait for the user workload to build as the day progresses. I’d watch closely for signs that the transactions were backing up and, when I felt I’d hit the maximum throughput, I’d start the regular number of processes. Lastly, I’d do a bit of simple math to calculate the throughput of a single process.

  • With 2 processes I hit max throughput at 100 TX/sec. That gives me: 100 / 2 = 50 TX/sec per process, and I know that each transaction takes about 20ms total time, as 1000ms / 50TX = 20ms
  • During testing each process used ~150ms/sec of CPU. That gives me: 150ms/50TX= 3ms CPU/transaction
  • The CPU on the new machine is three times as fast. That gives me: 3ms / 3 = 1ms of CPU estimated per transaction on the new machine
  • Each transaction on the new machine will spend 2ms less time computing so the average transaction time will be 20ms – 2ms = 18ms
  • So that works out to a max throughput of ~55 transactions per second as 1000ms / 18ms = 55.5555
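The bullet math above fits in a few lines of code. This sketch assumes, as the example does, that only the CPU portion of the transaction speeds up while the waiting (disk, network) stays the same:

```python
def new_per_process_throughput(old_max_tx, processes, cpu_ms_per_sec, speedup):
    """Estimate per-process max throughput on a machine with a faster CPU."""
    per_process_tx = old_max_tx / processes            # 100 / 2 = 50 TX/sec
    tx_time_ms = 1000.0 / per_process_tx               # 20 ms per transaction
    cpu_ms_per_tx = cpu_ms_per_sec / per_process_tx    # 150 / 50 = 3 ms CPU/TX
    cpu_saved_ms = cpu_ms_per_tx - (cpu_ms_per_tx / speedup)  # 3 - 1 = 2 ms
    return 1000.0 / (tx_time_ms - cpu_saved_ms)        # 1000 / 18 ms

print(new_per_process_throughput(100, 2, 150, 3))  # ~55.6 TX/sec per process
```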

It can be tempting to display lots of decimal places in the numbers you come up with as that gives the illusion of precision. However, the numbers you started with are typically not all that precise. Furthermore, if the future of your company hangs on a tenth of a transaction per second then you are cutting it way too close for anyone’s comfort.

So, on the old machine, each process could handle 50TX/sec, and on the new machine each process can theoretically handle 55TX/sec. Now you see why you’ll need more processes to handle the load even though the machine is much faster.

Communications

Just like waiting for bits from local storage, waiting for bits from another computer can take up a big chunk of the overall transaction response time.

You can use the same basic trick we just did with local storage to find the max throughput of a given key process. When doing this work, make sure to look out for comm errors. You can’t eliminate all comm errors, especially if the Internet is involved, but keep an eye on them as you gather your data. A significant increase in comm errors while you are gathering your data can have a big effect on throughput.

Look at the communications capacity to see if it can handle the projected peak load, which is 5X the traffic that your old system handled on a moderately busy day. Also be sure there is room for this increased traffic on whatever parts of the corporate network these packets flow through.

Local Storage

At the time of this writing, local storage, typically rotating magnetic disks, is the slowest part of any system and the most likely thing to bottleneck.

When upgrading, think of each disk not only as storage space, but as a device that can only give you a finite number of IOs per second. A new 2TB disk that can perform 200 IOs/sec is not the same as four older 500MB disks that can each perform 150 IOs/sec, because together the four older disks can perform 600 IOs/sec.

A single process waiting for disk IO will notice a speed improvement when it is using the faster 2TB disk, but all the processes doing IO will overwhelm the 2TB disk long before they overwhelm the four 500MB disks.
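A quick sketch of that comparison, using the hypothetical disk figures from above:

```python
def total_iops(disk_groups):
    """Total IO/sec capacity, where each group is (disk_count, iops_per_disk)."""
    return sum(count * iops for count, iops in disk_groups)

new_storage = [(1, 200)]  # one 2TB disk at 200 IOs/sec
old_storage = [(4, 150)]  # four 500MB disks at 150 IOs/sec each

print(total_iops(new_storage))  # 200
print(total_iops(old_storage))  # 600 -- more spindles, more IO/sec
```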

When moving files to the disks of the new system, remember that the size of a file tells you nothing about how many IOs per second the application does to it. If your operating system gives you per-file IO data, then use it to balance the IO load among your disks. If there are no per-file meters, then you need to have a chat with the programmers and take your best guess as to how much IO is going to each file. Once you decide on a plan, move only a small number of files at a time and see how each move goes.

The other thing to consider when balancing disk IO is the IO load of periodically scheduled jobs like backups, overnight processing, end-of-month reports, etc. I’ve seen systems that were rebalanced nicely for production that became a nightmare when these background jobs were run. These background jobs don’t care about response time, but they do have to finish in a certain window of time. Bad things happen when these jobs linger into the daytime and mess up the live user response times. I’ve never seen a perfect solution to this problem, so favor the live users and balance the best you can.

Memory

Check to see that the memory that comes with the new system is at least as big as the old system. In general, if your old computer had enough memory for a moderately busy day, then it will be fine at peak load. However, I’ve seen a couple of cases where the memory usage scaled up and down with the load, so look at the memory usage over time on your old system and see how it changes over the day, and from day to day. Memory upgrades are sold in such huge chunks that you don’t have to be very precise in estimating memory needs. If you need more, the next step up will be a huge improvement.


For more info on performance modeling see: The Every Computer Performance Book at  Amazon, Powell’s Books, and on iTunes.



Modeling Projects Begin With a Question

In the beginning there is an idea, a goal, a mandate, or a proposal that leads to a performance question.

If you can answer that question through simple performance measurement or capacity planning, then do it and be done.

If this is a new application (and thus no metering is possible), or your computing world is changing radically, then you may need to build a model.


Some ideas take very little modeling to shoot down…

The first step to doing that is to really understand the question. Start with what they give you, and then ask very picky questions to clarify:

Boss: Will this plan to consolidate systems work?

You: At our seasonal peak load?
Boss: At the seasonal peak.

You: How much should I add to last year’s peak to scale for this year’s peak?
Boss: Plan on adding 10%.

You: How sure are you of that number?
Boss: Pretty sure, plus or minus 10%.

You: So I should plan for 11% (10% + (10% * 10%))?
Boss: No. To be safe, plan for last year’s peak plus 20%.

You: Is there any money available for new hardware?
Boss: Not a penny.

Ask all the key players for clarification and additional information in both positive and negative ways: “What do you want? What must we avoid?” Keep at it until you really understand the critical success factors, such as:

  • What success looks like in terms of throughput and response time
  • Constraints on budget and time that will limit your options for achieving the goals
  • Legal and availability concerns that will limit the configuration options
  • What is politically and bureaucratically possible

To be clear, this is not a license to waste people’s time by acting like a three-year-old asking “Why?” over and over and over. Keep your goal in mind, which is a clearly defined question to answer.

Brainstorm, Refine and Choose

To build a model to find an answer to a question you first have to guess the answer; then you can build a model to see if it is really the answer. To guess the answer, you first brainstorm a list of possible answers and then thin that list down to the best candidates.

Create a List Of Possible Solutions

Start with that question, and your knowledge of your computing world, and come up with a couple of workable solutions. This is the intuitive, creative part of modeling, and it is very much like writing a hit song.  There is no step-by-step procedure that will always end up with a great song. However, there are guidelines that will improve your odds of creating a workable solution to your question:

  • At first, don’t be judgmental. Any idea that, in any way, answers any part of the question is a good one at this point.
  • As you are brainstorming, be sure to write things down.  The saddest thing is to watch someone struggle to remember, like you do with a fading dream, the key insight that made an idea work.
  • Now sift the ideas you have based on key limits and demands of your question.  For example, if there is no money for new hardware, then set those ideas requiring new hardware aside, but do not discard them. You may need these ideas later.
  • If you have an abundance of ideas that may answer the question, then sort them based on simplicity, risk, and total cost, and take the top three.
  • If you have no ideas left that may answer the question, then you should ask the people in charge for guidance as to what in the question can be loosened, ignored, or worked around.
  • If you are stuck, go back to those failed ideas you set aside earlier. See if any of them will work for you now.

Refine The Solutions

You should now have a few possible solutions that might work. Before you go through the work to build models, refine your list of solutions by using a bit of simple capacity planning.

To answer the whole question you might need a model, but simple capacity planning of some small part of the whole question can sometimes show that a proposed solution doesn’t work. That sad news might get you to abandon the idea altogether, or it might lead to a modification of the original idea that makes it better. Use simple, rapid tools to find problems in your plan before wasting time modeling a solution that can never work. Let’s look at an example.

We are planning to add the workload generated by the new XYZ transaction to a key system. There is no test data yet, but reasonable people agree that, due to its complexity, the XYZ transaction should take 2.5 times the CPU resources of the current BBQ transaction on that key system.

The BBQ transaction consumes 20ms of CPU per transaction, so the XYZ transaction will most likely use (2.5 * 20ms) = 50ms of CPU. At peak load you are planning for 100 BBQ TX/sec and estimating that there will be 20 XYZ TX/sec. On this key system, just these two transactions will use three seconds of CPU per second, as you can see below.

  Transaction   Rate          CPU per TX   CPU demand
  BBQ           100 TX/sec    20ms         2000 ms/sec
  XYZ           20 TX/sec     50ms         1000 ms/sec
  Total                                    3000 ms/sec

If that machine is a two CPU machine, then this idea is dead in the water.
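The CPU arithmetic in this example is easy to sketch in code (the transaction rates and CPU costs are the estimates from above):

```python
def cpu_seconds_per_second(workloads):
    """Total CPU demand, where each workload is (tx_per_sec, cpu_ms_per_tx).
    Returns seconds of CPU consumed per wall-clock second."""
    return sum(rate * cpu_ms / 1000.0 for rate, cpu_ms in workloads)

demand = cpu_seconds_per_second([
    (100, 20),  # BBQ: 100 TX/sec at 20ms CPU each = 2 sec/sec
    (20, 50),   # XYZ: 20 TX/sec at 50ms CPU each = 1 sec/sec
])
print(demand)      # 3.0 seconds of CPU needed per second
print(demand > 2)  # True -> a two-CPU machine can't keep up
```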

Seek Out Missing Information

There will come a point where you know most of what you need to know but are missing a key bit of information on throughput, service time, utilization, etc. Here are a few key performance laws that will help you find the missing piece.

Little’s Law: Mean#InSystem = MeanResponseTime * MeanThroughput (Example)

Utilization Law: MeanUtilization = ServiceTime * MeanThroughput (Example)

Service Demand Law: ServiceTime = MeanUtilization / MeanThroughput

These equations can always be rearranged mathematically (e.g. if A=B*C then B=A/C and C = A/B) to find the thing you are missing.
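For example, here are two of those rearrangements in code, with made-up numbers just to show the units working out:

```python
def service_time(mean_utilization, mean_throughput):
    """Service Demand Law: S = U / X."""
    return mean_utilization / mean_throughput

def mean_in_system(mean_response_time, mean_throughput):
    """Little's Law: N = R * X."""
    return mean_response_time * mean_throughput

# A device 60% busy while serving 30 TX/sec -> ~0.02 sec (20ms) per TX
print(service_time(0.60, 30))

# 0.5 sec response time at 10 TX/sec -> ~5 transactions in the system
print(mean_in_system(0.5, 10))
```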

Double Check Your Work

As you are exploring these solutions, be sure to keep looking for a number that rubs you the wrong way or an assertion you can’t stop thinking is wrong.  Keep asking yourself questions like these.

  • Will this work at peak load?
  • Can that number really be that low?
  • Am I using the right metering data?
  • Did I mess up the units in this calculation?

A Note To Beginners

If this is your first time doing this, you are starting from scratch. Don’t make the mistake of doing that again next year. Save your report, your notes, and any tools you created to help you write this report.  Next time, you’ll have a good base to start with and build on. This will save you time and make sure you don’t forget things. As the time between peaks goes by, you’ll usually find new things to add to the model. Write them down, build them into your performance tools, and file them with all your other modeling data so you don’t forget them when you plan the next peak.


“You’ll do this again. Always take time to make things easier for your future self.” – Bob’s Tenth Rule of Performance Work



About Performance Modeling

I’d like to dispel the myth that performance modeling is too complex to be useful and show how modeling performance can save you a lot of time and money. Even though there are as many kinds of models as there are PhD candidates to dream them up, here we’ll just look at two wildly useful types of models: Capacity and Simulation.

Capacity Models

Capacity models are just regular old capacity planning with a bit of a twist.  Instead of “find a utilization and scale it,” now there is more work to do.  In a capacity model you are redirecting the flow of work, adding new work, adjusting the mix of current transactions, or incorporating new hardware. There are more things to count and account for.

To do a capacity model you have to understand what you’ve got, and then figure out how to adjust the numbers to compensate for the changes you are modeling. It requires performance monitoring, capacity planning skills, simple math and an eye for detail.

Capacity models can do a lot, but they can’t predict the response time under load.  If you need to know that, then you need a simulation model.

Simulation Models

Simulation models are a funny combination of an accounting program, a random number generator, and a time machine. They simulate work arriving randomly at a pace you select and then simulate the flow of work through a simulated computing world by accounting for costs and delays at every step. They can run faster, or slower, than real time. They can skip ahead through time when they’ve got nothing to simulate at the moment. They give you throughput, utilization, and response time information for any computing world that you can dream up. The only problem is that they sound scary.

I used to believe that simulation modeling could only be done by super-smart NASA engineers and was only reasonable to do in situations where things had to work the first time or people would be killed and millions of dollars worth of hardware would be destroyed. I used to believe that simulation modeling was incredibly expensive, hard, and time consuming. I was wrong.

I’ve found simulation modeling to be a useful and important tool. In a previous job I taught modeling concepts and a PC-based simulation modeling tool to rooms full of regular people who worked for regular companies, doing the normal work of maintaining and improving commercial systems. From that I learned modeling is doable. The stories I heard from my students about the models their companies relied on proved to me that modeling is useful.

Some Modeling Truths

Before I get into model building, here are some surprising truths about modeling…

Modeling Is Necessary

There are two kinds of important performance problems you can’t solve without modeling. You can’t do performance monitoring, capacity planning, or load testing on un-built systems, as there is nothing to test. You can’t use simple capacity planning or load testing to predict future performance on systems that are about to undergo radical transformations.

In both cases there is a bit of a chicken-or-egg problem as the company wants to know the cost of the hardware for the unbuilt or radically transformed computing world before it is built, but until you build/transform it, you don’t have all the data you need to make those projections. This is solvable.

All Models Are Wrong…


George Box once artfully said: “All models are wrong, some models are useful.” So, please take a moment and get over the fact that your model won’t generate a perfect result.

Nobody models to a high degree of accuracy because to get that you have to build wildly complex models that model every little thing. You have to put so much time into the model that the business is out of business before the model sees its first run. The 80:20 rule (see Pareto Principle) applies here.  A simple model can give you a ballpark answer. That is often more than good enough to green light a project or size a hardware request.

…Some Models Are Useful

Imagine an inaccurate model where you are guessing at many of the input parameters and unsure about the transaction mix or peak demand. You run the model and, even with the most optimistic assumptions, it forecasts that you’ll have to buy somewhere between two and five new servers. If the budget is closed for the rest of the year, then this useful model just saved you a lot of time on that sure-to-fail idea. The thing that makes a model useful is your confidence that it is accurate enough to answer your question.


“If you have to model, build the least accurate model that will do the job.” – Bob’s Ninth Rule of Performance Work



Modeling Can Be Done At Any Time

All models can be built at different stages of a project’s lifecycle: design, testing, and production.  At each stage there is data available to build a model.

In the design stage, models are built on educated guesses, the results of quick tests, business plans, and other less than concrete data. They answer big scale questions like “How many servers will we need?” and are not all that precise.

In the testing stage, models can be built with better data as the design is fairly fixed and there is some running software to meter. Here you can ask small scale questions such as “Can both these processes run on the same server?” as well as big scale questions like “Will this configuration handle the peak?”

In the production stage the entire application and computing world can be metered and tested against. Here, with enough work, you can build a model to answer almost any question.

Models Can Be Built To Different “Resolutions”

At some point in every model you treat some part of your computing world like a black box. Data goes in, data comes out, and you don’t care about the exact inner workings.  Depending on the question you want answered, the model can treat any part of your transaction path as that mysterious black box. It could be as small as a single communications link or as large as the entire datacenter.

The higher the resolution of the model, the more costly and time consuming it is to build. Do not confuse high resolution with high accuracy, as they are not the same.  A low-resolution model can give you a spot-on accurate answer.  For example, if you are modeling your datacenter’s Internet connection, you don’t care what happens inside the datacenter, you care about how the bandwidth requirements will change as the transaction mix changes.

More to come…

There is much more to say about models, but that will have to wait for a future post.



Metering Deeply To Answer Questions About The Future

You build a performance model to answer a question that capacity planning or load testing can’t answer. Since you can’t test the future, you have to meter what you’ve got in clever ways and then mathematically predict the future reality.

Performance Model Questions

The UltraBogus 3000 is a mythical state-of-the art computer of yesteryear whose total downtime has never been exceeded by any competitor.  It features fully-puffed marketing literature, a backwards-compatible front door, and a Stooge-enabled, three-idiot architecture that processes transactions with a minimum of efficiency.  Its two-bit bus runs conveniently between your office and the repair facility every Tuesday.

Let’s start with a couple of performance model questions: “Will upgrading to the new UltraBogus 3000 computer (described above) solve our performance problem?” or “If we move just the X-type transactions to the XYZ system, will it be able to handle the load?” You can’t meter or load test a computer you don’t have, and the performance meters typically show the net result of all transactions, not just X-type transactions.

To answer these kinds of questions, you have to know the picky details of how work flows through your system and what resources are consumed when doing a specific type of transaction. Here are a couple of ways I’ve done this.

Finding Specific Transaction Resource Requirements

You can use load testing to explore the resource consumption and transaction path of a given transaction by gently loading the system with that kind of transaction during a relatively quiet time. Send enough transactions so the load is clear in the meters, but not so many that response times start degrading. Any process or resource that shows a dramatic jump in activity is part of the transaction path. Imagine you got this data on a key process before, during, and after a five-minute load test on just transaction X:

[Figure: reads/sec, writes/sec, and CPU ms/sec for the key process before, during, and after the load test]

Now do a bit of simple math to figure out the resource consumption of a given X transaction. First, notice that the before and after consumption of resources is about the same. That is a good sign that there was no big change in the user-supplied load during the load test. For reads, it looks as though the number per second increased by about 100/sec during the test, so it’s reasonable to estimate that for every X transaction this key process does one read. The writes didn’t really change, so no writes are done for the X transaction. I’d recommend you repeat this test a couple of times. If you get about the same results each time, your confidence will grow, and the possibility of error is reduced.

For CPU, the consumption during the load test jumped by about 600 milliseconds per second and therefore each X transaction requires 600 / 100 = 6ms of CPU.  Overall, this test tells us that the X transaction passes through this key process, does one read, does no writes, and burns six milliseconds of CPU.

Please note, the math will never be precise due to sampling errors and random variations. This methodology can give you a value that is about right. If you need a value that is exactly right, then you need to run this test on an otherwise idle system, perhaps at the end of a scheduled downtime.
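The per-transaction math from this load test is just deltas divided by deltas. A minimal sketch, using the figures from the example:

```python
def per_transaction_cost(resource_jump_per_sec, tx_jump_per_sec):
    """Divide the jump in resource consumption seen during the load test
    by the jump in transaction rate to cost out one transaction."""
    return resource_jump_per_sec / tx_jump_per_sec

# CPU consumption jumped ~600 ms/sec while the test added ~100 X TX/sec
print(per_transaction_cost(600, 100))  # 6.0 ms of CPU per X transaction

# Reads jumped ~100/sec for the same ~100 extra TX/sec
print(per_transaction_cost(100, 100))  # 1.0 read per X transaction
```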

Finding The Transaction Path

You can use a network sniffer to examine the back-and-forth network traffic and record the exact nature of the interprocess communication. For political, technical, and often legal reasons, this is usually only possible on a test system. Send in one transaction at a time, separated by several seconds, or with a ping, so there is no overlap and each unique interaction is clear. Study multiple examples of each transaction type.

The following timeline was created from the network sniffer data that showed the time, origin, and destination of each packet as these processes talked to each other. During this time there were two instances of the X transaction, with a ping added so the transaction boundaries were completely clear. We can learn many things from this timeline.

[Figure: packet timeline showing two X transactions, separated by pings, flowing between processes A, B, and C]

First, we know the transaction path for the X transaction (A to B to C to B to A), which can be very useful. Please note, a sniffer will only show you communication over the network. It won’t show the dependencies on other processes and system resources that happen through other interprocess communication facilities. This gives you a good place to start, not the whole map of every dependency.

We can also see exactly how long each process works on each thing it receives. Process C worked on each X transaction for about 200 milliseconds before responding. Little’s Law (and common sense) show that if you know the service time, you can find the maximum throughput, because:

MaxThroughput  ≤  1 / AverageServiceTime

Since this was an idle system, the wait time was zero, so the response time was the service time. If we do the calculation and find that 1/.2 = 5, then we know that Process C can only handle five X transactions per second.

If the question for the model was “If we don’t reengineer Process C, can we get to 500 X transactions/sec?” then the answer is no. If the question was “How many copies of Process C do we need to handle 500 X transactions a second?” then the answer is at least 500 / 5 = 100, plus more, because you never want to plan for anything to be at, or near, 100% busy at peak.
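Here is that sizing logic as a sketch. The 70% target utilization is my own assumed safety margin, not a number from the trace:

```python
import math

def max_throughput(avg_service_time_sec):
    """Upper bound on one copy's throughput: 1 / average service time."""
    return 1.0 / avg_service_time_sec

def copies_needed(target_tx_per_sec, avg_service_time_sec, target_busy=0.70):
    """Copies required to hit the target rate while keeping each copy
    below target_busy (never plan for anything near 100% busy)."""
    per_copy = max_throughput(avg_service_time_sec) * target_busy
    return math.ceil(target_tx_per_sec / per_copy)

print(max_throughput(0.200))      # 5.0 X transactions/sec per copy
print(copies_needed(500, 0.200))  # 143 copies at ~70% busy each
```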

When it’s time to build a model, often you’ll need to know things that can be difficult to discover. Share the problem widely as others might have a clue, a meter, or a spark of inspiration that will help you get the data. Work to understand exactly what question the model is trying to solve as that will give you insight as to what alternative data or alternative approach might work.

While you’re exploring, keep an eye on the big picture as well. Any model of your computing world has to deal with the odd things that happen once in a while. Note things like backups, end of day/week/month processing, fail-over, and disaster recovery plans.

This post was built on an excerpt from The Every Computer Performance Book, which you can find on Amazon and iTunes.