Load Tests: Failures To Avoid and Opportunities to Seize

In no particular order, here are a few final things that have worked well for me and some lessons I’ve learned while load testing.  Examine each one. Think about them in the context of your organization, your personality, your duties, and the daily challenges you face.  Use what works and discard the rest.

Load Testing Is a Team Sport

Load testing requires the help and cooperation of many people. This is usually extra work done at very inconvenient times. Those people can be ordered to cooperate, but they'll give you so much more if they understand what's going on and are treated with a little respect. Here are a few good things to do:

  • Ask for their help.
  • Give plenty of advance notice.
  • Explain what you are doing with the test and inquire about any unforeseen difficulties the test might create, or opportunities the test might offer if a small change is made.
  • Bring food.
  • Share your results after the test.
  • Say “Thank You” and mean it.
  • Pay it forward by always being helpful to others.

Reality vs. The Lab

It’s vastly easier to test the application in its native environment than it is to recreate and test a copy of it in a lab on rented equipment. If you have to go into the lab, expect many delays as you discover just how fussy a computer program can be about the computer it runs on, its surroundings, the placement of files, and network connectivity.

Lab-based Load Test Failures I’ve Seen More Than Once

The team showed up with almost everything they needed. Pack carefully for the trip.

The team showed up with unfinished or poorly tested software. Don’t plan to fix it in the lab.

The team assumed that key equipment would be there. They assumed the right version of software and hardware would not only be available, but already be installed for them. They assumed an expert would be available at a moment’s notice. Don’t assume, check.

The team finally ran the load test successfully after several hardware and configuration changes. Sadly, no one recorded key details of the configuration that finally worked, so the results were meaningless.  Take careful notes and make load tests as self-documenting as possible.

The team did not plan how they would reset the system quickly between tests, so they spent 75% of their lab time resetting for the next run. Plan for multiple runs and do what you can to speed up test resets.

Use Checklists

The smallest forgotten detail can ruin a perfectly good load test. Have a checklist and use it. Checklists work for NASA, for surgeons, for pilots, and for load testers.

Ironically, all checklists are incomplete. When you find something missing from the list, take the time to update the master copy of the list.

Communicate During The Test

A full-power peak load test has the potential to be a disruptive event. Before the test starts, open a voice bridge so that all interested parties can follow what’s going on and report troubles quickly.  Check in with everyone periodically at key points. For example:

  • Before testing starts ask everyone: “Are you ready?”
  • When the test is first running at low power: “Do you see that load in your meters?”
  • As the load is ramping up: “Any problems?”

Time Is Of The Essence

Everything you can do to save yourself time, and everything you can do to speed up and automate your testing, is a good thing to do. This allows you to do more testing in a given amount of time and to rapidly diagnose problems. During the test many people will be impatiently waiting for you.

Spend time before the test preparing your tools to:

  • Start the metering and then the test
  • Check if the test and the metering are running as expected
  • Rapidly capture every possible clue when the test fails
  • Reset things to a known state after a run
  • Do rapid performance analysis of the metering data looking for problems and bottlenecks

Clean Up After Yourself

This will not be your only load test. Make sure you have time in the schedule for cleanup.

If you are in your own internal lab, or some vendor’s lab, leave it nicely picked up when you are done testing. The people who will have to clean up your mess are also the people who will help you set up for the next test.

Someday, you’ll be back in that lab and need their help again.

Sharpen The Sword

After you do a big load test, take time to review your notes, logs, and memories, looking for things that worked well and things that did not. Improve your tools for next time. This is also a good time to publicly thank those who helped you.

Load Tests: Analyzing The Results

Every load test tool is different, but the goal is always the same… to create an artificial load that you believe will demonstrate that your computing world can handle the work.

How do you come to have faith in your load test?

First, you validate that your load test brings a load that fairly represents a normal, not-too-busy day. If you are testing the live system, run the test at a time when the real user load is very low. Adjust the load test so that you are bringing enough work into the system to emulate a moderate load on a not-too-busy day. If all the performance, response time and throughput meters you have look remarkably similar to what you’d expect a real, live, user load to be at that intensity, then you’ve validated your load test.

Validation Tests

Below is an example where a load test was run at a moderate load of 500 TX/min, late at night when few real users were on the system. The load is turned on and off three times so that we can clearly see the background load the sleep-deprived late-night users were adding, and so we can see if our results are repeatable. The chart below shows the overall CPU busy for a key machine.

[Chart: overall CPU busy for a key machine during the three validation test runs]

In the first two tests, the transaction load and the CPU utilization moved together nicely. In the third test, the CPU busy started moving upwards before the transactions were sent, so something else was asking this system to do work. A short investigation might show that some unusual, but explainable, activity caused this. We can ignore that third test and then judge if, at 500 TX/min, this was a normal amount of CPU consumption compared to the live load. The meters captured during the validation test and when the live user load is on the system do not have to match perfectly. There will always be some noise in the data.

Looking at CPU consumption is a good place to start your validation work, but check your other performance meters as well. See how all the meters match up during the validation test.  If some of the meters don’t seem to make sense, there are several things to check, improve, and adjust:

  • Make sure all the transactions you are sending in are getting valid responses, not error messages.
  • Perhaps there is not enough variability in a given transaction. Searching for the word “cow” a million times in a row is not the same as doing a million searches from a randomly chosen list of a thousand different words.
  • If you can’t seem to drive the transaction rate high enough, perhaps you are having a locking issue. For example, simultaneously updating the same user record from many virtual users can prevent the test from scaling up as everyone is waiting for everyone else.
  • For load tests with a complex transaction mix you might want to test each transaction type separately, so you know it is working as expected.
  • Perhaps you need to script additional transactions and add them to the workload.
  • Perhaps you need to adjust the ratio of the different transaction types in the workload mix.

Keep testing and refining your load until it looks right, runs without error, and returns results that are close enough to what you see on the live system. This takes time, but over time you build confidence in your load test.

Analyzing Load Tests

With your test validated, now you are ready to try a full-power load test. This is a big event, as you are going to push your computing world hard. Things might break, automated performance alerts will go out, dashboards will turn unhappy colors, and, if this is a test on the live system, user performance will suffer. Be sure that everyone is informed well before the test and has an easy way to communicate during the test.

Start the load test at the level you’ve previously validated. Run at that level long enough so that you have time to check in with the key players to see if their meters are running and everything is as expected. If all is good, then ramp up your load over time until you get to your goal and run at the goal transaction rate long enough to get multiple samples of the internal meters.

Below is a load test that was initially validated at 100 virtual users, with a goal load of 625 virtual users.

[Chart: load ramping from 100 to 625 virtual users; average response time stays level except for one spike at 11:27]

This is a happy load test as the average response time stayed nice and level except for one data point at 11:27. Since everything else looks good, you might choose to ignore that and declare a success. However, when you present your results, it is very probable that someone in the meeting will be fascinated with this odd result. You could rerun the test and hope the event was a one-time thing, or you could do the right thing and figure out what caused that spike in response time.

Below is an unhappy load test, as the more work you brought to the system, the worse the response time got. At 100 virtual users, the average response time was 0.7 seconds. Once we got to 150 virtual users, the response time started climbing. That is never good.

[Chart: average response time climbing steadily as the virtual user count increases past 150]

When setting goals for the load test, you should have well defined upper limits for response time and number of errors. Typically the response time goals are a very small multiple of normal low-load response times. So if normal here is 0.75 seconds, then you might have a peak-load goal of no worse than 2X that number, or 1.5 seconds. Your boss will give you guidance on the ceiling for this number.

Oddly enough, response time can go down under increasing load. This is never a good thing, because it only happens when something is failing. It is often faster to fail than it is to do all the work the transaction requires.

Below is a graph of a bad test where the response time climbs and crashes over and over. This pattern is an artifact of this particular load generation tool. Once this tool hits a certain error threshold, it will reduce the load until the errors subside and then it will ramp up the load again. Regardless of the particular pattern, if you see response time dramatically improving as the load climbs, something is wrong. Start looking for what is failing.

[Chart: response time repeatedly climbing and crashing as the tool backs off when errors mount]

Below is a different chart from the same bad test that shows transactions completed and average response time for those transactions. Notice the lines crossing again and again. The low points are where transactions are failing quickly.

[Chart: transactions completed and average response time crossing again and again]

Every load test you ever build or buy will have unique ways of presenting the results and detecting and dealing with errors. Before your first real load test, you should explore the tool, see how it behaves, and understand exactly what the meter is telling you.  For example, if the tool gives you a number labeled “transactions”, are they started, or completed, transactions? If the tool gives you a number for virtual users, is that the target number or the actual number? Is that number the average during the sample period or the total number at the beginning of the minute? These fine distinctions make a big difference in how much you can deduce from any meter.

After The Load Test

Once the load test finds a point where the response time gets bad and lots of errors are happening, stop and study the results. Let’s say your computing world ran out of gas at about 1000TX/minute, and your goal was to get to 1600TX/minute. It can be helpful to run an additional test where you approach the failure point in stages, pausing at each stage long enough to give the internal meters time to get several good samples.  So your test might run with an unchanging load for five minutes at 800, then jump to 900, then jump to 1000 TX/min.

Your internal meters at 800 TX/min are approximately half as busy as they will be at your target transaction rate of 1600 TX/min. This is a good opportunity to do some quick capacity planning and see what you are likely to run out of. Doing this means you get more information out of each load test, and that means lower costs and fewer 3am load test runs to wake up for. Running the test at 900 and 1000 TX/min allows you to double check your work. I’ve seen instances where a resource that looked like it was going to be the bottleneck did not increase in proportion to the additional load.



Load Tests: Setting The Goals


A load test either achieves a certain goal, or it dies trying. If your computing world crumbles halfway to that goal, then your performance meters should contain some good clues as to what to fix, reengineer, or upgrade before you try again.

The goals for the load test are often based on what your computing world has handled in the past, plus an increase for the projected growth.

All load test goals boil down to just a few questions. How hard are you going to push your computing world? What response time is acceptable?  What error rate is acceptable? What internal meters will you collect?

How Hard To Push

If what you are testing is user visible, then you may frame your goals for how many virtual users will be simultaneously supported.  In real life, each user will submit work at a given pace and, for multi-step transactions, usually wait between steps as they read and think about the results of the previous step.  This leads to two different ways of counting virtual users: concurrent and simultaneous.

Concurrent virtual users are connected to the system and are requesting work at some regular interval. Simultaneous virtual users are all requesting work at the same time.

If your computing world were a bar, then the concurrent virtual users would be all the patrons in the bar, and the simultaneous virtual users would be the patrons who are currently asking the bartender for a drink.

If you were running a load test as a stress test where the wait time was zero, then concurrent virtual users equal simultaneous virtual users as every patron would be guzzling each drink served and immediately requesting a new one.

If your boss wants the goal stated in terms of the number of simultaneous transactions, or users, then all you need to do is set the think time to zero and use enough virtual users to match the goal number, plus a few extra. What are the extra for? Once a virtual user finishes a transaction, it takes some non-zero time to report the results and reset itself to go again. You can figure out the right number of extra virtual users when you are doing the low-power test validation testing. The shorter the transaction is in relation to the reset time, the more extra virtual users you will need to achieve your goal.

[Chart: a virtual user’s cycle split between transaction time and reset/report time]

In the example above, the average transaction time and the reset time are both about the same. Each virtual user will only spend half of its time keeping your computing world busy.
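Here is that sizing arithmetic as a rough Python sketch. The formula is just the busy-fraction reasoning above, not something from a particular tool, and the goal count, transaction time, and reset time are made-up example values you would actually measure during low-power validation testing.

    import math

    def virtual_users_needed(goal_in_flight, avg_txn_secs, avg_reset_secs):
        # A virtual user only keeps the system busy during the transaction
        # part of its cycle, so scale the goal up by the full cycle time.
        busy_fraction = avg_txn_secs / (avg_txn_secs + avg_reset_secs)
        return math.ceil(goal_in_flight / busy_fraction)

    # As in the example above: transaction time and reset time about equal,
    # so each virtual user is busy only half the time and you need ~2X the goal.
    print(virtual_users_needed(goal_in_flight=200, avg_txn_secs=2.0, avg_reset_secs=2.0))  # 400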

If your boss wants the goal stated in terms of number of concurrent virtual users, then just start that number of virtual users plus a few to handle the resetting and reporting downtime, as mentioned above.

When defining the goal you might not have a way to directly measure the number of concurrent virtual users. In that case you can estimate the number of concurrent virtual users in your computing world if you know the number of sessions in a peak hour and the average session duration (a session being the time a user is considered concurrent), using Little’s Law:

 ConcurrentUsers = NumOfSessions * AvgSessionDuration

First find the total number of sessions in a peak hour. Then convert the average session duration to units of an hour.

    0.05 hours = 180 seconds / 3600 seconds per hour

Then multiply these two values together to find the number of concurrent virtual users.

Here is the same calculation applied to the physical world.

Suppose you are building a fast food restaurant and you want to serve 200 people per hour at peak. You also know from previous experience that the average diner spends 15 minutes sitting at a table in this type of restaurant (a 0.25 hour session duration). How many chairs will you need? 200 * 0.25 = 50 chairs. Those chairs represent the concurrent users, or in this case, concurrent diners.
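Here is that estimate as a small Python sketch. The 180-second session and the restaurant numbers come from the text above; the 4,000 sessions per hour figure is a made-up example.

    # Little's Law sketch: concurrent users = arrival rate * average time in system.
    def concurrent_users(sessions_per_hour, avg_session_secs):
        avg_session_hours = avg_session_secs / 3600.0      # e.g. 180 s -> 0.05 hours
        return sessions_per_hour * avg_session_hours

    print(concurrent_users(4000, 180))       # 4000 sessions/hour * 0.05 h = 200 concurrent users (assumed example)
    print(concurrent_users(200, 15 * 60))    # the restaurant: 200 diners/hour * 0.25 h = 50 chairs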

If your boss wants the goal stated in transactions per second then your job is easy, as during the low-power test validation testing, you’ll get a good idea of how many virtual users you need to generate a throughput of X Transactions/second.

Response Time

Measuring throughput without looking at response time is foolish. If you allow infinite response times, then any computer can handle any load.  When monitoring response times in a typical load test, you will see this normal progression as the load increases:

  1. At a low load the response time looks fine.
  2. As you increase the load, at some point the response time will start to climb as a key resource bottlenecks.
  3. If you push hard enough the response time will either keep climbing, or start dropping as transactions fail.

The odd thing about response time is that sometimes it is faster to fail than to succeed. It can be much faster to return an error like “Zoiks! Database lookup failure” than to perform a long, complex query. Once an application is warmed up (processes started, key files in cache, etc.) I’ve never seen a case where adding more load improved response time. My rule for performance work is: if the response time is improving under increased load, then something is broken.

For load test goals you need to define an upper limit for acceptable response time.  Once you hit that number, then it is pointless to push your computing world harder.  You may want to give some thought to how you specify that number. Response time matters a lot to users and how you specify the number will determine the acceptable number of suffering users at peak.  You can specify response time as:

  1. No response time will exceed X seconds.
  2. The average response time will not exceed X seconds.
  3. 95% of transactions will take less than X seconds.

The first option (“No response time will…”) is very strict and will cost you lots of money to buy all the additional hardware needed to keep the response time for every transaction under this limit. Also, if part of your transaction path crosses the Internet, then that part of the path is totally out of your control. A former colleague often says: “Bad things happen on the Internet.”

The second option (“The average…”) is much easier to hit, but it has a problem in that many users will still be suffering.  Depending on the distribution of response times, this could be half of your users. That’s a lot of unhappy users to plan for.

The third option (“95% of transactions…”) is your best bet. This lets you run your computing world harder than the first option and allows fewer unhappy users than the second option. If you prefer, you can pick a different number than 95%. Some people like 98%, some people like 90%. Just as long as the boss accepts this number, all is well.
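If your tool only hands you raw response times, checking the third style of goal can be as simple as the sketch below. The sample times and the 1.5-second ceiling are made-up illustration values, and the percentile is a crude nearest-rank calculation rather than anything from a specific load tool.

    import math

    def percentile(samples, pct):
        ordered = sorted(samples)
        rank = math.ceil(pct / 100.0 * len(ordered))   # nearest-rank method
        return ordered[rank - 1]

    response_times = [0.6, 0.7, 0.7, 0.7, 0.8, 0.8, 0.9, 1.1, 1.4, 2.3]  # seconds (example data)
    goal_secs = 1.5
    p95 = percentile(response_times, 95)
    print(f"95th percentile = {p95:.2f}s -> {'PASS' if p95 <= goal_secs else 'FAIL'} against a {goal_secs}s goal")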

Quality of Results

Your load test tool has to have a way of evaluating the quality of responses it gets back from your computing world. Simulating a high volume user load can be a tricky business fraught with error. Applications will break under load, security checks will start getting in the way, third parties will have problems, and bad things will start happening on the Internet. The ultimate goal of any load test is to have your computing world smoothly handle a simulated peak load. The term “handle” does not mean to return useless nonsense in a fast and efficient manner. Your load test tool needs to check the returned results by looking for problems, error messages, and/or missing data.

Internal Meters

Be sure to collect all the internal metering data you can. During a load test, the load can be held very steady. You can choose the exact mix of transactions, so performance meters can be calibrated (at 1000 TX/sec the bamboozle/sec meter = 4000) and explored (transaction X has twice the effect on this meter as transaction Y) with greater ease and precision than when metering a live user load.

Load Testing: Different Approaches

The way you tune the settings of your load test is determined by what questions you need answered. It is not just about the next big peak. There are several different kinds of load tests: Load Testing, Isolation Testing, Stress Testing, and Endurance Testing. Each one of these can tell you interesting things about your computing world.

Load Testing

A load test applies a realistic external load that simulates the anticipated peak to directly measure response time, throughput, and other key internal performance meters. A load test can show you under what load the response time and throughput will start to get ugly.

A load test can also validate your capacity planning efforts. For example, when the capacity plan you just completed says your site can handle 500 things/second:

  • The load test might run just fine at 500 things/second, and all key resources report the predicted utilizations and throughput rates. All is well.
  • The load test might run just fine at 500 things/second, but the utilizations of key resources were well below what the capacity plan projected. You better double check the math on your capacity plan and double check to make sure that the load test ran as expected.
  • The load test might only get to 340 things/second before some unexpected resource bottlenecks. Fix that bottleneck, add that resource to the capacity plan for next year, and load test again.

A load test can also allow you to calibrate your meters, and other performance information gathering tools, under a very stable load. For example:

  • At 100 TX/sec the bamboozle/sec meter = 400
  • At 500 TX/sec the bamboozle/sec meter = 2000
  • At 1000 TX/sec the bamboozle/sec meter = 4000
  • So now you have multiple data points showing that each TX (on average) generates four bamboozles
  • Now you can use the bamboozle/sec meter any time as a workload meter by dividing it by four, even though you have no clue what a “bamboozle” is.
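Here is what that calibration looks like as a tiny sketch, using the data points from the list above. The 2,600 bamboozles/sec reading at the end is a made-up later observation.

    # Calibrating a mystery meter against a known transaction rate.
    calibration = {100: 400, 500: 2000, 1000: 4000}          # TX/sec -> bamboozles/sec

    ratios = {tx: bam / tx for tx, bam in calibration.items()}
    print(ratios)                                            # every point shows 4 bamboozles per TX

    def tx_from_bamboozles(bamboozles_per_sec, bamboozles_per_tx=4.0):
        # Once the ratio is stable, the mystery meter becomes a workload meter.
        return bamboozles_per_sec / bamboozles_per_tx

    print(tx_from_bamboozles(2600))                          # ~650 TX/sec inferred from the meter alone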

Isolation Testing

In a load test you control the mix of transactions, so you can feed your computing world a pure stream of only one kind of transaction, rather than the mix of transactions the users usually generate. This can be a useful way to explore your computing world.

First, notice if there are any systems or resources that you did not expect this type of transaction to use. If a performance meter surprises you, you have more work to do.

If you are hunting a problem, or checking to see if some change to your computing world made a difference in performance, isolation testing can help.  By testing the major transaction types separately, you might find that only the X transaction has dramatically slowed down or that only the Z transaction causes the problems you are seeing.

An isolated test of a transaction made of distinct parts (e.g., a web-based transaction that visits the home page, searches, and then puts that thing in a cart) can show you clearly which part of the longer transaction is having performance troubles under what load.

[Table: per-step response times (home page, search, cart) at increasing transaction rates]

Since you ran this transaction in isolation, the metering data you are getting from your computing world is only showing work generated by this kind of transaction. That makes it easier to find the bottleneck. Clearly, from the table above, the Search part of this transaction is having some problems at 200 TX/sec and is really hurting at 300 TX/sec. The performance meters during this load test should give you a big clue as to where the problem lies.

Stress Testing

A stress test is just a load test where you purposely overdrive the system to find its breaking point.  This is done by running a load test with a normal transaction mix, but with way too many users and/or no think time.


If you run your load test and you achieve your goals it is still useful and interesting to push the system to its breaking point to see exactly how it breaks – so you know it when you see it. Would you want to go to a doctor who had studied medicine carefully, but never seen a really sick person?

Endurance Testing

This is a load test where you study if everything in your computing world can keep running over time, not just for a few minutes.  Typically, the load is well below the peak load, and what you are looking for are things that you can run out of or that don’t scale well.  Common questions to look at are:

  • How much disk capacity is consumed per transaction?
  • How much memory is leaking per transaction?
  • Is there an unknown hard coded limit in the software?
  • Is throughput and response time just as good after the millionth transaction (when files are bigger, databases age, and sorting algorithms are challenged) as it was for the early transactions?

These tests are important to run before software is put into production as stopping and fixing problems in the middle of the day on the live system is usually not an option the company wants to take.

Generating The Load

Some collection of computers and software generates the incoming workload for your load test and evaluates if the work is being handled successfully.

Depending on your situation you might build it yourself, or have some company generate the load for you, typically via the Internet.

Here are some things to look for when evaluating your options.

Location

Where you generate the load matters. The load should flow through as much of your computing world as it would normally. Any part of your computing world that you do not test is, by definition, untested. That untested part is likely to keep you up at night worrying and surprise you, in an unpleasant way, during the peak with its shocking lack of throughput.

If you are doing a stand-alone load test of a small subsystem, then the generated load should come from outside the tested computer(s). Why? First, it takes resources to generate load, and you’d like a clean set of performance data from the tested system(s). Also, if you generate the load on the tested system, then you are not testing the network connections through which the real load will have to flow.

If you are doing an end-user load test, then the load should be generated outside your company and from the locations where your users live. Distance matters on the Internet.

Ease Of Use

The sales pitch for the load test tool will tend to focus on the beauty and the flexibility of how it displays results.  That’s all good, but you’ll spend a lot more time creating and debugging the load test than you will spend running and evaluating it. When selecting a load generation tool carefully note:

  • The ease with which you can create new transactions and modify existing ones.
  • The quality and clarity of diagnostic info you get back when transactions are failing.
  • How easily and rapidly the tool can schedule, stop, and restart tests. Load testing is a team sport. Making people wait for you, and the load testing tool, is never fun.
  • How close to real time the results are for transactions started and completed, transaction response time, and failure rate. You want the bad news as soon as possible, so you can stop, fix, and restart the test.

Money

Generating load costs money. More load, more money.

Budget for the testing you’ll have to do before the big load test, and plan to work through several failures where you have to stop the test, fix something, and restart it.

When To Load Test

When driving, it is best to start applying the brakes when you have enough time to easily avoid disaster. In load testing, it is best to start your efforts when you have enough lead-time to fix any performance problem you uncover. That amount of time is different for every situation as there are many things that will influence your decision as to when to get started.

All organizations have a pace at which they feel comfortable. This is especially true when spending money. A good first question to ask yourself, when it looks likely that you’ll have to spend to get through the next peak, is: “How long did it usually take between decision and delivery of the last few major IT purchases?” Every company is different, and I’ve personally seen behavior ranging from 30 days to almost two years. You need to do your work with enough lead-time to take this into account. Don’t get me wrong, change can happen faster than this, but it is just a lot more pleasant for everyone involved if you take the company’s normal pace into account.

There are many other factors that influence when to test. Weigh each one as you choose the best time to do your work.

  • The annual pre-peak hardware/software freeze
  • The anticipated dates for big infrastructure changes
  • The anticipated dates for the roll out of new websites, applications, or features
  • The last quarter of the fiscal year for key vendors
  • When money is available in your budget

As you can see from the list above, when taking all factors into account, there might not be an ideal time to do the load test. In that case, pick your battles and begin your work. With time, all things are possible. In general, earlier is better.



Load Testing: Creating and Validating The Load

To create a good load test you not only have to figure out what specific tasks the users are asking your system to do, but also the right mix of tasks (e.g. ten withdraw transactions for every deposit transaction), and what rate you want them delivered to emulate the peak load.

All of this will take some creative application performance metering and some discussions about which transactions to emulate. Let’s tackle these problems one at a time, but first a brief word about abandoning your quest for perfection.

Good Enough Is Just Fine

Give up on the idea of perfection.  There is no “perfect” in load testing, as the users are always changing their behavior and you will never emulate all the different transactions with all the possible user choices.  Why?

Unlike capacity planning, where you can do your work by yourself with a spreadsheet and some metering data, load testing costs money, often requires the help of others, and can be disruptive. Also, load testing is usually done at an hour that is inconvenient for everyone involved. All of this tends to push back hard against perfection.

You are looking for “good enough” to get the job done.   There are many choices you’ll have to make and guesses you’ll have to take, so how do you know the load test you’ve designed is good enough?

Test Validation

When you’ve designed and built your load test, run it at a normal everyday load and see if it works without errors and returns performance meter values that are similar to the ones you get on an average day.  This is known as test validation.

For example, at noon on a pre-peak day your computing world is handling 50 TX/sec, and a key machine is around 20% busy. At midnight that machine is almost idle. You are planning for an upcoming peak that is four times (4X) the noon peak. If your load test reasonably emulates the real user load, then when you run your load test at midnight, sending in 50 TX/sec, the key machine should show 20% busy, and the other key meters in your computing world should also resemble their noon time values.  If half your computing world is idle under this test, clearly you’ve got more work to do. If the numbers are close enough, then your test is good to go.  But, what’s close enough?

Begin with the peak in mind.  The peak you are planning for is 4X the normal noon load of 20% busy.  Any differences in the meters between the real user load and the midnight load test will be 4x greater at the peak, and that means little unimportant differences can become big significant differences. Let’s look at the numbers on three key meters…

[Table: Meter X, Y, and Z utilizations under the real noon load vs. the midnight load test]

Meter X matches up nicely. If we were emulating the real user load perfectly we’d expect meter Y and meter Z to match up nicely as well, but they don’t.

Meter Y is much busier at noon than during the midnight load test. Here it makes a difference, as we expect the peak to be 4X the measured day. So, just using basic capacity planning math, two things are clear:

  1. The resource watched over by Meter Y will bottleneck at peak as: 30% * 4 = 120%
  2. This is a difference that makes a difference, as Meter Y during the load test only showed a utilization of 20% busy and that works out to 20% * 4 = 80% at peak.  Meter Y tells us we have more work to do on this load test.

Meter Z is a little off between the noon and the midnight numbers, but this is a difference that you could live with because, even when you scale up the larger sample (10% * 4 = 40%), it’s clear this is not going to be a bottleneck.
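The same check can be scripted once you have the noon and midnight numbers side by side. In this sketch the Meter Y values are the ones above, while the Meter X readings and the midnight value for Meter Z are assumed examples filled in for illustration.

    # Scale both the real-load and the load-test readings to the 4X peak and
    # flag meters where the real load projects to a bottleneck.
    PEAK_MULTIPLIER = 4

    meters = {                  # name: (% busy at noon, % busy during midnight test)
        "Meter X": (20, 20),    # assumed example: matches up nicely
        "Meter Y": (30, 20),    # from the text: off by enough to matter
        "Meter Z": (10, 7),     # midnight value assumed; nowhere near a bottleneck
    }

    for name, (real, test) in meters.items():
        real_at_peak, test_at_peak = real * PEAK_MULTIPLIER, test * PEAK_MULTIPLIER
        note = "projected bottleneck" if real_at_peak >= 100 else "fine"
        print(f"{name}: real load projects to {real_at_peak}% at peak, "
              f"load test projects to {test_at_peak}% -> {note}")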

You can also do this checking with any meter that counts things of importance to you like packets, IO’s, thingamajigs, whatever. Once the results are close enough, you can trust that your load test will do a good job pushing your computing world as hard as the users will at peak. Now, let’s design a load test.

Selecting Transactions To Emulate

Your computing world handles many different types of transactions, but the bulk of the workload, and/or the bulk of the revenue, comes from just a few of them. It is also the case that many transactions have important differences for users but are computationally identical for your computing world – the bits flow through the same processes and consume the same resources.

Start building a list of transactions to emulate. First add the ones that make up the bulk of your workload. Then add any transactions that bring serious money into the company even if they are not all that numerous.

To a corporation, nothing is more important than money. Follow the money.
        – Bob’s Seventh Rule of Performance Work

Then add any transactions that have recently caused you trouble and are thus politically sensitive at this time.  Now look that list over and, if it makes things simpler, you can combine transactions that are computationally similar into a generic transaction – the buyX, buyY and buyZ transactions get grouped together into the generic buyStuff transaction.  This will typically leave you with a short list of transactions.

Scripting Transactions

Users are not identical robots typing the same things over and over at machine-like speeds.


Users are unique; they pause to consider and to choose, and they do different things. There are constraints (logical, legal, and practical) as to what your users can do: you can’t log in simultaneously from two different cities, you can’t withdraw with a zero balance, and you can’t put a trillion things in your shopping cart. Whatever generates the load for your load test has to be able to handle that. Look for load generation tools that have a way to:

  • Record a script of actions for each type of transaction you decide to emulate
  • React to information presented during the transaction, such as an item out of stock
  • Determine if the response indicates success or failure (“Deposit accepted” vs. “D603 Database error…”)
  • Add variability to that script by doing different things on each visit
  • Authenticate themselves so your security software will allow these transactions
  • Build think time between steps of a transaction to emulate the behavior of real users as they pause and consider

Scripting transactions takes work.  First you get it to work once, then you add variability into it (different users doing different things), and then you find security or reasonableness checks you have to either deal with or work around. For multi-step transactions (e.g., login, shop, buy), you need to test at each step to see if it was successful and to “report & abort” the transaction if it was not.
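As a rough illustration of what those scripts end up looking like, here is a hand-rolled sketch of a multi-step transaction in Python using the third-party requests package. The URLs, form fields, and error check are all hypothetical, and a real load generation tool supplies most of this machinery for you; the point is the shape: think time, variability, a per-step success check, and report & abort.

    import random
    import time
    import requests   # third-party package, used here only for illustration

    def think(low_secs=2, high_secs=8):
        time.sleep(random.uniform(low_secs, high_secs))   # emulate a user pausing to read

    def run_once(base_url, user, password, search_terms):
        s = requests.Session()
        steps = [
            ("login",  lambda: s.post(f"{base_url}/login",  data={"user": user, "pw": password})),
            ("search", lambda: s.get(f"{base_url}/search",  params={"q": random.choice(search_terms)})),
            ("buy",    lambda: s.post(f"{base_url}/cart/buy", data={"qty": 1})),
        ]
        for name, action in steps:
            resp = action()
            # Report & abort: check each step for success before moving on.
            if resp.status_code != 200 or "error" in resp.text.lower():
                print(f"ABORT: step '{name}' failed (HTTP {resp.status_code})")
                return False
            think()   # think time between steps, with some variability
        return True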

Typically, you work on one transaction at a time, testing it over and over until you are satisfied. Then you create and test the next transaction.  When you’ve got all the transactions in your load test working and tested individually, then you test them together, typically at low-power, to convince yourself that they all work together smoothly. Look for the results you see in the meters to match up nicely with the meters you get from the normal load generated by real users.

You Can Do This

All this work might sound overwhelming, but lots of regular people, no smarter than you, do it every day. The key is to begin somewhere, with some load generating tool and some transaction, and build from there. At Ben & Jerry’s scoop shops they sell something called a Vermonster: a sundae with 20 scoops of ice cream, hot fudge, bananas, cookies, brownies, and all of your favorite toppings. How do you eat such a thing? One spoonful at a time, with the help of others.


Load Testing

Why Load Test?

There is nothing quite as reassuring as watching your computing world smoothly handle a workload of synthetically generated transactions long before the real peak arrives.

A load test allows you to identify future bottlenecks, buy yourself time to fix things, and save money. You could just wait for the peak you are planning for to come naturally and hope for the best. If it works, then all is well. If not, then you have no time to fix anything, no time to test the fixes, and only the most expensive options remaining on the table. The company will have to throw money and hardware at the problem and endure significant risk with untested workarounds. If performance is bad, there will be additional costs as customers move to your competitors, and the call center spends a lot more time saying “We’re really sorry” to the customers who bother to call.

Capacity Planning Is The Hypothesis, Load Testing Is The Experimental Proof

Load testing can increase your confidence in your capacity planning efforts and show where you have way too much of a given resource, both of which can save your company serious money as you may be able to narrow the safety margin you use. Load testing can also help you tease apart the complex transaction mix generated naturally by the users so you can study the performance of each separate type of transaction.

No Perfection, But That Is OK

For those of you who believe that no load test can perfectly emulate your user load, I’m here to say you are absolutely correct. However, you can create load tests that are good enough to tell you lots of useful things about how your computing world responds to a heavy load. You learn something, and inch closer to the answers you need, with every load test you run… even the ones that fail miserably.

Lots More To Say…

Load testing is a big subject, so I’ll divide it over multiple future posts that focus on setting the goals, creating and validating the load, generating the load, when to test, and analyzing the results.



Hints On How To Meter Load Tests

Load testing is where you give your computing world an artificially generated workload to see how well it holds up.


To meter the test itself, look at your computing world with the normal meters that you run every day. During the test, look for two things.

  1. Did the load test look like a normal load from real users?
  2. Did the load test hit a problem?

Looking For Normal

You know what the resource utilization looks like for your computing world under a moderate load. Do those numbers look similar to what you see during a moderate simulated load?  If key resources are unusually idle or busy during the load test then the load test itself is not doing the best job of replicating a user-generated load. High precision here is not required. If the load test load is close, within 10-20% of the expected user-generated load, that will usually do. However, remember queuing effects. The higher the utilization, the more precise you have to be, because at high utilizations a small increase in utilization can cause a big increase in response time.

Load tests have goals they want to reach that are usually expressed in terms of number of things/sec the system has to handle. Unlike normal user load, a load test generated load starts, stops, and changes at your command. When metering this, you should adjust your metering frequency to capture multiple samples at key parts of the load test and adjust your meter start time so it’s nicely synchronized with the load test. In general, you want your meters to start sampling at the top of each minute and be running before, during, and for a while after the load test. The before and after data can help you determine if anything unusual was happening that might have skewed the results.
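A small sketch of that synchronization, assuming you are wrapping your own sampling loop around whatever meters you normally run. The collect_sample callback is a hypothetical stand-in for your real metering command, and the loop deliberately starts at the top of a minute and keeps sampling before, during, and after the test window.

    import time

    def sleep_until_next_minute():
        time.sleep(60 - (time.time() % 60))     # wake up at hh:mm:00

    def run_metering(total_minutes, collect_sample):
        # Start before the test, sample once a minute, keep going after it ends.
        sleep_until_next_minute()
        for _ in range(total_minutes):
            start = time.time()
            collect_sample(time.strftime("%H:%M:%S"))
            time.sleep(max(0, 60 - (time.time() - start)))   # stay aligned to the minute

    # Example: run_metering(90, lambda ts: print(ts, "collect vmstat/iostat output here"))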

If the load test achieved its goals, that’s great. Report how your computing world handled the load with a focus on any resource that looks like it is close to bottlenecking. Point out queues that are getting long, devices with a high utilization, resources that are close to running out of capacity, etc. Since load tests typically don’t simulate all transactions, they tend to under-utilize things. If some resource is close to a limit, you still might recommend adding more just to be safe.

If the load test failed to make its goals, that’s unfortunate. Report how your computing world handled the load with a focus on what bottlenecked during the test and what will bottleneck when the test eventually hits its goal load. How does that work?  Let’s look at the data.

The table below has the results of a load test. The goal was 800 TX/sec. Everything was fine at 200 TX/sec, but the test bottlenecked at 400 TX/sec. It is clear that device-Y (at 94% busy) will limit any further significant progress towards our goal.

[Table: device utilizations at 200 and 400 TX/sec; device-Y is 94% busy at 400 TX/sec]

However, to reach our goal of 800 TX/sec, our load test will eventually have to push twice the number of transactions per second through the system. Therefore it is reasonable to assume that every other device in the transaction path will have double the utilization. Take all the utilizations you measured at 400 TX/sec and double them to see what else you would run out of at 800 TX/sec.

[Table: utilizations measured at 400 TX/sec doubled to project the 800 TX/sec goal; device-X reaches 100% busy]

Device-X will be at 100% busy at your projected peak. That’s never a good thing. Noticing two problems with one load test is smart testing. It will save you time, money and embarrassment. I’d add more device-X and device-Y before I tried another load test.
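Here is the same projection as a short sketch. Device-Y’s 94% and the 800 TX/sec goal come from the example above; device-X’s 50% is implied by it hitting 100% at double the load, the device-Z reading is an assumed example, and the whole thing leans on the assumption that resource usage scales linearly with load.

    # Project utilizations measured at the bottleneck point up to the goal load.
    measured_tx, goal_tx = 400, 800
    busy_at_measured = {"device-X": 50, "device-Y": 94, "device-Z": 30}   # % busy (device-Z assumed)

    scale = goal_tx / measured_tx
    for device, busy in busy_at_measured.items():
        projected = busy * scale
        status = "will bottleneck before the goal" if projected >= 100 else "ok"
        print(f"{device}: {busy}% at {measured_tx} TX/sec -> "
              f"~{projected:.0f}% at {goal_tx} TX/sec ({status})")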

Looking For Problems

When metering a load test, sometimes there will seem to be a dramatic increase in efficiency where the load test is pushing a lot more transactions through your computing world with dramatically lower resource utilizations. I’m sorry to have to tell you, but this is always bad news.

Computer programs never suddenly get more efficient under a heavy load. What they do is start failing and sometimes it can be faster to fail than to succeed. Returning a simple error message is faster than providing a complex answer. Keep an eye on any reported errors and note any dramatic increase in errors as the load increases.


As Things Get Busy

As the user load increases from light to crazy-busy, there are three things I’ve noticed that generally tend to be true on every customer system I’ve ever worked on.

Resource Usage Tends to Scale Linearly

Once you have a trickle of work flowing through the system (which gets programs loaded into memory, buffers initialized and caches filled), it has been my experience that if you give a system X% more work to do, it will burn X% more resources doing that work. It’s that simple. If the transaction load doubles, expect to burn twice the CPU, do twice the disk IO, etc.

There is often talk about algorithmic performance optimizations kicking in at higher load levels; in theory that sounds good. Sadly, most development projects are late, the pressure is high, and the good intentions of the programmers are often left on the drawing board. Once the application works, it ships, and the proposed optimizations are typically forgotten.

Performance Does Not Scale Linearly

Independent parallel processing is a wonderful thing. Imagine two service centers with their own resources doing their own thing and getting the work done. Now twice as much work is coming, so we add two more service centers. You’d expect the response time to stay the same and the throughput to double. Wouldn’t that be nice?

The problem is that at some point in any application all roads lead to a place where key data has to be protected from simultaneous updates or a key resource is shared. At some point more resources don’t help, and the throughput is limited. This is the bad news buried in Amdahl’s Law – which is something you should read more about.

When All Hell Breaks Loose Weird Things Happen

At very high transaction levels many applications can suffer algorithmic breakdown when utterly swamped with work. For example, a simple list of active transactions works well under normal load when there are usually less than ten things on the list, but becomes a performance nightmare once there are 100,000 active transactions on the list. The sort algorithm used to maintain the list was not designed to handle that load. That’s when you see the throughput curve turn down.

[Chart: throughput curve turning downward at very high transaction levels]

This can happen when a source of incoming transactions loses connectivity to you and, while disconnected, it buffers up transactions to be processed later. When the problem is fixed, the transaction source typically sends the delayed transactions at a relentless pace, and your system is swamped until it chews through that backlog. Plans need to be made to take these tsunami-like events into account.

For example, I’ve seen this at banks processing ATM transactions, where normally the overall load changes gradually throughout the day. Now a subsidiary loses communication and then reconnects after a few hours. That subsidiary typically dumps all the stored ATM transactions into the system as fast as the comm lines will move them. Building in some buffering, so that all the pending transactions can’t hit the system at once, can be a smart thing to do here.

Sometimes the tsunami-like load comes as part of disaster recovery, where all the load is suddenly sent to the remaining machine(s).  Here your company needs to decide how much money they want to spend to make these rare events tolerable.

Oddly enough, there is a case where throughput can go up while response time is going down under heavy load. This happens when something is failing, and it is never a good thing. How does this happen? It is often faster to fail than it is to do all the work the transaction requires. It is much faster to say “I give up” than it is to actually climb the mountain.

[Chart: under heavy load, transactions completed jump while average response time drops]

In the graph above, you see the system under heavy load. When throughput (transactions completed) suddenly increases while the average response time drops, start looking for problems. This is a cry for help. The users here are not happy with the results they are receiving.

For More Information:

There are more insights, hints, tricks, and truisms that I gathered over my 25+ year career in performance work in my book: The Every Computer Performance Book

A short, occasionally funny, book on how to solve and avoid application and/or computer performance problems