A Thousand Thanks

Late last night some nice person bought the 1000th copy of my book. Wow!

I’d like to thank each and every one of my readers. When I started this book I did it to give back to the field of performance analysis. I’d hoped to sell 500 copies. Why 500? Since I can’t find everyone who might need to read this book, I decided to be happy with 500 as that is the number of students I’d have in an average teaching year when I was teaching the performance course at Stratus. Apparently I aimed too low.

The best part of this experience has been hearing from my readers. Their personal messages to me, comments on this blog, and recommendations of my book have been a joy to read. Yes, even the comment that kindly pointed out my typo in chapter eight where I wrote “pubic speaking” when I meant “public speaking.”

 

How To Fly

During my career I flew over a million air miles. Even though I’m just shy of seven feet tall, I mostly enjoyed the experience once I figured out a few key things about flying. I hope these insights help you.

Attitude is everything in flying

People are attracted to (and want to help) grateful, kind, and pleasant people. Think about your own life. When you have served others, what kind of person did you bend rules and go the extra mile for? When a problem happens, it is very rare that the ungrateful, unkind, and unpleasant person gets to their destination any faster than the kind person.

Consider the alternative

Regardless of how many things go wrong on a trip, flying is so much better than a bus. Until transporter technology is perfected, flying (even with all its hassles) is really your best high-speed choice for transport.

Air travel is like prison, but in a good way

Flying (especially after 9/11) is just about a total surrender of your civil rights and any illusion of control you might have. Realize that when you fly you make a trade: You surrender almost all your rights and they move you across the planet at over 500mph. It is only a good trade if you accept both sides of the deal. If you don’t, then don’t fly because many people will suffer as you hold up the security line.

All airlines are great and all airlines stink

Amongst the people I know who have flown more than a million miles they all have a favorite airline they LOVE and an airline that they HATE. The interesting thing is that they all love/hate different airlines. Even though collectively we have lots of data points (flights) there is no consensus. These love/hate feelings are often rooted in just a few good/bad incidents. On any given day a given airline is either awful or glorious.

Airline employees

Airlines are huge companies so don’t expect perfection from all 120,000 employees. For all large groups there is often a bell-shaped curve of performance. A few do wondrous work, the vast bulge in the middle do what is expected, and a few at the other end take sadistic pleasure in creating a private hell just for you. It’s a crapshoot who you will meet and if they are having a bad/good day.

  • Flight attendants have no power to change anything about the flight, but they do occasionally bring you an extra cookie.  Help them by staying in your seat during meal times and taking your seat quickly when asked to do so.
  • Ticket/Gate agents have almost no power to give you extra perks, but they do have the awesome power of not offering you the help you haven’t specifically asked for. They are the most yelled at employees of any airline. Never yell at them because a flight is delayed, canceled or otherwise screwed up. It is not their fault. Although on general principles I believe in treating people nice because it is the right thing to do, I can assure you that nice people are offered more choices and are sometimes upgraded. It pays to be genuinely nice.

Connections

Almost all flights connect. Choose your connecting flights so you have options and time. If possible, always avoid a connecting flight that is the last one of the day to your destination. When booking a flight, your various flight options are usually sorted so the connecting flight that leaves as soon as possible after your first flight arrives is at the top of the list. Often that is a tight (not much time to run from plane to plane) connection. Why sweat a tight connection when you can leave a little earlier? I personally like a four-hour connection and in the last 10 years of my business flying I never missed a connection. Not one.

It is always sunny at 35,000 feet

If you like to look out the window, the view is much better if you choose the seat that is on the “shady” side of the plane – where the sun is behind you. Think about the flight direction and time of day to figure that out. The “A” seats (as in seat 23A) are on the left side of the plane.

Leave early

If (as I have heard so many people loudly proclaim) missing this flight will make you miss some critical event (interview, meeting, wedding, birth, …) then you are a fool for cutting it that close. All airlines have three major partners that they have no control over: the Federal Aviation Administration, Homeland Security, and Mother Nature. If you are traveling for a once-in-a-lifetime, super-important reason, leave two days early. Three days early if it is your wedding. Really.

When trouble strikes

When you fly, and things are not going well, there are a few key facts-of-life you must understand to rationally evaluate what your options are.

  1. There is no spare plane. The cost of keeping a spare 100 million dollar plane sitting around is very high. Even at huge hub airports there is no spare plane and no spare crew waiting to fly it. If your plane breaks then either the passengers are spread out over other flights or the airline cancels some other flight and assigns the plane to your flight.
  2. Flight status displays lie right up to the last minute. The airlines typically show the flight status of your flight as “on time” right up to the last minute. To get a better idea of your probability of flying on time look at the arrivals monitor for the flight that lands at the gate your flight is scheduled to depart from. Typically that flight arrives about an hour before your departure. If it’s delayed…the probability that your departing flight will be delayed goes way up.
  3. Insignificant weather matters. When the airlines say “bad weather at the destination” is causing the delay, sometimes it is violent weather. But most times it is just the local conditions that lower the airport’s overall capacity to move airplanes:
    • Unusual wind conditions can force the airport to use a set of runways that has less capacity for takeoffs and landings.
    • Visibility can be just bad enough so that they have to switch to a different set of flight rules that either further spaces out takeoffs and landings or it prevents planes from simultaneously landing on parallel runways.
  4. Planes do not hurry. Once in the air, the captain of a delayed flight will often say something like “We will do what we can to make up time.” What they can do is basically nothing. Fuel is expensive and going just a little faster burns a lot more fuel. Also the difference between the cruise and max speeds, for several commonly used jet aircraft, is less than 10%. If you get in the air late, you will arrive late.

If you think about each one of the above facts of life you might see analogs in the computer performance work you do.

When you are stuck

When a huge storm shuts down the airport… accept your fate. The only rational thing to do is to ride the chaos with grace and style. Stay flexible, stay pleasant, and be helpful to others. Plan to convert this dreary experience into a great story. You can write the story of The Massive Airport Blizzard to read either:

  1. I yelled at dozens of people to no effect and got home two days late.
  2. I had some really interesting conversations, helped someone, made a new friend, and got home two days late.

It is your choice. Choose to be happy, because grumpy rarely works for anyone.


I also have many useful hints about doing computer performance work once you land at your destination in: The Every Computer Performance Book which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

Three Tools You Should Build

Although it is a good idea to keep an eye on performance all the time, lots of companies only allow you to pay periodic attention to performance. They focus on it when there is a problem, or before the annual peak, but the rest of the year they give you other tasks to work on.

This is a lot like my old job in Professional Services – A customer has a problem, I fly in, find the trouble, and then don’t see them until the next problem crops up.

To do that job I relied on three tools that I created for myself and that you might start building to help you work on periodic performance problems.

Three Tools

List All – The first tool would dig through the system and list all the things that could be known about the system: config options, OS release, IO, network, number of processes, what files were open, etc. The output was useful by itself as now I had looked in every corner of the system and knew what I was working on. Several times it saved me days of work as the customer had initially logged me into the wrong system. It always made my work easier as I had all the data I needed, in one place, conveniently organized, and in a familiar order.
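A List All tool can start very small. Here is a minimal sketch in Python, assuming only the standard library; the function names `list_all` and `format_report` and the handful of facts gathered are illustrative choices, not the original tool. The point is plain text output in a stable, familiar order.

```python
import os
import platform
import socket


def list_all():
    """Return a dict of basic facts about the system we are logged into."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "cpu_count": str(os.cpu_count()),
    }


def format_report(facts):
    """Render facts as 'key: value' text lines, always sorted the same way."""
    return "\n".join(f"{key}: {facts[key]}" for key in sorted(facts))


print(format_report(list_all()))
```

Checking the hostname line first is a cheap way to catch being logged into the wrong system. Each new fact you learn to gather is one more entry in the dict.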

Changes – If I’d been to this customer before, this tool allowed me to compare the state of the system with the previous state. It just read through the output of the List All I’d just done and compared it with the data I collected on my last visit. Boy, was this useful as I could quickly check the customer’s assurance that “Nothing had changed since my last visit.” I remember the shocked look on the customer’s face when I asked: “Why did you downgrade memory?”
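A Changes tool can be little more than a comparison of two saved reports. The sketch below assumes each report is plain text with one `key: value` fact per line (an assumed format, as are the names `parse_report` and `changes` and the sample data).

```python
def parse_report(text):
    """Turn 'key: value' lines into a dict, skipping lines without a colon."""
    facts = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            facts[key.strip()] = value.strip()
    return facts


def changes(previous_text, current_text):
    """List every key that was added, removed, or changed between two reports."""
    old, new = parse_report(previous_text), parse_report(current_text)
    found = []
    for key in sorted(set(old) | set(new)):
        if key not in old:
            found.append(f"added   {key}: {new[key]}")
        elif key not in new:
            found.append(f"removed {key}: {old[key]}")
        elif old[key] != new[key]:
            found.append(f"changed {key}: {old[key]} -> {new[key]}")
    return found


# Hypothetical reports from two visits to the same customer.
last_visit = "hostname: prod1\nmemory_gb: 64\nos_release: 5.4"
this_visit = "hostname: prod1\nmemory_gb: 32\nos_release: 5.4"
for line in changes(last_visit, this_visit):
    print(line)
```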

Odd Things – Most performance limiting, or availability threatening, behavior is easy to spot. But for any OS, and any application, there are some things that can really hurt performance that you have to dig for in odd places with obscure meters. These are a pain to look for and are rare, so nobody looks for them. Through the years as I discovered each odd thing, I would write a little tool to help me detect the problem and then I’d add that tool to the end of my odd things tool. I’d run this tool on every customer system I looked at and, on occasion, I would find something that surprised everyone (“You haven’t backed up this system in over a year.”) or solve a performance problem by noticing a foolish, less than optimal configuration choice.
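One way to structure an Odd Things tool is as a growing list of small check functions, each of which either returns a warning or stays quiet. This is a sketch under assumptions: the fact name `last_backup_epoch`, the one-year threshold, and all function names are hypothetical.

```python
import time

SECONDS_PER_DAY = 86400


def check_backup_age(facts, now=None):
    """Warn when the last backup is more than a year old (hypothetical fact name)."""
    now = time.time() if now is None else now
    last_backup = facts.get("last_backup_epoch")
    if last_backup is None:
        return "no backup time recorded"
    age_days = (now - last_backup) / SECONDS_PER_DAY
    if age_days > 365:
        return f"last backup was {age_days:.0f} days ago"
    return None


# Each newly discovered odd thing becomes one more function on the end of this list.
ODD_THING_CHECKS = [check_backup_age]


def odd_things(facts, now=None):
    """Run every check and return only the warnings that fired."""
    warnings = []
    for check in ODD_THING_CHECKS:
        message = check(facts, now)
        if message:
            warnings.append(f"{check.__name__}: {message}")
    return warnings
```

Because the checks are independent, adding a new one never risks breaking the old ones.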

With most everything happening on servers somewhere in the net/cloud these days, knowing exactly where you are and what you’ve got to work with is important. Being able to quickly gather that data in a matter of minutes allows you to focus on the problem at hand confident that you’ve done a thorough job.

All three of these tools were built slowly over time. Get started with a few simple things. The output of all three is just text – no fancy GUI interface or pretty plots are required. When you have time, write the code to gather the next most useful bit of information and add that to your tool.

Just like the old folk story of stone soup, your tools will get built over time with the contributions of others. Remember to thank each contributor for the gifts they give you and share what you have freely with others.


Other useful hints can be found in: The Every Computer Performance Book which is available at Amazon, B&N, or Powell’s Books. The e-book is on iTunes.


 

 

The Sample Length of The Meter

Any meter that gives you an averaged value has to average the results over a period of time. If you don’t precisely understand that averaging, then you can get into a lot of trouble.

The two graphs below show exactly the same data with the only difference being the sample length of the meter. In the chart below the data was averaged every minute. Notice the very impressive spike in utilization in the middle of the graph. During this spike this resource had little left to give.

In the chart below the same data was averaged every 10 minutes. Notice that the spike almost disappears as the samples were taken at such times that part of the spike was averaged into different samples. Adjusting the sample length can dramatically change the story.
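The effect is easy to reproduce with made-up numbers. The sketch below uses hypothetical per-minute utilization figures (not the data behind the charts) in which a six-minute spike straddles the boundary between two 10-minute samples.

```python
def average_over(samples, window):
    """Average consecutive groups of `window` samples (the last group may be short)."""
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical per-minute utilization (%): 20% busy with a 6-minute spike to 95%
# that straddles the boundary between two 10-minute samples.
per_minute = [20] * 7 + [95] * 6 + [20] * 7

print(max(per_minute))                 # peak seen with 1-minute samples
print(average_over(per_minute, 10))    # the same data in 10-minute samples
```

With these numbers the one-minute view peaks at 95% while both 10-minute samples come out at 42.5%, so the spike is split in half and diluted twice.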

Some meters just report a count, and you’ve got to know when that count gets reset to zero or rolls over because the value is too big for the variable to hold. Some values start incrementing at system boot, some at process birth.
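For a counter that rolls over, the difference between two readings can still be recovered if you know the width of the variable. This sketch assumes an unsigned counter of known size that rolled over at most once between readings; the name `counter_delta` and the 32-bit default are illustrative.

```python
def counter_delta(previous, current, bits=32):
    """Events between two readings of an unsigned counter that may have rolled over once."""
    modulus = 1 << bits
    return (current - previous) % modulus


print(counter_delta(100, 250))          # no rollover between readings
print(counter_delta(4294967290, 5))     # counter wrapped past 2**32
```

A counter that was reset to zero (for example by a reboot or a process restart) looks just like a rollover here, so it is worth also recording the boot or process start time with each reading.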

Some meters calculate the average periodically on their own schedule, and you just sample the current results when you ask for the data. For example, a key utilization meter is calculated once every 60 seconds and, no matter what is going on, the system reports exactly the same utilization figure for the entire 60 seconds. This may sound like a picky detail to you now, but when you need to understand what’s happening in the first 30 seconds of market open, these little details matter.

Below you will see a big difference in the data you collect depending on how you collect and average it. In the one-second average (red line) you are buried in data. In the one-minute average (sampled in the yellow area) you missed a significant and sustained peak because of when you sampled. The 10-minute average (sampled in the green area) will also look reassuringly low because it averages the peaks and the valleys.

Take the time, when you have the time, to understand exactly when the meters are collected and what period they are averaged over. The best way to do that is to meter a mostly idle system and then use a little program to bring a load onto the system for a very precise amount of time and see what the meters report. The better you understand your tools, the more precisely and powerfully you can use them.
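The "little program" can be as simple as a busy loop with a deadline. Here is a minimal sketch, assuming one busy CPU core is the load you want; the names `busy_for` and `run_burst` are illustrative.

```python
import time


def busy_for(seconds):
    """Keep one CPU core busy for about `seconds` of wall-clock time, then return."""
    deadline = time.monotonic() + seconds
    spins = 0
    while time.monotonic() < deadline:
        spins += 1          # trivial work; the loop itself is the load
    return spins


def run_burst(seconds):
    """Run the load and print wall-clock start and end times to compare with the meters."""
    started = time.time()
    busy_for(seconds)
    ended = time.time()
    print(f"load from {time.strftime('%H:%M:%S', time.localtime(started))} "
          f"to {time.strftime('%H:%M:%S', time.localtime(ended))}")
    return ended - started
```

Calling something like `run_burst(30)` on an otherwise idle system, while the meters are running, shows how a known 30-second load is reported: which samples it lands in, and how much it is diluted by the averaging.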


This hint and many others are in: The Every Computer Performance Book which is available at Amazon, Powell’s Books, and on iTunes.


 

Thank You

I’d like to take a moment and thank the nice people who have bought my book. For the first time today (July 5, 2014) the book cracked Amazon’s Top 100 books in computer science. I am deeply grateful.

Working A Little Harder

Sometimes you have to work a little harder than you’d like to find the data you need to solve a performance problem or answer a performance question. Sometimes you have no elegant tool and just have to jury-rig some ugly collection of hacks to get what you need.

 

The photographer pictured above was not comfortable, safe, or delighted to have this assignment at that moment. However, I bet to the end of his days he told this story with great pride. If you don’t have everything you need in its most convenient place, just at the right time… do what you can, with what you got. Yeah, it’s a pain, but it is also the start of a great story.


For more specific hints on getting performance work done see: The Every Computer Performance Book at Amazon, Powell’s Books, and on iTunes.


No Bad Surprises In Public

Never plan to surprise the person responsible for a problem in a public meeting. The goals of performance work are measured in response time and throughput, not in how much drama you create when you point your accusing finger at the unsuspecting culprit.

When you locate a problem, the first person you should find is the person who is responsible for that part of the computing world, and discuss that problem with him or her. Why? That person may know a lot more about that part of your computing world than you do, and may have further insights as to the root cause and the reason(s) why things are done this way. Often, I find that when I privately share my concerns and ask for help in crafting a list of possible solutions, that person is quite willing to be helpful.

I have made the mistake of not involving the person I believed was responsible for the problem and have suffered these consequences, usually in this exact order:

  1. The person responsible for that part of the computing world got angry and defensive and worked relentlessly to tear down my work and credibility.
  2. That person pointed out my ignorance and further pointed out that the real problem was caused by some other part of the computing world owned by a different person. Now there were two angry people in the room.
  3. Then the manager became angry with me for creating tension among the staff.

It always works better when I talk to the responsible person privately well before I write up my recommendations. We look at the problem and explore solutions. Then I can walk into the meeting and say something like: “The problem is in this subnet. With the help of your networking guru, we have a few ideas on how to improve the situation.”


For more hints on presenting your work see: The Every Computer Performance Book at Amazon, Powell’s Books, and on iTunes.


 

Designing A Performance Presentation

This is a collection of hints on presenting performance results that have worked for me throughout the years as I’ve presented my results to both friendly and skeptical audiences of managers, technical staff, and executives all the way up to the CIO/CEO level.  This is not generic advice on public speaking. You can find that elsewhere.

To Reveal The Future…

When presenting your results, in many ways you are like the Crystal Seer. Perhaps the turban and the crystal ball would be a little over the top for your presentation in conference room three, but overall this is not a bad metaphor.

When doing performance work, you are uncovering a hidden truth few can see, and predicting the future.

We have all seen a poorly explained truth go down in flames and a beautifully told lie carry the day. If the inmates are running the asylum where you work, then they are most likely very good at presenting their very bad ideas.

How clearly and convincingly you present your results determines how successful you are.

Proof

As Carl Sagan once said, “Extraordinary claims require extraordinary evidence.” Look at your results and conclusions and ask yourself how your audience will react.

The more disruptive, shocking, or expensive your conclusions and recommendations are, the more backup data you need and the more effort you want to expend in making an airtight case. If you are claiming bacon is good for you, then you will have an easier time with the National Pork Producers Board than with a group of vegan cardiologists.

However, just because you have 30 backup slides for your shocking revelation, doesn’t mean you need to show them all.  Pay attention to your audience. Once you’ve convinced them, forget the remaining 24 backup slides, and move on to your next point.

The Nature of Truth

When preparing to present your report, there can be tremendous pressure to lie. Your work may help justify a purchase everyone wants to make or force unpleasant changes that no one wants to endure. The politics can get very serious.

First and foremost, stick tightly to the data you collected. It is the truth. Everything you do, say, and recommend flows from it. Never change that data. Never cherry pick the “good” numbers. Never ignore the bad numbers. If the powers that be order you to change that data, then start looking for another job because this is not the place anyone wants to work.

Be open to other interpretations of the data. If they do not violate the laws of physics, or performance, they may be valid. A device being 50% busy is a fact. What that fact means depends on the question at hand and the business realities that you have to live within. I’ve done performance work at companies where 30% busy on a peak day was a crisis and others where 95% busy was the norm.  Both companies were doing wildly different things with their machines, but they, and their customers, were quite content with the performance they were getting.

Simplify

You’ve done weeks of monitoring, calculation, and testing, and now you’ve got to explain your work to people who have been (for the most part) blissfully ignorant of your efforts and struggles. There is the natural tendency to show the detail and talk at length about how hard you worked. Don’t do that.

It takes an incredible amount of ingenuity, work, skill, and craftsmanship to lift a raw diamond out of the Earth and craft it into a sparkling gem. The same is true of your work on this performance project. In both cases, the end product is prized for its clarity. That clarity comes from the internal structure, the lack of flaws, and the raw material you discarded. When writing and presenting be a minimalist.

You will be presenting to people who have natural limits. Most people, as a rule, are not that good at holding several numbers in their head simultaneously. People also have a finite ability to give a damn about what you are saying.  When you exceed that limit, they stop listening, even if you are explaining how to make perfect $20 bills on a laser printer. What follows are some goals to strive for when crafting a presentation.

Eliminate anything extraneous, as every new thing takes energy to understand. For example, your system might have 15 tuning parameters, but when only three of them matter to this question, put the rest of them (if you include them at all) in an appendix. This gets them out of the main flow of your presentation, and yet it shows you did your due diligence.

Make sure that each point you make requires your audience to remember no more than two numbers at the same time. Having a nicely designed table of numbers is fine, as no memorization is required.

Every new and unfamiliar term you introduce is one more plate they have to mentally keep spinning as you are building your next point.

Use consistent terms and introduce the fewest new terms possible. Call a “dog” a “dog” all the way through your presentation.

Since graphs are a key element of most performance presentations, do your audience a favor and label your graphs consistently. Put a title and a legend on each graph, and put them in the same place. Label the X and Y axes. Strive to use a common unit (bits vs. bytes) in all your graphs. Use consistent colors so they quickly learn, for example, that the metered values are always blue, the projected values are always red, and the theoretical limit line is always black.

Lastly, establish a pattern in your presentation so people know what to expect. For example, imagine you had capacity planned the performance of a key computer at a future peak. In your presentation, for each subsystem of that computer, show where you are now, then show the projected peak, then state if this will be a problem, and lastly describe any proposed changes to work around the problem. People like a repeating pattern of information in a presentation. They find this comforting and an aid to overall understanding.

The Invisible Presentation

If your audience has trouble seeing what you are presenting, then it is harder for them to understand your wisdom.

Use a big font (24-30 point), that is easy to read (no funky fonts), and has high contrast (black letters on a white background).  Save the small fonts for your written summary, and avoid colored fonts on a colored background like the plague.

In most meetings a significant part of the audience can’t see the bottom 25% of your slide due to the people in front of them.  Put the good stuff at the top of the slide. Reserve the bottom for things that are less important to the question at hand, like the page number or the snazzy corporate artwork.

Depending on where you live, seven to nine percent of men and 0.4% of women are red/green colorblind, so be sure not to make these two the most critical colors on your graphs. Also, everyone is colorblind when looking at a black and white printed copy of your presentation.


For more info on performance work see: The Every Computer Performance Book at Amazon, Powell’s Books, and on iTunes.


 

Money Changes Everything

Sometimes performance work is all about the customer: “What do we need to handle the seasonal peak with reasonable response time?” Sometimes performance work is all about the money: “Can we cut 30% of our IT budget?” When it is all about the money, you have to have some financial numbers to work with that everyone agrees upon. Specifically, you need a target amount to save and the relevant costs of the major pieces of hardware in your computing world. Get those financial numbers first, and make sure that everyone is in agreement on them. Now go do your performance work.

For a money-centered question, design your talk to lead with the money, because that is what your audience is focused on. Then talk about what is possible and what, if any, pain will result. For example:

  • The goal was to cut the IT hardware budget by 30%. That can be done, and 11 months out of the year all will be well. However, at your seasonal peak, my data predicts horrible response times.
  • I believe you can save $250K by making the following changes with no change to your average response times.

The Real Cost of Bad Performance

When looking at the cost of lost business, it can be useful to look at the lifetime value of a customer, not just the cost of “losing” X transactions.

For example: Imagine a grocery store refuses a return on a bad can of peas. It saved $1.29 by doing so. However, if that customer buys $200 of groceries a week, 50 weeks a year, and lives near that store for five years, then the lifetime value of that customer is about $50,000. What is the real cost of losing a customer due to slow response time issues?
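The arithmetic behind that example fits in a few lines. This sketch uses the numbers from the grocery store story; the function name `lifetime_value` is just an illustration, and you would substitute your own figures.

```python
def lifetime_value(spend_per_week, weeks_per_year, years):
    """Total revenue expected from one customer over the time they stay with you."""
    return spend_per_week * weeks_per_year * years


value = lifetime_value(200, 50, 5)      # the grocery customer in the example
refund_refused = 1.29                   # what the store "saved"
print(f"lifetime value: ${value:,}")
print(f"revenue put at risk per dollar saved: ${value / refund_refused:,.0f}")
```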

Hints On How To Meter Load Tests

Load testing is where you give your computing world an artificially generated workload to see how well it holds up.

To meter the test itself look at your computing world with your normal meters that you run every day. During the test, look for two things.

  1. Did the load test look like a normal load from real users?
  2. Did the load test hit a problem?

Looking For Normal

You know what the resource utilization looks like for your computing world under a moderate load. Do those numbers look similar to what you see during a moderate simulated load?  If key resources are unusually idle or busy during the load test then the load test itself is not doing the best job of replicating a user-generated load. High precision here is not required. If the load test load is close, within 10-20% of the expected user-generated load, that will usually do. However, remember queuing effects. The higher the utilization, the more precise you have to be, because at high utilizations a small increase in utilization can cause a big increase in response time.
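To see why precision matters more at high utilization, here is a sketch that assumes the simplest textbook model, a single-server queue where average response time is service time divided by (1 - utilization). That model and the 10 ms service time are assumptions for illustration; your system will differ in detail but not in shape.

```python
def response_time(service_time, utilization):
    """Average response time for a simple single-server queue: R = S / (1 - U)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)


# Hypothetical 10 ms service time; compare a 5-point miss at low and at high load.
for low, high in [(0.30, 0.35), (0.90, 0.95)]:
    before = response_time(0.010, low)
    after = response_time(0.010, high)
    print(f"{low:.0%} -> {high:.0%}: {before * 1000:.1f} ms -> {after * 1000:.1f} ms")
```

Under this model, being five points off at 30% busy changes the response time by about a millisecond, while the same miss at 90% busy roughly doubles it.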

Load tests have goals they want to reach that are usually expressed in terms of number of things/sec the system has to handle. Unlike normal user load, a load test generated load starts, stops, and changes at your command. When metering this, you should adjust your metering frequency to capture multiple samples at key parts of the load test and adjust your meter start time so it’s nicely synchronized with the load test. In general, you want your meters to start sampling at the top of each minute and be running before, during, and for a while after the load test. The before and after data can help you determine if anything unusual was happening that might have skewed the results.
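Starting the meters at the top of a minute is a small calculation. The sketch below works on seconds since the epoch; the function names and the 10-second interval in the example are illustrative assumptions.

```python
import time


def seconds_until_next_minute(now=None):
    """Seconds to wait so the next sample lands exactly on a minute boundary."""
    now = time.time() if now is None else now
    return 60 - (now % 60)


def sample_times(start, interval, count):
    """Planned sample timestamps, starting at the first minute boundary after `start`."""
    first = start + seconds_until_next_minute(start)
    return [first + i * interval for i in range(count)]


# Example: if it is 5 seconds past the minute, wait 55 seconds, then sample every 10.
print(seconds_until_next_minute(125))
print(sample_times(125, 10, 3))
```

Computing each sample time from the first one, rather than sleeping a fixed interval after each sample, keeps the schedule from drifting during a long test.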

If the load test achieved its goals, that’s great. Report how your computing world handled the load with a focus on any resource that looks like it is close to bottlenecking. Point out queues that are getting long, devices with a high utilization, resources that are close to running out of capacity, etc. Since load tests typically don’t simulate all transactions, they tend to under-utilize things. If some resource is close to a limit, you still might recommend adding more just to be safe.

If the load test failed to make its goals, that’s unfortunate. Report how your computing world handled the load with a focus on what bottlenecked during the test and what will bottleneck when the test eventually hits its goal load. How does that work?  Let’s look at the data.

The table below has the results of a load test. The goal was 800 TX/sec. Everything was fine at 200 TX/sec, but the test bottlenecked at 400 TX/sec. It is clear that device-Y (at 94% busy) will limit any further significant progress towards our goal.

[Table: load test results at 200 TX/sec and 400 TX/sec]

However, to reach our goal of 800 TX/sec, our load test will eventually have to push twice the number of transactions per second through the system. Therefore it is reasonable to assume that every other device in the transaction path will have double the utilization. Take all the utilizations you measured at 400 TX/sec and double them to see what else we would run out of at 800 TX/sec.

[Table: utilizations measured at 400 TX/sec, doubled to project 800 TX/sec]

Device-X will be at 100% busy at your projected peak. That’s never a good thing. Noticing two problems with one load test is smart testing. It will save you time, money and embarrassment. I’d add more device-X and device-Y before I tried another load test.
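The doubling step can be sketched as a simple linear projection. Device-Y's 94% comes from the text and device-X's 50% is implied by its 100% projection; device-Z and the function name `project_utilization` are made up for illustration.

```python
def project_utilization(measured, measured_rate, target_rate):
    """Scale each device's utilization linearly with the transaction rate."""
    scale = target_rate / measured_rate
    return {device: busy * scale for device, busy in measured.items()}


# Percent busy at 400 TX/sec. Device-Z is a hypothetical extra device.
at_400 = {"device-X": 50, "device-Y": 94, "device-Z": 20}
at_800 = project_utilization(at_400, 400, 800)
for device, busy in sorted(at_800.items()):
    flag = "  <-- out of capacity" if busy >= 100 else ""
    print(f"{device}: {busy:.0f}% busy{flag}")
```

A projected value over 100% is not a prediction of what the meter will read; it is a signal that the device cannot carry the goal load as configured.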

Looking For Problems

When metering a load test, sometimes there will seem to be a dramatic increase in efficiency where the load test is pushing a lot more transactions through your computing world with dramatically lower resource utilizations. I’m sorry to have to tell you, but this is always bad news.

Computer programs never suddenly get more efficient under a heavy load. What they do is start failing and sometimes it can be faster to fail than to succeed. Returning a simple error message is faster than providing a complex answer. Keep an eye on any reported errors and note any dramatic increase in errors as the load increases.
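A quick scan of the load test steps can flag this pattern. The sketch below assumes you have recorded throughput, a utilization figure, and an error count for each step; the field names, thresholds, and sample data are all hypothetical.

```python
def error_rate(step):
    """Errors per transaction for one load step (0.0 when there was no load)."""
    return step["errors"] / step["tx_per_sec"] if step["tx_per_sec"] else 0.0


def suspicious_steps(steps, error_jump=2.0, min_rate=0.01):
    """Flag load steps where throughput rose but utilization fell or errors jumped."""
    flagged = []
    for prev, cur in zip(steps, steps[1:]):
        if cur["tx_per_sec"] <= prev["tx_per_sec"]:
            continue
        util_fell = cur["cpu_busy"] < prev["cpu_busy"]
        errors_jumped = error_rate(cur) > max(error_jump * error_rate(prev), min_rate)
        if util_fell or errors_jumped:
            flagged.append(cur["tx_per_sec"])
    return flagged


# Hypothetical load test steps.
steps = [
    {"tx_per_sec": 200, "cpu_busy": 30, "errors": 0},
    {"tx_per_sec": 400, "cpu_busy": 58, "errors": 1},
    {"tx_per_sec": 600, "cpu_busy": 35, "errors": 240},
]
print(suspicious_steps(steps))
```

In this made-up data the 600 TX/sec step is flagged: more work with less utilization, and an error rate that jumped from a fraction of a percent to 40%.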

This post is from my book The Every Computer Performance Book, which you can find on Amazon and iTunes.