| Management resists, the guerrilla planner retreats. |
| Management dithers, the guerrilla planner proposes. |
| Management relents, the guerrilla planner promotes. |
| Management retreats, the guerrilla planner pursues. |
Copying someone else's apparent success is like cheating on a test. You may make the grade this time, but how far will the bluff take you into the future?
Best practices are often derived from financial accounting attempts to cut costs by removing (ignoring) variance in the relevant processes and prescribing a one-size-fits-all solution, even though it's not described that way. It's always interesting to compare the typical multiplicity of bests in best practices.
A recent NPR program discussed how this approach has backfired in the arena of medicine and healthcare. The analogy is appropriate because many aspects of performance analysis are not too different from the science of medical diagnosis.
See the section on The Limits Of Best Practices. See also "10 Practices of Highly Ineffective Software Developers." I would add the following (which actually did happen):
Translation: We performance weenies need more whistles and less bells. In other words, virtualization vendors need to make sure they provide us with backdoors and peepholes so we can measure how resources are actually being consumed in a virtualized environment.
Corollary: Can you say, transparency? It's better for the IT support of business if we can manage it properly. To manage it, it can't be illusory.
Then, explain the multi-billion dollar dietary-supplements industry?
It's not what you sell, but how you sell it. Recent studies* also suggest that it may be harder to sell prevention in English. The more clearly a language distinguishes the future (tense), the less likely the speaker is to feel threatened by what will happen there. Roughly put: if it's not happening to me now, it won't happen to me.
* The Effect of Language on Economic Behavior: Evidence from Savings Rates, Health Behaviors, and Retirement Assets
"I can understand people being worked up about safety and quality with the welds," said Steve Heminger, executive director ... "But we're concerned about being on schedule because we are racing against the next earthquake."
This is a direct quote from a Caltrans executive manager for the new Bay Bridge construction between Oakland and San Francisco. He is saying that Caltrans management decided to ignore the independent assessment of the welding quality in order to stay on schedule. Yikes! See mantra 1.7. Although he is not an IT manager, the point about BRisk management is the same. You can read more background on this topic on my blog.
Until you read and heed this statement, you will probably have a very frustrating time getting your performance analysis conclusions across to management.
See mantra 1.6. There, a section of the upper deck of the current Bay Bridge collapsed during the Loma Prieta earthquake of 1989. Now, the Caltrans manager is watching the clock and concluding that it's better to increase the risk that the new bridge will fail (by being brisk about weld inspections), in order to beat the much lower risk that the current bridge might fail again in some unpredictable future quake. Substitute your favorite IT project, product or application for the word bridge and you get the idea.
Cap and Perf management can rightly be regarded as just a subset of systems management, but the infrastructure requirements for successful capacity planning (both the tools and knowledgeable humans to use them) are necessarily out of proportion with the requirements for simpler systems management tasks like software distribution, security, backup, etc. It's self-defeating to try doing capacity planning on the cheap.
Think about it. Performance analysis is a lot like a medical examination, and medical Expert Systems were heavily touted in the mid 1980's. You don't hear about them anymore. And you know that if it worked, HMO's would be all over it. It's a laudable goal but if you lose your job, it won't be because of some expert performance robot.
Today, there is a serious need to squeeze more out of your current capital equipment.
The planning part of capacity planning requires making predictions. Even a wrong prediction is useful because it can serve as a warning that either:
- the understanding which underlies your prediction is wrong
- the measurement process is broken and is producing wrong data
Either way, something needs to be corrected, but you wouldn't realize that without making a prediction in the first place. If you aren't iteratively correcting predictions throughout a project life-cycle, you will only know things are amiss when it's too late! GCaP says you can do better than that.
If your local network is out of bandwidth, has interminable latencies, or is otherwise glitching, don't bitch about the performance of your application. Get the LAN fixed, first. Then we'll talk about the performance of your application.
In case you're wondering, those are REAL data and yes, the axes are correctly labeled. I'll let you ponder what is wrong with these measurements. Here's a hint: They're so broken, they're not even wrong! Only if you don't understand basic queueing theory, would you accept measurements like these (which the original performance engineer did).
Most people remain blissfully unaware of the fact that ALL measurements come with errors, both systematic and random. An important capacity planning task is to determine and track the magnitude of the errors in your performance data. Every datum should come with a "±" attached (which will then force you to put a number after it).
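As a minimal sketch of putting a number after the ±, assume you have a handful of repeated measurements of the same quantity. The sample values and the helper name `with_error` below are made up for illustration, and the standard error of the mean only captures the random part of the error:

```r
# Hypothetical repeated measurements of CPU busy (%), for illustration only
busy <- c(23, 21, 26, 22, 24, 25, 20)

# Report the sample mean together with its standard error.
# This estimates random error only; systematic error needs separate scrutiny.
with_error <- function(x) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))
  list(mean = m, se = se, text = sprintf("%.1f \u00b1 %.1f", m, se))
}

est <- with_error(busy)
print(est$text)
```

Tracking `est$se` over time, alongside the mean, is one simple way to see whether the quality of the data is drifting.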
Data comes from the Devil, only models come from God. Corollary: That means it's helpful to be able to talk to God. But God, she does babel, sometimes. :)
Western culture too often glorifies hours clocked as productive work. If you don't take time off to come up for air and reflect on what you're doing, how are you going to know when you're wrong?
| Q = X R | (1) |
| U = X S | (2) |
I use it almost daily to cross-check that throughput and delay data are consistent, no matter whether those data come from measurements or models. More details about Little's law can be found in Analyzing Computer System Performance with Perl::PDQ. Another use of Little's law is calculating service times, which are notoriously difficult to measure directly. See also Mantras 2.10 and 2.12.
And here's the lore behind Little's Law.
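The kind of cross-check described here might look like the following sketch. It assumes the form Q = X R of Little's law; the function name `littles_check`, the 10% tolerance, and the sample numbers are all illustrative assumptions rather than anything prescribed by GCaP:

```r
# Cross-check reported metrics against Little's law, Q = X * R.
# X: throughput (requests/s), R: residence time (s), Q: mean number in system.
littles_check <- function(X, R, Q, tol = 0.10) {
  Q_pred  <- X * R                      # what Little's law says Q should be
  rel_err <- abs(Q - Q_pred) / Q_pred   # relative discrepancy
  list(Q_pred = Q_pred, rel_err = rel_err, consistent = rel_err <= tol)
}

# Hypothetical data: 50 req/s, 0.2 s residence time, 10.4 requests in system
chk <- littles_check(X = 50, R = 0.2, Q = 10.4)
print(chk$consistent)
```

If `consistent` comes back FALSE, either one of the three metrics was measured incorrectly or they were not collected over the same interval.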
The bigger the symmetric multiprocessor (SMP) configuration you purchase, the busier you need to run it. But only to the point where the average run-queue begins to grow. Any busier and the user's response time will rapidly start to climb through the roof.
In any collection of processes (whether manufacturing or computing) the bottleneck is the process with the longest service demand. In any set of processes with inhomogeneous service time requirements, one of them will require the maximum service demand. That's the bottleneck. If you reduce that service demand (e.g., via performance tuning), then another process will have the new maximum service demand. These are sometimes referred to as the primary and the secondary bottlenecks. The performance tuning effort has simply re-ranked the bottleneck.
For an even deeper perspective, see my blog post: Characterizing Performance Bottlenecks.
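The re-ranking can be sketched in a few lines of R. The service demands and resource names below are hypothetical, and the throughput bound 1/max(D) is the standard queueing-theory bound rather than something stated in this mantra:

```r
# Hypothetical service demands (seconds per transaction) at each resource
D <- c(cpu = 0.030, disk = 0.045, net = 0.012)

# The bottleneck is the resource with the largest service demand
bottleneck <- function(D) names(D)[which.max(D)]

primary <- bottleneck(D)
X_max   <- 1 / max(D)             # throughput cannot exceed 1 / D_max

# Tune the primary bottleneck (here: halve its demand) and re-rank
D_tuned <- D
D_tuned[primary] <- D[primary] / 2
secondary <- bottleneck(D_tuned)  # the former runner-up is now the bottleneck
```

With these numbers the disk is the primary bottleneck, and after tuning it the CPU takes over that role.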
The purpose of competitive benchmarking a computer system is to beat everyone else on performance, so you can say "mine is bigger than yours." It's the IT equivalent of war! Benchmark run-rules were made to be broken or at least bent; just don't get caught. This must be true because industrial benchmark organizations like SPEC.org and TPC.org have technical review committees that look for cheating in submitted results. For capacity planning and system sizing, you need to be aware of this fact of life and look for the loopholes in published benchmark results. Here, competitive refers to industrial benchmarks, as opposed to benchmarking that you might do for purely internal comparisons or diagnostic purposes.
"What we've got here is failure to communicate." —Prison warden in the movie Cool Hand Luke.
The purpose of your capacity planning presentation is to communicate the findings of your analysis. No doubt, it takes a long time to do the analysis correctly. So, the last thing you want is a FAIL.
"Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte." —Blaise Pascal (Dec 4, 1656). Translation: I have made it [this letter] longer because I have not had time to make it shorter.
When consolidating applications, it's not the migration that's the problem, it's the often mistaken motivation. The fallacious reasoning goes like this. Since there are a slew of servers that are only running at 10% busy, migrate those apps onto fewer (virtualized) servers. Naturally, the target server capacity will be better utilized but the application response time (SLA) has now gone out the window.
When you remove the necessary mips, the users will come ... and complain.
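To put rough numbers on this, here is a sketch that assumes each server can be approximated as a single M/M/1 queue, so that R = S / (1 - ρ). That queueing model, the service time, and the choice of eight applications are my assumptions for illustration only:

```r
# Mean response time of an M/M/1 queue: R = S / (1 - rho)
resp_time <- function(S, rho) {
  stopifnot(rho >= 0, rho < 1)
  S / (1 - rho)
}

S <- 0.1                                   # hypothetical service time, seconds
R_before <- resp_time(S, rho = 0.10)       # one app on its own server, 10% busy
R_after  <- resp_time(S, rho = 8 * 0.10)   # eight such apps on one equal-speed server
stretch  <- R_after / R_before             # how much longer users now wait
```

Under these assumptions utilization improves from 10% to 80%, but the response time is stretched by a factor of 4.5.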
Control over the performance of hardware resources e.g., CPUs and disks, is progressively being eroded as these things simply become commodity black boxes viz., multicore processors and disk arrays. This situation will only be exacerbated with the advent of Internet-based application services. Software developers will therefore have to understand more about the performance and capacity planning implications of their designs running on these black boxes. (See Sect. 3)
Thanks to the Puritans, presumably, American corporate culture is obsessed with the false notion that being busy is being productive. Wrong! Europeans (especially the Mediterraneans) understand the power of the cat nap. After nearly 400 years, it's time for America to get over it.
"Creativity is the residue of time wasted." —A. Einstein
Who works the longest hours?
Data come from the OECD (Organisation for Economic Co-operation and Development). Developing countries often work longer hours, but working longer doesn't necessarily mean working better. Interestingly, the USA is in the middle along with Australia.
A number of recent books and presentations on performance analysis and capacity planning have appeared with "The Art of..." in the title. In itself, this is not new. The title of Raj Jain's excellent 1991 book is The Art of Computer Systems Performance Analysis. Nonetheless, they all resort to various scientific techniques to analyze performance data. The application of science inevitably involves some art.
The ITIL framework is all about defining IT processes to satisfy business needs, not their implementation. That's what makes GCaP training an excellent IT-business solution.
If you do performance management perfectly, you run the risk of becoming invisible and therefore expendable to your management.
In other words, having successfully supported performance management, a manager could eventually feel justified in asking: "Why is my budget paying for performance management when everything is performing perfectly?" AKA a career-limiter. Compare this with software development, for example. If a developer does their job perfectly, they risk being overburdened with more work than they can handle (AKA job security). In contrast to performance management, a manager might be heard to say: "We must have this new functionality in the next release!"
Quantum transitions in energy are only associated with the discrete spectrum of atomic or molecular bound states.
Try to avoid communication nonsensica (GMantra 1.31).
That last one was tweeted and ended up as entry #136 in David Pogue's Twitter Book:
My most recent favorite is this one. WAYNE SWAN (politician): "I will not rule anything in or rule anything out." Ruling out, I get: take a ruler and draw a line through it. But how do you rule something in!? And Mr. Swan didn't invent it. He's just mindlessly repeating it because he heard other boffins say it, and I'm quoting him because it was captured in a transcript. (Double jeopardy)
A common situation in big organizations is for various groups to be responsible for the performance evaluation of the software or hardware widgets they create. In principle, this is a good thing, but there is a downside. Over-zealous performance optimization of any subsystem can deoptimize the entire system. To avoid this side-effect, a separate central group should be responsible for oversight of total system performance. They should act as both reviewers and gatekeepers for the performance analyses produced by all the other groups in the organization.
| Nines | Percent | Downtime/Year | σ Level |
| 4 | 99.99% | 52.596 minutes | 4σ |
| 5 | 99.999% | 5.2596 minutes | - |
| 6 | 99.9999% | 31.5576 seconds | 5σ |
| 7 | 99.99999% | 3.15576 seconds | - |
| 8 | 99.999999% | 315.6 milliseconds | 6σ |
downt <- function(nines, tunit = c('s','m','h')) {
  tunit <- match.arg(tunit)   # falls back to seconds if no unit is given
  # Seconds of downtime per year, taking a year as 365.25 days
  ds <- 10^(-nines) * 365.25*24*60*60
  if(tunit == 's') { ts <- 1; tu <- "seconds" }
  if(tunit == 'm') { ts <- 60; tu <- "minutes" }
  if(tunit == 'h') { ts <- 3600; tu <- "hours" }
  return(sprintf("Downtime per year at %d nines: %g %s", nines, ds/ts, tu))
}
> downt(5,'m')
[1] "Downtime per year at 5 nines: 5.2596 minutes"
> downt(8,'s')
[1] "Downtime per year at 8 nines: 0.315576 seconds"
6σ is the "black belt" level that many companies
aspire to. The associated σ levels correspond to the area contained under the standard normal (or "bell shaped") curve
within that σ interval about the mean. It can be computed using the following R function:
library(NORMT3)
sigp <- function(sigma) {
sigma <- as.integer(sigma)
apc <- erf(sigma/sqrt(2))
return(sprintf("%d-sigma bell area: %10.8f%%; Prob(chance): %e",
sigma, apc*100, 1-apc))
}
> sigp(2)
[1] "2-sigma bell area: 95.44997361%; Prob(chance): 4.550026e-02"
> sigp(5)
[1] "5-sigma bell area: 99.99994267%; Prob(chance): 5.733031e-07"
So, 5σ corresponds to slightly more than 99.9999% of the area under the bell curve; the total area being 100%.
It also corresponds closely to six 9s availability.
The last number is the probability that you happened to achieve that availability by random luck or pure chance.
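If the NORMT3 package is not available, the same areas can be sketched with base R alone, using the identity erf(x/√2) = 2Φ(x) - 1, where Φ is the standard normal CDF (`pnorm` in R). The function names `sig_area` and `sig_tail` are hypothetical:

```r
# Area within +/- sigma of the mean of a standard normal, using base R only.
# Relies on the identity erf(x / sqrt(2)) = 2 * pnorm(x) - 1.
sig_area <- function(sigma) 2 * pnorm(sigma) - 1

# Probability outside +/- sigma, computed from the tail directly
# to avoid the round-off in 1 - area when the area is very close to 1
sig_tail <- function(sigma) 2 * pnorm(-sigma)

areas <- sapply(c(2, 5, 6), sig_area)
tails <- sapply(c(2, 5, 6), sig_tail)
```

Computing the tail directly matters most at 6σ and beyond, where subtracting the area from 1 leaves few significant digits.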
A reasonable mnemonic for some of these values is:
[Figure: mnemonic for the σ-level values]
Because of the great diversity of time scales that exist in modern computer systems, it's a good idea to try and get
a more intuitive feel for some of them.
My first attempt at helping you to do that was in 1998, where I included the table shown at the left in
The Practical Performance Analyst.
I updated that as Table 3.1 in my Perl PDQ book. Here, I've rendered those quantities as a data frame in R:
Subsystem nanoSeconds secondZ Rescaling SIunit
1 4 GHz CPU clock 2.50e-01 0.25 1.00 s
2 L1 cache access 5.00e-01 0.50 1.00 s
3 L2 cache access 1.25e+00 1.25 1.00 s
4 Memory bus cycle 2.00e+00 2.00 1.00 s
5 DRAM chip access 6.00e+01 60.00 1.00 min
6 Physical disk seek 3.50e+06 3500000.00 1.33 month
7 Network NFS read 3.20e+07 32000000.00 1.01 yr
8 Database SQL update 5.00e+08 500000000.00 15.84 yr
9 Magnetic tape read 5.00e+09 5000000000.00 1.58 century
To get a better impression of the vast range of time scales, actual nanoseconds are rescaled to be units measured in seconds. In other words, a multiplier of a gigasecond has been applied to the actual time values to make them human scale. The key point is that since a CPU processor operates on a nanosecond time-scale in the real world or "seconds" on the rescaled human level, RAM accesses take on the order of minutes, disk accesses take on the order of a month, database accesses take years, and magnetic-tape IOs take on the order of a century. Looked at from the standpoint of the CPU processor clock, this gives a somewhat different slant to the expression: Hurry up and wait!
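The rescaling itself can be sketched as follows. The helper name `humanize` and the unit definitions (a 365.25-day year, a month as one twelfth of that) are my assumptions, chosen because they appear consistent with the Rescaling column above; only a subset of the rows is used:

```r
# Latencies in nanoseconds; after rescaling 1 ns -> 1 s the numbers are unchanged,
# only the unit is reinterpreted as seconds
ns <- c("4 GHz CPU clock" = 0.25, "DRAM chip access" = 60,
        "Physical disk seek" = 3.5e6, "Network NFS read" = 3.2e7,
        "Database SQL update" = 5e8, "Magnetic tape read" = 5e9)

humanize <- function(secs) {
  # Unit sizes in seconds, in ascending order
  units <- c(s = 1, min = 60, hr = 3600, day = 86400,
             month = 365.25 * 86400 / 12, yr = 365.25 * 86400,
             century = 100 * 365.25 * 86400)
  # Pick the largest unit not exceeding the value (fall back to seconds)
  k <- max(1, sum(units <= secs))
  sprintf("%.2f %s", secs / units[k], names(units)[k])
}

rescaled <- sapply(ns, humanize)
print(rescaled)
```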
On closer scrutiny, one finds many problems with the conventional wisdom:
- Bandwidth and latency are ill-defined terms.
- Latency refers to some kind of (heavily context-dependent) delay, e.g., disk latency means something completely different from packet latency.
- Queueing theory is very specific about different latencies: waiting time, service time, response time, etc.
- Bandwidth usually refers to a special case of throughput, i.e., maximum throughput.
- Throughput and latency only appear to be unrelated in clocked systems, like flow through a pipe (see above).
- If you apply the conventional wisdom to database performance, for example, you will be surprised to find that latency (e.g., response time) increases with throughput.
Using the "flat" approximation between throughput and latency may be appropriate for certain systems, as long as the more global truth doesn't go unrecognized. See my blog for a deeper discussion of this topic.
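The last point in the list, that latency grows with throughput, can be illustrated with the simplest queueing model. This sketch assumes a single M/M/1 queue, for which R = S / (1 - X S); the service time and throughput values are hypothetical:

```r
# Response time of a single M/M/1 queue as a function of throughput X:
#   R = S / (1 - X * S), valid for X < 1 / S
S <- 0.010                          # hypothetical service time: 10 ms
X <- c(10, 50, 80, 90, 95)          # throughput in requests/s (capacity is 1/S = 100)
R <- S / (1 - X * S)                # response time in seconds

print(data.frame(X = X, U = X * S, R_ms = 1000 * R))
```

At low throughput the curve looks flat, which is where the conventional wisdom comes from; near saturation the response time climbs steeply.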
"90% of coding is debugging. The other 10% is writing bugs."
A performance model should be as simple as possible, but no simpler!
Someone else said:
"A designer knows that he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away." —Antoine de Saint-Exupéry
I now tell people in my Guerrilla classes, despite the fact that I repeat this rule of thumb several times, you will throw the kitchen sink into your performance models; at least, early on as you first learn how to create them. It's almost axiomatic: the more you know about the system architecture, the more detail you will try to throw into the model. The goal, in fact, is the opposite.
The Tube map is pure abstraction that has very little to do with the physical railway system. It encodes only sufficient detail to enable transit on the underground from point A to point B. It does not include a lot of irrelevant details such as altitude of the stations, or even their actual geographical proximity. A performance model is a similar kind of abstraction.
Despite several attempts, the original Tube map has hardly been improved upon since its conception in 1933. Apparently, it already met the requirement of being as simple as possible, but no simpler. The fact that it was designed by an electrical draughtsman, probably helped.
As an example, consider a classic multi-user time-share OS, e.g., Unix. Its PoP can be stated as follows:
Time-share scheduling gives every user the illusion that they are the only user on the system. [17 words]
All the thousands of lines of code in the OS that implement time-slicing, priority queues, and so forth, are there merely to support that PoP. The performance goal in that case is user response time. Being able to see things in this unconventional way can really help in deciding on the simplest kind of performance model.
Now that you get the idea, here are a couple of questions for you to ponder:
You, as the performance analyst or planner, only have to shine the light in the right place and then stand back while others flock to fix it.
One place to start constructing a PDQ model is by drawing a functional block diagram. The objective is to identify where time is spent at each stage in processing the workload of interest. Ultimately, each functional block is converted to a queueing subsystem like those shown above. This includes the ability to distinguish sequential and parallel processing. Other diagrammatic techniques, e.g., UML diagrams, may also be useful but I don't understand that stuff and never tried it. See Chap. 6 "Pretty Damn Quick (PDQ) - A Slow Introduction" of Analyzing Computer System Performance with Perl::PDQ.
Take Little's law Q = X R for example. It is a performance model; albeit a simple equation or operational law, but a model nonetheless. All the variables on the RIGHT side of the equation (X and R) are INPUTS, and the single variable on the LEFT is the OUTPUT. A more detailed discussion of this point is presented in Chap. 6 "Pretty Damn Quick (PDQ) - A Slow Introduction" of Analyzing Computer System Performance with Perl::PDQ.
If the measurements of the real system do not include the service time for a queueing node that you think ought to be in your PDQ model, then that PDQ node cannot be defined.
Suppose, for example, you had requests coming into an HTTP server and you could measure its CPU utilization with some UNIX tool like vmstat, and you would like to know the service time of the HTTP Gets. UNIX won't tell you, but you can use Little's law (U = X S) to figure it out. If you can measure the arrival rate of requests in Gets/sec (X) and the CPU %utilization (U), then the average service time (S) for a Get is easily calculated from the quotient U/X.
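The quotient U/X is a one-liner, but the unit conversion is easy to get wrong. A minimal sketch, assuming a single CPU and made-up measurements (the function name `service_time` is hypothetical):

```r
# Service time from U = X * S, so S = U / X
service_time <- function(U_pct, X) {
  stopifnot(X > 0, U_pct >= 0, U_pct <= 100)
  (U_pct / 100) / X          # convert %busy to a fraction before dividing
}

# Hypothetical measurements: 45% CPU busy at 150 Gets/s
S <- service_time(U_pct = 45, X = 150)   # seconds per Get
print(S)
```

With these numbers each Get consumes about 3 ms of CPU time on average.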
An open queueing model assumes an infinite population of requesters initiating requests at an arrival rate λ (lambda). In a closed model, λ (lambda) is approximated by the ratio N/Z. Treat the think time Z as a free parameter, and choose a value (by trial and error) that keeps N/Z constant as you make N larger in your PDQ model. Eventually, at some value of N, the OUTPUTS of both the closed and open models will agree to some reasonable approximation.
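The convergence can be sketched without PDQ itself. The following assumes a single queueing center, uses the textbook exact MVA recursion for the closed model and the M/M/1 formula for the open one, and holds N/Z fixed at a hypothetical λ. The function names and parameter values are illustrative only:

```r
# Exact MVA for a closed model with one queueing center and think time Z
closed_R <- function(N, Z, S) {
  Q <- 0
  for (n in 1:N) {
    R <- S * (1 + Q)      # an arrival sees the queue left by the other n-1 users
    X <- n / (R + Z)      # throughput from the response time law
    Q <- X * R            # Little's law
  }
  R
}

# Open M/M/1 response time at arrival rate lambda
open_R <- function(lambda, S) S / (1 - lambda * S)

S      <- 1.0     # hypothetical service time
lambda <- 0.5     # target arrival rate, held fixed via N/Z
N      <- c(1, 10, 100, 1000)
Rc     <- sapply(N, function(n) closed_R(n, Z = n / lambda, S = S))
Ro     <- open_R(lambda, S)

print(data.frame(N = N, Z = N / lambda, R_closed = Rc, R_open = Ro))
```

As N grows with N/Z held constant, the closed-model response time approaches the open-model value from below.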
Apart from making an unwieldy PDQ report to read, generally you are only interested in the interaction of 2 workloads (pairwise comparison). Everything else goes in the third (AKA "the background"). If you can't see how to do this, you're probably not ready to create the PDQ model.
Predicting the future is not the same thing as "seeing" the future. A performance model is just a means for evaluating the data that are provided to it. The purpose of the model is to transform those data into information. The amount of information that can be extracted is intimately dependent on the values of those data. Change the input data and you change the output information. An oft-quoted example is: garbage in, garbage out. That's an extreme example. More commonly, you may see unexpected modeling results. In that case, the new data do not meet the expectations set by the prior measurements. But that doesn't necessarily imply the model is wrong. More likely something has changed in the measurement system so that it has failed to remain consistent with the initial information contained in the previous measurements. Having a performance model forces you to ask why that unexpected change has occurred and whether anything can be done to remove it. Without a performance model, you don't have any expectations or context for asking such questions.
Colorless green ideas sleep furiously. —N. Chomsky (1957)
New research indicates that if a person is not in control of the situation, they are more likely to see patterns where none exist, suffer illusions and believe in conspiracy theories.
In the context of computer performance analysis, the same conclusion might well apply when looking at data collected from a system that you don't understand.
The conventional view of performance models is that they are useful for:
- Predicting the future performance of an extant system
- Exploring what-if scenarios that may or may not be realistic
A performance model can also be used for interpreting performance measurements. Both the measurements and the model must be consistent or something is wrong and needs to be explained.
Ulysses is the Greek hero in Homer's Odyssey. On his way back from the Trojan wars, Ulysses orders his men to tie him to the mast of his ship and to plug their own ears so that they will not succumb to the beautiful song of the sirens and be diverted to their deaths. Ulysses, being a typical manager, chooses to be bound and to keep his ears unplugged because he cannot bear the idea of not hearing the sirens' music.
For a more complete discussion, see:
- See Chapter 4 in 1st edn. of my Perl::PDQ book
- See Chapter 6 in 2nd edn. of my Perl::PDQ book
- Read the original online articles (as PDFs):
- How to convert the absolute load average to the relative stretch factor metric
Measurements are expressed using numbers, but measurement is not the same as mathematics. Mathematics involves exact quantities, e.g., the prime numbers. They are exact by definition. The number 11 is provably a prime number. Exactly. It's not approximately a prime number. Measured numbers, on the other hand, are the result of a process involving comparisons. It's an estimation procedure and therefore cannot produce exact numbers, despite the fact that measurements are often displayed as though they were exact numbers, e.g., 11 seconds or 23% busy. More on that shortly. The measurement process tries to determine the closeness of the comparison. Since the assessed closeness is always an approximation, it can never be exact (like a prime number) and therefore must contain errors.
In that sense, measured numbers look more like the mathematical representation of irrational numbers, e.g., π. By definition, π = C/D is the ratio of the circumference of a circle C to its diameter D. We can think of the diameter as being like a ruler or tape measure, where the ratio C/D is like asking how many diameters there are in the circumference. One possible result is the ratio 3/1, i.e., there are 3 diameters in the circumference. But you already know that the exact number 3 is less than the value of π. If the circumference was π meters long, there would be a small gap remaining after you stepped your meter-ruler three times along the circumference.
On the other hand, you might be willing to accept π = 3 as a rough approximation, and live with the gap, depending on how you plan to use that value. For example, it might be good enough to do a quick calculation in your head. Successively better approximations could involve ratios like: 22/7, 333/106, 355/113. These ratios are analogous to reading off graduation marks on your ruler to improve the precision of your measurement. But none of these ratios or fractions is exactly equal to π.
That's why π is declared to be ir-ratio-nal: unable to be written exactly as a ratio of whole numbers. Nor is a more familiar number, like 3.1415926, any better. That's just an inexact decimal fraction. They are all approximations, with attendant errors. The more important question is, how big is the error? A secondary question is, can you tolerate that level of error? It's a bit of an indictment that all performance measurement tools do not display errors. That omission leads people to believe (inadvertently) that a reported number, like 23% CPU busy, is the same number as the 9th prime number (viz., 23). Since all performance metrics are measured (indeed, those two words have the same root word), it should really be displayed as 23 ± 5% (for example) to remind us that there is always error or "wrongness" in the measurement.
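The question "how big is the error?" can be answered directly for the π analogy. This sketch simply compares the ratios 3/1, 22/7, 333/106 and 355/113 against R's built-in `pi` (itself a finite-precision approximation); the variable names are illustrative:

```r
# Rational approximations to pi and the size of their errors
approx  <- c("3/1" = 3/1, "22/7" = 22/7, "333/106" = 333/106, "355/113" = 355/113)
abs_err <- abs(approx - pi)          # absolute error of each ratio
rel_err <- abs_err / pi              # relative error

print(data.frame(value = approx, abs_err = abs_err, rel_pct = 100 * rel_err))
```

Each successive ratio shrinks the error, but none drives it to zero, which is the same situation as a measured 23% CPU busy.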
Predicting the future is not the same thing as "seeing" the future, in the sense of seeing what things might look like. See GMantra 2.19 for more on this point. All predictions are merely estimates based on (input) data (or other estimates) and therefore predictions come with errors. The only real question is, how much error can you tolerate?
A Google search returns a lot of data, not information. Google admits that: Are you feeling lucky? It should be a certainty, but it's not. Sneakily, Google also knows that your brain will quickly decide what is information in all those links, such that you don't even realize you're doing most of the work which you unconsciously attribute to Google. They know your brain craves patterns.
Collecting performance data is only one half the story: rather like a Google search. You still have to decide what information (if any) is contained in all those data. Unlike a Google search, however, performance data are not simple text (simple for your brain, that is). Performance data usually come in the form of a torrent of numbers which, unlike text, are not simple for your brain to comprehend. Even worse, without doing the proper analysis, the data can be deceptive and lead you to the wrong conclusion. That's where performance analysis tools and models come in. They act as transformers on the data to help your brain decide what is information.
Anything that has to call itself a "science," usually isn't. (Social Science?) How about Information Science?
Pretty soon we'll be talking about Information Technology (IT). Oh, wait!...