Guerrilla Mantras Online

CaP bombs you can drop on your boss or colleagues in team meetings

Updated on Jul 20, 2014


The Guerrilla Manifesto
Management resists, the guerrilla planner retreats.
Management dithers, the guerrilla planner proposes.
Management relents, the guerrilla planner promotes.
Management retreats, the guerrilla planner pursues.


This web page contains:
  1. Updates to the print version of The Guerrilla Manual provided as a pull-out booklet in the original Guerrilla Capacity Planning book.
  2. New aphorisms encapsulated as Guerrilla Mantras or GM that automatically appear on Twitter.

Contents

1  WEAPONS OF MASS INSTRUCTION
    1.1  Why Go Guerrilla?
    1.2  Best Practices
    1.3  Virtualization
    1.4  An Ounce of Prevention
    1.5  Why is Performance Analysis Hard?
    1.6  Brisk vs. Risk Management
    1.7  Failing On Time
    1.8  Performance Homunculus
    1.9  Self Tuning Applications
    1.10  Squeezing Capacity
    1.11  When Wrong is Right
    1.12  Throw More Hardware at It
    1.13  Network Performance
    1.14  Not even wrong!
    1.15  Checked your measurements lately?
    1.16  Data Are Not Divine
    1.17  Busy work
    1.18  Little's Law
    1.19  Bigger is Not Always Better
    1.20  Bottlenecks
    1.21  Benchmarks
    1.22  Failure to Communicate
    1.23  Consolidation
    1.24  Control Freaks Unite!
    1.25  Productivity
    1.26  Art vs. Science
    1.27  ITIL for Guerrillas
    1.28  Performance Paradox
    1.29  Dumb Question
    1.30  Quantum Leap
    1.31  Don't Be a Sheeple
    1.32  Good Communication
    1.33  Performance Gatekeepers
    1.34  Performance Analysis is a Money Sink
    1.35  Tyranny of the 9s
    1.36  Time Scale Intuition
    1.37  Throughput vs. Latency
    1.38  The Zeroth Performance Metric
    1.39  Only Three Performance Metrics
2  PERFORMANCE MODELING RULES OF THUMB
    2.1  What is Performance Modeling?
    2.2  Monitoring vs. Modeling
    2.3  Keep It Simple
    2.4  More Like The Map Than The Metro
    2.5  The Big Picture
    2.6  A Point of Principle
    2.7  Guilt is Golden
    2.8  What is a Queue?
    2.9  Where to Start?
    2.10  Inputs and Outputs
    2.11  No Service, No Queues
    2.12  Estimating Service Times
    2.13  Change the Data
    2.14  Closed or Open Queue?
    2.15  Opening a Closed Queue
    2.16  Steady-State Measurements
    2.17  Transcribing Data
    2.18  Workloads Come in Threes
    2.19  Better Than a Crystal Ball
    2.20  Patterns and Anti-Patterns
    2.21  Interpreting Data
    2.22  Intuition and Modeling
    2.23  Load Average
    2.24  VAMOOS Your Data Angst
    2.25  Measurement Errors
    2.26  Modeling Errors
    2.27  Data Ain't Information
    2.28  Data Science
3  UNIVERSAL SCALABILITY LAW (USL)

1  WEAPONS OF MASS INSTRUCTION

1.1  Why Go Guerrilla?

The planning horizon is now 3 months, thanks to the gnomes on Wall Street. Only Guerrilla-style tactical planning is crazy enough to be compatible with that kind of insanity.

1.2  Best Practices

Best practices are an admission of failure.
Mindlessly following the best practices defined by others is actually a failure to understand your own particular requirements for performance and capacity. If you don't understand your own particular performance and capacity problems, why would you expect other people's practices (i.e., solutions) to be appropriate for you? Copying someone else's apparent success is like cheating on a test. You may be lucky and make the grade this time, but will that bluff still work in the future?
Best practices are like most IT rules of thumb: over time, they RoT.
A recent NPR program discussed how applying best practices has backfired in the arena of medicine and healthcare. This analogy is very appropriate because many aspects of performance analysis are not too different from the science of medical diagnosis. See the segment on The Limits Of Best Practices (NPR, Sept 21, 2011).
In the context of software performance engineering, see "10 Practices of Highly Ineffective Software Developers." I would also add the following (based on my own experience):
11. Avoid acceptance testing like the plague.
12. Develop on a platform totally unrelated to the production system.
13. Never compare load test requirements and results with actual remote user-based measurements from Keynote, Gomez, etc.

1.3  Virtualization

All virtualization is about illusions and Voltaire said: "Illusion is the first of all pleasures." However, when it comes to IT, even if perpetrating illusions on a user provides a more pleasurable experience, it is not OK to foist them on the performance analyst or capacity planner.
Translation: We performance weenies need more whistles and less bells. In other words, virtualization vendors need to make sure they provide us with backdoors and peepholes so we can measure how resources are actually being consumed in a virtualized environment.
Corollary: Can you say, transparency? It's better for the IT support of business if we can manage it properly. To manage it, it can't be illusory.

1.4  An Ounce of Prevention

Capacity management is largely about prevention. But someone once told me "You can't sell prevention!"; the implication being that an ounce of prevention is worthless.
Then how do you explain the multi-billion dollar dietary-supplements industry?
It's not what you sell, but how you sell it.
Recent studies* also suggest that it may be harder to sell prevention in English. The more clearly a language distinguishes the future (tense), the less likely the speaker is to feel threatened by what will happen there. Roughly put: if it's not happening to me now, it won't happen to me.
* The Effect of Language on Economic Behavior: Evidence from Savings Rates, Health Behaviors, and Retirement Assets

1.5  Why is Performance Analysis Hard?

Both performance analysis and capacity planning are complicated by your brain thinking linearly about a computer system that operates nonlinearly.
Looked at another way, collecting and analyzing performance metrics is very important, but understanding the relationship between those metrics is vital. Reason? Those metric relationships are nonlinear. That's why we rely on tools like queueing models and the universal scalability law. They encode the correct nonlinearities for us.

1.6  Brisk vs. Risk Management

BRisk management, isn't. Perceived risk (psychology) and managed risk (analysis) are not the same thing.
Here's an actual example of (mis)perceived risk:
"I can understand people being worked up about safety and quality with the welds," said Steve Heminger, executive director ... "But we're concerned about being on schedule because we are racing against the next earthquake."
This is a direct quote from a Caltrans executive manager for the new Bay Bridge construction between Oakland and San Francisco. He is saying that Caltrans management decided to ignore the independent consultant analysis of the welding quality in order to stay on schedule. Yikes! See mantra 1.7.
Although he is not an IT manager, the point about BRisk management is the same.
You can read more background on this topic on my blog.

1.7  Failing On Time

Management will often let a project fail—as long as it fails on time!
Until you read and heed this statement, you will probably have a very frustrating time getting your performance analysis conclusions across to management.
Recall the Bay Bridge example in mantra 1.6. A section of the upper deck of the current Bay Bridge collapsed during the Loma Prieta earthquake of 1989. Now, the Caltrans manager is watching the clock and concluding that it's better to increase the risk that the new bridge will fail (by being brisk about weld inspections), in order to beat the much lower risk that the current bridge might fail again in some unpredictable future quake. Substitute your favorite IT project, product or application for the word bridge and you get the idea.
Update: As of May 2013 the original high-risk Caltrans decision has prompted Gov. Jerry Brown to threaten delaying the scheduled Labor Day opening of the new Bay Bridge span. Erm... so, how did this brisk management decision save time (and money)?

1.8  Performance Homunculus

A list of system management activities might include such things as software distribution, security, backup, and capacity management. Of these, all but capacity management have some kind of shrink-wrap or COTS solution. Capacity and performance management cannot be treated as just another bullet item on a list of things to do.
Cap and Perf management is to systems management as the homunculus (sensory proportion) is to the human body (geometric proportion).
Cap and Perf management can rightly be regarded as just a subset of systems management, but the infrastructure requirements for successful capacity planning (both the tools and knowledgeable humans to use them) are necessarily out of proportion with the requirements for simpler systems management tasks like software distribution, security, backup, etc. It's self-defeating to try doing capacity planning on the cheap.

1.9  Self Tuning Applications

Self-tuning applications are not ready for prime time. How can they be when human performance experts get it wrong all the time!?
Think about it. Performance analysis is a lot like a medical examination, and medical Expert Systems were heavily touted in the mid-1980s. You don't hear about them anymore. And you know that if they worked, HMOs would be all over it. It's a laudable goal but if you lose your job, it won't be because of some expert performance robot.

1.10  Squeezing Capacity

Capacity planning is not just about the future anymore.
Today, there is a serious need to squeeze more out of your current capital equipment.

1.11  When Wrong is Right

Capacity planning is about setting expectations. Even wrong expectations are better than no expectations!
Or, as the Cheshire Cat said: "If you don't know where you are going, any road will get you there."
The planning part of capacity planning requires making predictions. Even a wrong prediction is useful because it can serve as a warning that either:
  1. the understanding which underlies your prediction is wrong
  2. the measurement process is broken and is producing wrong data
Either way, something needs to be corrected, but you wouldn't realize that without making a prediction in the first place. If you aren't iteratively correcting predictions throughout a project life-cycle, you will only know things are amiss when it's too late! GCaP says you can do better than that.

1.12  Throw More Hardware at It

The classic over-engineering gotcha. Hardware is certainly cheaper today, but a boat-load of cheap PCs from China won't help one iota if the application runs single-threaded.
[Figure: shipwreck. Single-threadedness can wreck you.]
This is now my canonical response to the oft-heard platitude: "We don't need no stinkin' capacity planning, we'll just throw more hardware at it." The capacity part is easy. It's the planning part that requires brain power.

1.13  Network Performance

It's never the network. But it might be the network admin. :-)
If your local network is out of bandwidth, has interminable latencies, or is otherwise glitching, don't bitch about the performance of your application. Get the LAN fixed, first. Then we'll talk about the performance of your application.

1.14  Not even wrong!

Here is a plot of benchmarked round-trip times (RTT) for a set of applications as a function of increasing user load (clients). Take a good, long look. If your application has concave response times like these... SHIP IT!
In case you're wondering, those are REAL data and yes, the axes are correctly labeled. I'll let you ponder what is wrong with these measurements. Here's a hint: They're so broken, they're not even wrong! Only if you don't understand basic queueing theory would you accept measurements like these (which the original performance engineer did).

1.15  Checked your measurements lately?

When I'm asked, "But, how accurate are your performance models?" my canonical response is, "Well, how accurate are your performance DATA!?"
Most people remain blissfully unaware of the fact that ALL measurements come with errors, both systematic and random. An important capacity planning task is to determine and track the magnitude of the errors in your performance data. Every datum should come with a "±" attached (which will then force you to put a number after it).

1.16  Data Are Not Divine

Treating performance data as something divine is a sin.
Data comes from the Devil, only models come from God.
Corollary: That means it's helpful to be able to talk to God. But God, she does babel, sometimes. :)

1.17  Busy work

Busy work does not the truth maketh.
Western culture too often glorifies hours clocked as productive work. If you don't take time off to come up for air and reflect on what you're doing, how are you going to know when you're wrong?

1.18  Little's Law

Little's law means a lot! I must say I don't like the notation on that Wikipedia page but, more importantly, Wikipedia fails to point out that there are really two versions of Little's law. (Even John Little doesn't know that.)
  1. Little's BIG law:
    Q = λR        (1)
    which relates the average queue length (Q) to the residence time R = W + S. Here, W is the average time you spend waiting in line to get your groceries rung up, for example, and S is the average service time it takes to ring up your groceries once you get to the cashier.
    [Figure: grocery store queue]
  2. Little's little law:
    ρ = λS        (2)
    which often goes by the name Utilization law. It relates the average utilization (ρ), e.g., of the cashier, to the service time (S). Equation (2) is derived from (1) by simply setting W = 0 on the right-hand side of the equation. In both equations, the left-hand side is a pure number, i.e., it has no formal units (% is not a unit).
It is important to realize that eqns.(1) and (2) are really variants of the same law: Little's law. Here's why:
  1. Eqn.(1) tells us the average number of customers or requests in residence.
  2. Eqn.(2) tells us the average number of customers or requests in service.
That second interpretation of utilization can be very important for performance analysis but is often missed in textbooks and elsewhere (including Wikipedia pages).
You should learn Little's law (both versions) by heart.
I use it almost daily as a cross-check to verify that throughput and delay values are consistent, no matter whether those values come from measurements or models. Another use of Little's law is calculating service times, which are notoriously difficult to measure directly. If you know the values of ρ (utilization) and λ (throughput), you can calculate S (service time) using eqn.(2).
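
As a daily-use illustration, here is that cross-check in R. The numbers are hypothetical, chosen only to show the arithmetic:

lambda <- 100       # measured throughput (requests/sec)
R      <- 0.5       # measured residence time (sec)
rho    <- 0.65      # measured utilization (65% busy)

Q <- lambda * R     # eqn.(1): average requests in residence => 50
S <- rho / lambda   # eqn.(2) rearranged: service time => 0.0065 sec (6.5 ms)
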
More details about Little's law can be found in Analyzing Computer System Performance with Perl::PDQ. See also Mantras 2.10 and 2.12.
Finally, here's the lore behind Little's Law.

1.19  Bigger is Not Always Better

Beware the SMP wall!
The bigger the symmetric multiprocessor (SMP) configuration you purchase, the busier you need to run it. But only to the point where the average run-queue begins to grow. Any busier and the user's response time will rapidly start to climb through the roof.

1.20  Bottlenecks

You never remove a bottleneck, you just move it.
In any collection of processes (whether manufacturing or computing) the bottleneck is the process with the longest service demand. In any set of processes with inhomogeneous service time requirements, one of them will require the maximum service demand. That's the bottleneck. If you reduce that service demand (e.g., via performance tuning), then another process will have the new maximum service demand. These are sometimes referred to as the primary and the secondary bottlenecks. The performance tuning effort has simply re-ranked the bottleneck.
For an even deeper perspective, see my blog post: Characterizing Performance Bottlenecks.

1.21  Benchmarks

All competitive benchmarking is institutionalized cheating.
The purpose of competitive benchmarking a computer system is to beat everyone else on performance, so you can say "mine is bigger than yours." It's the IT equivalent of war! Benchmark run-rules were made to be broken or at least bent; just don't get caught. This must be true because industrial benchmark organizations like SPEC.org and TPC.org have technical review committees that look for cheating in submitted results.
For capacity planning and system sizing, you need to be aware of this fact of life and look for the loopholes in published benchmark results.
Here, competitive refers to industrial benchmarks, as opposed to benchmarking that you might do for purely internal comparisons or diagnostic purposes.

1.22  Failure to Communicate

"What we've got here is failure to communicate." —Prison warden in the movie Cool Hand Luke.
The purpose of your capacity planning presentation is to communicate the findings of your analysis. No doubt, it takes a long time to do the analysis correctly. So, the last thing you want is a FAIL.
Question:
How long should you spend creating your capacity planning presentation?
Rule of thumb:
You should spend as much time refining the presentation of your capacity planning results as you did reaching them.
If your audience is missing the point, or you don't really have one because you didn't spend enough time developing it, you just wasted a lot more than the time allotted for your presentation.
That it takes time—a lot of time—to hone your point, is captured nicely in the following quote due to a famous French mathematician:
"Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte." —Blaise Pascal (Dec 4, 1656).
Translation: I have made it [this letter] longer because I have not had time to make it shorter.

1.23  Consolidation

Guerrilla law of consolidation: Remove it and they will come!
When consolidating applications, it's not the migration that's the problem, it's the often mistaken motivation. The fallacious reasoning goes like this. Since there are a slew of servers that are only running at 10% busy, migrate those apps onto fewer (virtualized) servers. Naturally, the target server capacity will be better utilized but the application response time (SLA) has now gone out the window.
When you remove the necessary MIPS, the users will come ... and complain.

1.24  Control Freaks Unite!

Your own applications are the last refuge of performance engineering.
Control over the performance of hardware resources, e.g., CPUs and disks, is progressively being eroded as these things simply become commodity black boxes, viz., multicore processors and disk arrays. This situation will only be exacerbated with the advent of Internet-based application services. Software developers will therefore have to understand more about the performance and capacity planning implications of their designs running on these black boxes. (See Sect. 3)

1.25  Productivity

If you want to be more productive, go to sleep.
Thanks to the Puritans, presumably, American corporate culture is obsessed with the false notion that being busy is being productive. Wrong! Europeans (especially the Mediterraneans) understand the power of the cat nap. After nearly 400 years, it's time for America to get over it.
"Creativity is the residue of time wasted."A. Einstein
Who works the longest hours?
[Figure: OECD annual working hours by country]
Data come from the OECD (Organization for Economic Co-operation and Development). Developing countries often work longer hours, but working longer doesn't necessarily mean working better. Interestingly, the USA is in the middle along with Australia.

1.26  Art vs. Science

When it comes to the art of performance analysis and capacity planning, the art is in the science.
A number of recent books and presentations on performance analysis and capacity planning have appeared with "The Art of..." in the title. In itself, this is not new. The title of Raj Jain's excellent 1991 book is The Art of Computer Systems Performance Analysis. Nonetheless, they all resort to various scientific techniques to analyze performance data. The application of science inevitably involves some art.

1.27  ITIL for Guerrillas

Q: What goes in the ITIL box: Business Metrics → Service Delivery → Service Level Management → Capacity management?
[Figure: ITIL framework]
Source: Guerrilla Capacity Planning (Springer 2007)
A: Capacity management = GCaP.
The ITIL framework is all about defining IT processes to satisfy business needs, not the implementation of those processes. That's what makes Guerrilla Capacity Planning (GCaP) and GCaP training excellent ITIL business solutions.

1.28  Performance Paradox

Almost by definition, performance management contains a hidden paradox:
If you do performance management perfectly, you run the risk of becoming invisible and therefore expendable to management.
In other words, having successfully supported performance management, a manager could eventually feel justified in asking: "Why is my budget paying for performance management when everything is performing perfectly?" (read: career-limiting). On the other hand, if performance sucks, that's a performance management problem. Moral: Perfect is not a requirement.
Compare this situation with software development. If a developer does their job perfectly, they risk being overburdened with more work than they can handle (read: job security). If the application breaks, that's a software development problem (read: job security).

1.29  Dumb Question

The only dumb question is the one never asked.

1.30  Quantum Leap

A quantum leap is neither. It can't be both quantal (the correct adjective) and a leap. So, it's an oxymoron.
If it were quantal, it would be infinitesimal (on the order of 10^-10 meters) and therefore not observable by us. If the leap were of the regular observable variety, it could not have a quantum magnitude.
Quantum transitions in energy are only associated with the discrete spectrum of atomic or molecular bound states.
Try to avoid communication nonsensica (GMantra 1.31).

1.31  Don't Be a Sheeple

People who follow "Thought leaders" presumably have the intellect of Orwellian lemmings.
Think for yourself and think critically about what other people tell you (including me).

1.32  Good Communication

Quantum leap (GMantra 1.30) is right up there with other moronic phrases like "sea change" (what IS that?) and "moving forward"—who draws attention to moving backwards?
That last one was tweeted and ended up as entry #136 in David Pogue's Twitter Book:
[Figure: Pogue tweet]
My most recent favorite is this one. WAYNE SWAN (politician): "I will not rule anything in or rule anything out."
Ruling out, I get: take a ruler and draw a line through it. But how do you rule something in!?
And Mr. Swan didn't invent it. He's just mindlessly repeating it because he heard other boffins say it, and I'm quoting him because it was captured in a transcript. (Double jeopardy)
Good communication, which is vital for good performance analysis and capacity planning, requires that you be a shepherd, not a sheeple. Don't use nonsensical phrases just because everyone else does. Besides, it makes you sound like an idiot ... or worse: a politician.

1.33  Performance Gatekeepers

Performance analysis is too complex and important to be left to enthusiastic individuals. Performance specialists should act as gatekeepers.
A common situation in big organizations is for various groups to be responsible for the performance evaluation of the software or hardware widgets they create. In principle, this is a good thing, but there is a downside. Over-zealous performance optimization of any subsystem can de-optimize the entire system. To avoid this side-effect, a separate central group should be responsible for oversight of total system performance. They should act as both reviewers and gatekeepers for the performance analyses produced by all the other groups in the organization.

1.34  Performance Analysis is a Money Sink

There is a serious misconception that precautions like security management are part of the cost of doing business, but performance analysis actually costs the business. In other words, performance anything is perceived as a cost center, or money down the drain.
[Figure: money sink]
Remember the performance homunculus in Section 1.8.
Unfortunately, there is some justification for this view. Performance activities can be seen as inflating schedules and therefore delaying expected revenue. See section 1.7. Moreover, there can be an incentive to charge for the "performance upgrade" further down the line.
Better to be aware of these perceptions than be left wondering why your performance initiatives are not being well received by management.

1.35  Tyranny of the 9s

You've no doubt heard of the Tyranny of the 9s, but how about subjugation to the sigmas?
 Nines   Percent       Downtime/Year          σ Level
   4     99.99%        52.596 minutes          ≈ 4σ
   5     99.999%       5.2596 minutes            -
   6     99.9999%      31.5576 seconds         ≈ 5σ
   7     99.99999%     3.15576 seconds           -
   8     99.999999%    315.576 milliseconds    ≈ 6σ
The following R function will do the calculations for you.
downt <- function(nines, tunit=c('s','m','h')) {
	# Unavailable fraction 10^(-nines) times the seconds in a (Julian) year
	ds <- 10^(-nines) * 365.25*24*60*60
	# Convert seconds to the requested display unit
	if(tunit == 's') { ts <- 1;    tu <- "seconds" }
	if(tunit == 'm') { ts <- 60;   tu <- "minutes" }
	if(tunit == 'h') { ts <- 3600; tu <- "hours" }
	return(sprintf("Downtime per year at %d nines: %g %s", nines, ds/ts, tu))
}

> downt(5,'m')
[1] "Downtime per year at 5 nines: 5.2596 minutes"
> downt(8,'s')
[1] "Downtime per year at 8 nines: 0.315576 seconds"

6σ is the "black belt" level that many companies aspire to. The associated σ levels correspond to the area contained under the standard normal (or "bell shaped") curve within that σ interval about the mean. It can be computed using the following R function:
library(NORMT3)  # provides the error function, erf()
sigp <- function(sigma) {
	sigma <- as.integer(sigma)
	# Two-sided area under the standard normal curve within ±sigma of the mean
	apc <- erf(sigma/sqrt(2))
	return(sprintf("%d-sigma bell area: %10.8f%%; Prob(chance): %e", 
		sigma, apc*100, 1-apc))
}

> sigp(2)
[1] "2-sigma bell area: 95.44997361%; Prob(chance): 4.550026e-02"
> sigp(5)
[1] "5-sigma bell area: 99.99994267%; Prob(chance): 5.733031e-07"

So, 5σ corresponds to slightly more than 99.9999% of the area under the bell curve; the total area being 100%. It also corresponds closely to six 9s availability.
The last number is the probability that you happened to achieve that availability by random luck or pure chance. A reasonable mnemonic for some of these values is:

1.36  Time Scale Intuition

Because of the great diversity of time scales that exist in modern computer systems, it's a good idea to try and get a more intuitive feel for some of them. My first attempt at helping you to do that was in 1998, with a table I included in The Practical Performance Analyst.

I updated that as Table 3.1 in my Perl PDQ book.

Here, I've rendered those quantities as a data frame in R:

            Subsystem nanoSeconds       secondZ Rescaling  SIunit
1     4 GHz CPU clock    2.50e-01          0.25      1.00       s
2     L1 cache access    5.00e-01          0.50      1.00       s
3     L2 cache access    1.25e+00          1.25      1.00       s
4    Memory bus cycle    2.00e+00          2.00      1.00       s
5    DRAM chip access    6.00e+01         60.00      1.00     min
6  Physical disk seek    3.50e+06    3500000.00      1.33   month
7    Network NFS read    3.20e+07   32000000.00      1.01      yr
8 Database SQL update    5.00e+08  500000000.00     15.84      yr
9  Magnetic tape read    5.00e+09 5000000000.00      1.58 century

To get a better impression of the vast range of time scales, the actual nanosecond values are rescaled to units measured in seconds. In other words, a multiplier of 10^9 has been applied to the actual time values to make them human scale.
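
If you want to play with these numbers yourself, here is a minimal sketch of how such a data frame might be built in R (the times are those shown above; the Rescaling and SIunit columns simply re-express secondZ in the nearest human unit):

subsystem <- c("4 GHz CPU clock", "L1 cache access", "L2 cache access",
	"Memory bus cycle", "DRAM chip access", "Physical disk seek",
	"Network NFS read", "Database SQL update", "Magnetic tape read")
nanoSeconds <- c(0.25, 0.5, 1.25, 2, 60, 3.5e6, 3.2e7, 5e8, 5e9)
secondZ <- nanoSeconds   # the same digits re-read as seconds (a factor of 10^9)
data.frame(Subsystem=subsystem, nanoSeconds=nanoSeconds, secondZ=secondZ)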

The key point is that since a CPU processor operates on a nanosecond time-scale in the real world or "seconds" on the rescaled human level, RAM accesses take on the order of minutes, disk accesses take on the order of a month, database accesses take years, and magnetic-tape IOs take on the order of a century.

Looked at from the standpoint of the CPU processor clock, this gives a somewhat different slant to the expression: Hurry up and wait!

1.37  Throughput vs. Latency

The conventional wisdom that bandwidth and latency are independent metrics is wrong or, at the very least, misleading. The standard example offered is flow through a pipe:
  1. Bandwidth (or throughput) corresponds to the diameter of the pipe.
  2. Latency corresponds to the length of the pipe (implicit time to flow through it at a constant rate).
Since each of these dimensions can be varied independently of one another, so the argument goes, throughput and latency are independent metrics.
In performance analysis, however, these two dimensions are not only related, they are related nonlinearly by Little's law. Like the earth, the world of performance is also curved, not flat.
On closer scrutiny, one finds many problems with the conventional wisdom:
  1. Bandwidth and latency are ill-defined terms.
  2. Latency refers to some kind of (heavily context-dependent) delay, e.g., disk latency means something completely different from packet latency.
  3. Queueing theory is very specific about different latencies: waiting time, service time, response time, etc.
  4. Bandwidth usually refers to a special case of throughput, i.e., maximum throughput.
  5. Throughput and latency only appear to be unrelated in clocked systems; like flow through a pipe (see above).
  6. If you apply the conventional wisdom to database performance, for example, you will be surprised to find that latency (e.g., response time) increases with throughput.
Using the "flat" approximation between throughput and latency may be appropriate for certain systems, as long as the more global truth doesn't go unrecognized.
See my blog for a deeper discussion of this topic:
  1. Bandwidth and Latency are Related Like Overclocked Chocolates
  2. Bandwidth vs. Latency - The World is Curved
  3. Little's law is a curved surface in 3-D

1.38  The Zeroth Performance Metric

Time is the zeroth performance metric. See Chap. 3 of Analyzing Computer System Performance with Perl::PDQ.

1.39  Only Three Performance Metrics

There are only three performance metrics:
  1. TIME
    Example metrics: waiting time (e.g., Oracle DB), latency
    Example Units: cpu-ticks, milliseconds
  2. COUNT
    Example metrics: runqueue length, packets dropped, RSS
    Example Units: packets, pages, connections
  3. RATE
    Example metrics: throughput, bandwidth
    Example Units: MIPS, Gbps, GHz
More accurately stated, there are only three classes of metrics. Every performance metric has to fall into one of these three classes or something is wrong. Of these, TIME metrics are the basis of all performance measurement and analysis. See mantra 1.38.

2  PERFORMANCE MODELING RULES OF THUMB

Here are some ideas that might be of help when you're trying to construct your capacity planning or performance analysis models.

2.1  What is Performance Modeling?

All modeling is programming and all programming is debugging.
Similarly seen on Twitter:
"90% of coding is debugging. The other 10% is writing bugs."

2.2  Monitoring vs. Modeling

The difference between performance modeling and performance monitoring is like the difference between weather prediction and simply watching a weather-vane twist in the wind.

2.3  Keep It Simple

Nothing like jumping into the pool at the deep end. Just don't forget your swimming togs in the excitement. To paraphrase Einstein:
A performance model should be as simple as possible, but no simpler!
Someone else said:
"A designer knows that he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away." —Antoine de St-Expurey
I now tell people in my Guerrilla classes that, despite the fact that I repeat this rule of thumb several times, you will throw the kitchen sink into your performance models; at least early on, as you first learn how to create them. It's almost axiomatic: the more you know about the system architecture, the more detail you will try to throw into the model. The goal, in fact, is the opposite.

2.4  More Like The Map Than The Metro

A performance model is to a computer system as a station map is to a metro system.
The station map is an abstraction that has very little to do with the physical railway. It encodes only sufficient detail to enable transit on the rail system from point A to point B. It does not include a lot of irrelevant details such as whether the stations are above ground or below ground, or even their actual geographical proximity. A performance model is a similar kind of abstraction.

2.5  The Big Picture

Unlike most aspects of computer technology, performance modeling is about deciding how much detail can be ignored!

2.6  A Point of Principle

When trying to construct a performance model for a computer system (which may or may not be a queueing model), look for the principle of operation (PoP). If you can't describe the principle of operation in 25 words or less, you probably don't understand it yet.
As an example, consider a classic multi-user time-share OS, e.g., Unix. Its PoP can be stated as follows:
Time-share scheduling gives every user the illusion that they are the only user on the system. [17 words]
All the thousands of lines of code in the OS that implement time-slicing, priority queues, and so forth, are there merely to support that PoP. The performance goal in that case is user response time.
Being able to see things in this unconventional way can really help in deciding on the simplest kind of performance model.
Now that you get the idea, here are a couple of questions for you to ponder:
  1. Your laptop is also a time-share OS, e.g., Mac OS X, Ubuntu Linux, Windows. Is the PoP for that single-user OS the same or different than it is for a multi-user OS?
  2. Virtualization environments, like VMware or Parallels hypervisors are all the rage, but these hypervisors are nothing more than a kind of global OS with each time-share OS treated as an embedded guest OS. Is the hypervisor PoP the same or different than it is for a time-share OS?
See Chapter 7 of the GCaP book for more information related to these questions.

2.7  Guilt is Golden

Performance modeling is also about spreading the guilt around.
You, as the performance analyst or planner, only have to shine the light in the right place and then stand back while others flock to fix it.

2.8  What is a Queue?

From a computer architecture and performance analysis standpoint, you can think of a queue as a buffer. Sizing the buffer is often associated with meeting performance and capacity requirements.
From a queue-theoretic standpoint, a buffer can be thought of as having either a fixed capacity (the usual case in reality) or an unbounded but not infinite capacity. The latter is very important for finding out what buffer size you really need, rather than what buffer size you think you need.
  1. A queue is a line of customers waiting to be severed—as in "Off with their heads!" (*)
  2. Hardware version: The buffer or queue is implemented as a register, e.g., a memory register.
  3. Software version: The buffer or queue is implemented as a list. In some computer languages it is a separate data type, e.g., Lisp, Mathematica, Perl, etc.
(*) I did mistakenly write "severed" while discussing queues during a Guerrilla class in November, 2002.
In Chapter 2 Getting the Jump on Queueing and Appendix B A Short History of Buffers of my Perl::PDQ book, I point out that a queue is a very appropriate paradigm for understanding the performance of computer systems because it corresponds to a data buffer. Since all digital computer and network systems can be considered as a collection of buffers, their performance can be modeled as a collection of queues, aka: queueing network models, where the word "network" here means circuit; like an electric circuit.
PDQ (Pretty Damn Quick) helps you to sever computer systems with queues.
Queueing theory is a relatively young science, having just turned 100 in 2009.

2.9  Where to Start?

Why not have fun with blocks—functional blocks!
[Figure: fun blocks]
One place to start constructing a PDQ model is by drawing a functional block diagram. The objective is to identify where time is spent at each stage in processing the workload of interest. Ultimately, each functional block is converted to a queueing subsystem like those shown above. This includes the ability to distinguish sequential and parallel processing. Other diagrammatic techniques, e.g., UML diagrams, may also be useful but I don't understand that stuff and have never tried it. See Chap. 6 "Pretty Damn Quick (PDQ) - A Slow Introduction" of Analyzing Computer System Performance with Perl::PDQ.

2.10  Inputs and Outputs

When defining performance models (especially queueing models), it helps to write down a list of INPUTS (measurements or estimates that are used to parameterize the model) and OUTPUTS (numbers that are generated by calculating the model).
Take Little's law Q = X R for example. It is a performance model; albeit a simple equation or operational law, but a model nonetheless. All the variables on the RIGHT side of the equation (X and R) are INPUTS, and the single variable on the LEFT is the OUTPUT. A more detailed discussion of this point is presented in Chap. 6 "Pretty Damn Quick (PDQ) - A Slow Introduction" of Analyzing Computer System Performance with Perl::PDQ.
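
Written as code, the INPUT/OUTPUT separation becomes explicit. A trivial R sketch (the numbers are made up):

qmodel <- function(X, R) {   # INPUTS: throughput X (req/s), residence time R (s)
	X * R                # OUTPUT: average number of requests in residence, Q
}
qmodel(X = 120, R = 0.5)     # => 60 requests in the system, on average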

2.11  No Service, No Queues

You know the restaurant rule: "No shoes, no service!" Well, this is the corresponding PDQ rule: no service, no queues.
If you cannot provide the service time as a parameter, then you can't define the queue as a PDQ node in your performance model. (No matter how much you think it ought to be there.)

2.12  Estimating Service Times

Service times are notoriously difficult to measure directly. Often, however, the service time can be calculated from other performance metrics that are easier to measure.
Suppose, for example, you had requests coming into an HTTP server and you could measure its CPU utilization with some UNIX tool like vmstat, and you would like to know the service time of the HTTP Gets. UNIX won't tell you, but you can use Little's law (U = X S) to figure it out. If you can measure the arrival rate of requests in Gets/sec (X) and the CPU %utilization (U), then the average service time (S) for a Get is easily calculated from the quotient U/X.
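
In R, that calculation is a one-liner. A sketch with hypothetical measurements:

X <- 250       # Get rate from the HTTP logs (Gets/sec) -- hypothetical
U <- 0.45      # CPU utilization from vmstat (45% busy) -- hypothetical
S <- U / X     # average CPU service time per Get
S * 1000       # => 1.8 ms of CPU consumed per Get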

2.13  Change the Data

If the measurements don't support your PDQ performance model, change the measurements.

2.14  Closed or Open Queue?

When trying to figure out which queueing model to apply, ask yourself if you have a finite number of requests to service or not. If the answer is yes (as it would be for a load-test platform), then it's a closed queueing model. Otherwise use an open queueing model.

2.15  Opening a Closed Queue

How do I determine when a closed queueing model can be replaced by an open model?
This important question arises, for example, when you want to extrapolate performance predictions for an Internet application (open) that are based on measurements from a load-test platform (closed).
An open queueing model assumes an infinite population of requesters initiating requests at an arrival rate λ (lambda). In a closed model, λ (lambda) is approximated by the ratio N/Z. Treat the thinktime Z as a free parameter, and choose a value (by trial and error) that keeps N/Z constant as you make N larger in your PDQ model. Eventually, at some value of N, the OUTPUTS of both the closed and open models will agree to some reasonable approximation.
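
Here is a minimal R sketch of that procedure for a single queueing node, using exact MVA for the closed model and the M/M/1 formula for the open model. The 10 ms service demand and 50 req/s arrival rate are hypothetical:

S <- 0.01                       # service demand at the node (sec)
lambda <- 50                    # arrival rate for the open model (req/s)
Ropen <- S / (1 - lambda * S)   # open (M/M/1) residence time

mva <- function(N, Z, S) {      # exact MVA: one queueing node plus think time Z
	Q <- 0
	for (n in 1:N) {
		R <- S * (1 + Q)        # residence time seen by the nth customer
		X <- n / (R + Z)        # throughput via the response-time law
		Q <- X * R              # Little's law at the node
	}
	R
}

for (N in c(10, 100, 1000)) {
	Z <- N / lambda             # hold N/Z fixed at lambda as N grows
	cat(sprintf("N = %4d: closed R = %.6f  open R = %.6f\n",
		N, mva(N, Z, S), Ropen))
}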

2.16  Steady-State Measurements

The steady-state measurement period should be on the order of 100 times larger than the largest service time. For example, if the largest service time is a 500 millisecond database query, the measurement window should be at least 50 seconds.

2.17  Transcribing Data

Use the timebase of your measurement tools. If it reports in seconds, use seconds; if it reports in microseconds, use microseconds. The point being, it's easier to check the digits directly for any transcription errors. Of course, the units of ALL numbers should be normalized before doing any arithmetic.

2.18  Workloads Come in Threes

In a mixed workload model (multi-class streams in PDQ), avoid using more than 3 concurrent workstreams whenever possible.
Apart from making an unwieldy PDQ report to read, generally you are only interested in the interaction of 2 workloads (pairwise comparison). Everything else goes in the third (AKA "the background"). If you can't see how to do this, you're probably not ready to create the PDQ model.

2.19  Better Than a Crystal Ball

A performance model is not clairvoyant, but it's better than a crystal ball; which is just a worthless piece of glass.
Predicting the future is not the same thing as "seeing" the future. A performance model is just a means for evaluating the data that are provided to it. The purpose of the model is to transform those data into information. The amount of information that can be extracted is intimately dependent on the values of those data. Change the input data and you change the output information. An oft-quoted example is: garbage in, garbage out.
That's an extreme example. More commonly, you may see unexpected modeling results. In that case, the new data do not meet the expectations set by the prior measurements. But that doesn't necessarily imply the model is wrong. More likely something has changed in the measurement system so that it has failed to remain consistent with the initial information contained in the previous measurements.
Having a performance model forces you to ask why that unexpected change has occurred and can anything be done to remove it. Without a performance model, you don't have any expectations or context for asking such questions.

2.20  Patterns and Anti-Patterns

All meaning has a pattern, but not all patterns have a meaning.
Visual example:
patterns
Textual example:
Colorless green ideas sleep furiously. —N. Chomsky (1957)
New research indicates that if a person is not in control of the situation, they are more likely to see patterns where none exist, suffer illusions and believe in conspiracy theories.
In the context of computer performance analysis, the same conclusion might well apply when looking at data collected from a system that you don't understand.

2.21  Interpreting Data

Performance modeling can often be more important for interpreting data than predicting it.
The conventional view of performance models is that they are useful for:
  1. Predicting the future performance of an extant system
  2. Exploring what-if scenarios that may or may not be realistic
A performance model can also be used for interpreting performance measurements. Both the measurements and the model must be consistent or something is wrong and needs to be explained.

2.22  Intuition and Modeling

Intuition is a seductive siren who will let you crash on the rocks of misunderstanding so, better to tie yourself to the mast of math.
Ulysses is the Greek hero in Homer's Odyssey. On his way back from the Trojan wars, Ulysses orders his men to tie him to the mast of his ship and to plug their own ears so that they will not succumb to the beautiful song of the sirens and be diverted to their deaths. Ulysses, being a typical manager, chooses to be bound and to keep his ears unplugged because he cannot bear the idea of not hearing the sirens' music.

2.23  Load Average

The load average in UNIX and Linux is not your average kind of average. It's actually an exponentially damped moving average (EMA) of the type commonly used in data smoothing. More especially from the standpoint of performance analysis and capacity planning, it's the EMA of the O/S run-queue size.
Although it can be useful to know the load average, it's rather limited in value because it's an absolute metric. For the purposes of comparison, it's even better to know a relative metric like the stretch factor, which is related to the load average. See item 4 below.
For a more complete discussion, see:
  1. See Chapter 4 in 1st edn. of the Perl::PDQ book.
  2. See Chapter 6 in 2nd edn. of the Perl::PDQ book.
  3. Read the original online articles (as PDFs):
    a. How the Load Average Works
    b. Not Your Average Average
    c. Addendum on Hz versus HZ
  4. How to convert the absolute load average to the relative stretch factor metric.  
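
As described in those articles, the 1-minute load average is an exponential smoothing of 5-second samples of the run-queue length. A minimal R sketch (the run-queue samples are made up):

n <- c(1, 2, 2, 3, 1, 0, 0, 1, 2, 5, 3, 2)   # sampled run-queue lengths (hypothetical)
w <- exp(-5/60)          # damping factor: 5 sec samples, 60 sec window
load1 <- 0               # 1-minute load average, starting from idle
for (q in n) {
	load1 <- load1 * w + q * (1 - w)   # exponentially damped moving average
}
round(load1, 2)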

2.24  VAMOOS Your Data Angst

It's easy to get carried away and jump the gun trying to model your Perf or CaP data ... and get it wrong. :/
Instead, try to follow these basic VAMOOS steps:
  1. Visualize: Make a plot of your data without making any assumptions about how it should look. This is where tools like scatter plots come in.
  2. Analyze: Look for patterns or other significant features in the data and possibly quantify them, e.g., trends in the distribution of data points or periodically repeating features, such as spikes or peaks.
  3. Modelize: Consider different types of models: SWAG, statistical regression, queue-theoretic, simulation, etc. If you are using the Chart>Add Trendline feature in Excel, this is where you choose your model from the Excel dialog box.
    [Figure: Excel Add Trendline dialog]
    Don't fret over whether or not it's the "right" choice: there's no such thing, at this point. Whatever your choice, try to make it consistent with steps 1 and 2. If it doesn't work out it doesn't matter because, based on the next step, you're about to come around again. :)
  4. Over and Over: None of this is likely to converge in a single pass.
  5. Satisfied: Iterate your way to success. Repeat until you are satisfied that all your assumptions are consistent.
In other words, VAMOOS stands for Visualize, Analyze, Modelize, Over and Over until Satisfied.
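
Steps 1 and 3 can be as simple as this in R. A sketch with synthetic data; in practice, x and y would come from your monitoring tools:

x <- 1:50                                       # e.g., user load (synthetic)
y <- 0.1 * x / (1 - x/60) + rnorm(50, sd=0.2)   # synthetic response-time-like data
plot(x, y, main="Visualize first")              # step 1: look before you model
fit <- lm(y ~ x)     # step 3: a deliberately naive linear model
abline(fit, lty=2)   # does the trendline respect the data? If not, go around again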

2.25  Measurement Errors

All measurements are wrong by definition.
Measurements are expressed using numbers, but measurement is not the same as mathematics. Mathematics involves exact quantities, e.g., the prime numbers. They are exact by definition. The number 11 is provably a prime number. Exactly. It's not approximately a prime number.
Measured numbers, on the other hand, are the result of a process involving comparisons. It's an estimation procedure and therefore cannot produce exact numbers; despite the fact that measurements are often displayed as though they were exact numbers, e.g., 11 seconds or 23% busy. More on that shortly. The measurement process tries to determine the closeness of the comparison. Since the assessed closeness is always an approximation, it can never be exact (like a prime number) and therefore, must contain errors.
In that sense, measured numbers look more like the mathematical representation of irrational numbers, e.g., π. By definition, π = C/D is the ratio of the circumference of a circle C to its diameter D. We can think of the diameter as being like a ruler or tape measure, where the ratio C/D is like asking: how many diameters are there in the circumference? One possible result is the ratio 3/1, i.e., there are 3 diameters in the circumference. But you already know that the exact number 3 is less than the value of π. If the circumference was π meters long, there would be a small gap remaining after you stepped your meter-ruler three times along the circumference.
On the other hand, you might be willing to accept π = 3 as a rough approximation, and live with the gap, depending on how you plan to use that value. For example, it might be good enough to do a quick calculation in your head. Successively better approximations could involve ratios like: 22/7, 333/106, 355/113. These ratios are analogous to reading off graduation marks on your ruler to improve the precision of your measurement. But none of these ratios or fractions is exactly equal to π. That's why π is declared to be ir-ratio-nal: unable to be written exactly as a ratio of whole numbers. Nor is a more familiar number, like 3.1415926, any better. That's just an inexact decimal fraction. They are all approximations, with attendant errors.
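
You can watch the error shrink, but never vanish, in R:

> approx <- c(3, 22/7, 333/106, 355/113)
> signif(abs(approx - pi), 3)   # absolute error of each successive ratio
[1] 1.42e-01 1.26e-03 8.32e-05 2.67e-07
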
The more important question is, how big is the error? A secondary question is, can you tolerate that level of error?
It's a bit of an indictment that no performance measurement tools display errors. That omission leads people to believe (inadvertently) that a reported number, like 23% CPU busy, is the same kind of number as the 9th prime number (viz., 23). Since all performance metrics are measured (indeed, those two words share the same root), a metric should really be displayed as 23 ± 5% (for example) to remind us that there is always error or "wrongness" in the measurement.

2.26  Modeling Errors

All performance models are wrong, but some are wronger than others.
Predicting the future is not the same thing as "seeing" the future, in the sense of seeing what things might look like. See GMantra 2.19 for more on this point. All predictions are merely estimates based on (input) data (or other estimates) and therefore predictions come with errors. The only real question is, how much error can you tolerate?

2.27  Data Ain't Information

Data is not (are not) the same thing as information.
A Google search returns a lot of data, not information. Google admits that: Are you feeling lucky? It should be a certainty, but it's not. Sneakily, Google also knows that your brain will quickly decide what is information in all those links, such that you don't even realize you're doing most of the work which you unconsciously attribute to Google. They know your brain craves patterns.
Collecting performance data is only one half the story: rather like a Google search. You still have to decide what information (if any) is contained in all those data. Unlike a Google search, however, performance data are not simple text (simple for your brain, that is). Performance data usually come in the form of a torrent of numbers which, unlike text, are not simple for your brain to comprehend.
Even worse, without doing the proper analysis, the data can be deceptive and lead you to the wrong conclusion. That's where performance analysis tools and models come in. They act as transformers on the data to help your brain decide what is information.

2.28  Data Science

Data science. Pfft!
Anything that has to call itself a "science," usually isn't. (Social Science?) How about Information Science?
Pretty soon we'll be talking about Information Technology (IT). Oh, wait!...

3  UNIVERSAL SCALABILITY LAW (USL)

This section has grown to the point where it now has its own page updated with all the most recent developments.


