Guerrilla Data Analysis Techniques (GDAT)
Emphasizing the use of
R statistical tools
and PDQ-R
Contents
1 Purpose
2 Certification
3 Course Goals
4 Dates and Registration
5 Course Outline
5.1 GDAT Day 1
5.2 GDAT Day 2
5.3 GDAT Day 3
5.4 GDAT Day 4
5.5 GDAT Day 5
6 Guest Instructors
7 Terms and Conditions
8 Textbooks
1 Purpose
You already understand the essential concepts of computer system capacity planning
(e.g., Level II certification) and
you've collected cubic light years of performance data. But now you realize that's not sufficient.
Why? Because raw performance data is not the same thing as performance information.
To extract the pertinent information, you need to transform your data.
And that's precisely what this class teaches you.
Moreover, the data analysis techniques we present are general purpose, and therefore
not tied to any particular computing platform or data collection tools.
Although there are no prerequisites, it is strongly recommended that you take the
Level II GCaP class
before embarking on this Level III GDAT class.
2 Certification
This class corresponds to Guerrilla Capacity Planner: Level III certification.
The levels are defined as:
- Level I: Entry level, e.g.,
Guerrilla Boot Camp.
- Level II: Exposure to a wide variety of computer systems capacity planning concepts, methods, and
tools that can be adapted opportunistically to support the needs of
enterprise-level platform-independent performance management.
An example class is
Guerrilla Capacity Planning.
- Level III: Detailed study of a particular capacity planning technique or performance analysis tool.
A printed certificate reflecting the level of achievement is awarded to each attendee who completes the course.
3 Course Goals
After completing this course, students will know how to:
- Transform data into information.
- Use statistical techniques and tools to reduce the number of metrics that
need to be monitored and analyzed.
- Apply regression analysis to determine the scalability of web applications and services.
4 Dates and Registration
Check the
schedule
page for the latest information.
Online
registration
is available. Additional registration details are provided at the end of this page.
Who Should Attend
This class is intended for application scientists and engineers,
computer architects, compiler writers, and software engineers who use or
design high-performance computer systems. The level of the presentation is
appropriate for both practitioners and students. Experts from any
scientific discipline will find this class useful in helping to
understand how to appropriately measure and statistically analyze the
performance of their systems and applications.
Content level: 20% beginner, 60% intermediate, 20% advanced.
5 Course Outline
Class begins at 9am and ends at 5pm each day.
A catered half-hour morning break is held around 10:30am.
Seated lunch service is provided from noon until 1pm.
A catered half-hour afternoon break is held around 3:00pm.
A large number of practical exercises (with solutions in R) will be given and discussed throughout
the five days. You are encouraged to bring a laptop computer.
5.1 GDAT Day 1
- How to Detect Bad Data
- All data is wrong by definition
- Broken performance tools
- The power of good statistical models
- Introduction to R
- Why R is de RigueuR on Wall St and elsewhere
- My special 911.r script
- R commands
- R language
- R graphics
- Installing R
- Measurement Errors and Analysis
- Measurement is a process not a number
- Confidence intervals and sigma levels
- Confidence bands and QQ plots
- How to express errors
5.2 GDAT Day 2
- Review of Elementary Statistics
- Descriptive statistics
- Measures of central tendency: mean, median and mode
- Meaning of the means: arithmetic, geometric, harmonic
- Measures of dispersion: stdev, variance, stderr, percentiles
- Summarizing data and its statistics
- Distributions and Histograms
- Review of Uniform, Normal, Poisson, Exponential distributions
- How to determine normal distributions
- How to determine exponential/Poisson distributions
- Weighted multi-class workloads
- Review of Benchmarking and Load Test Tools
- History of industry benchmarks SPEC and TPC
- Steady-state measurement period
- Comparing vendor benchmarks
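As a small illustration of the "meaning of the means" topic on Day 2, the sketch below compares the arithmetic, geometric, and harmonic means of three throughput rates. It is not taken from the class materials: the class exercises use R, Python is used here only for illustration, and the measurements and the `overall_rate` helper are hypothetical. The point it shows is that when each run moves the same amount of work, the overall rate equals the harmonic mean, not the arithmetic mean.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical throughput rates (e.g., MB/s) from three runs that each
# transfer the same amount of data. These numbers are made up.
rates = [10.0, 20.0, 40.0]


def overall_rate(rates, work_per_run=1.0):
    """Total work divided by total time, assuming equal work per run."""
    total_work = work_per_run * len(rates)
    total_time = sum(work_per_run / r for r in rates)
    return total_work / total_time


arithmetic = mean(rates)
geometric = geometric_mean(rates)
harmonic = harmonic_mean(rates)

print(f"arithmetic mean: {arithmetic:.3f}")
print(f"geometric mean:  {geometric:.3f}")
print(f"harmonic mean:   {harmonic:.3f}")
print(f"overall rate:    {overall_rate(rates):.3f}")
```

For any set of positive values that are not all equal, the three means are ordered harmonic < geometric < arithmetic, so choosing the wrong one can overstate the performance of a system measured in rates.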
5.3 GDAT Day 3
- Scalability Analysis
- Load test data and QA analysis
- Universal scalability law
- Analyzing data for scalability zones
- Multivariate Linear and Nonlinear Regression
- ANOVA: Analysis of Variance
- Moving averages
- Web server scalability
- Web traffic profiles and time zones
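To give a flavor of the Day 3 scalability topics, here is a minimal sketch of fitting the universal scalability law to load-test data by regression. It assumes the commonly stated form C(N) = N / (1 + α(N − 1) + βN(N − 1)), where C(N) = X(N)/X(1) is relative capacity, α is the contention parameter, and β is the coherency parameter. The data are synthetic, the function names are hypothetical, and the fit uses a simple linearized least-squares approach rather than whatever method the class presents. The class exercises use R; Python is used here only for illustration.

```python
import math


def usl_capacity(n, alpha, beta):
    """Relative capacity C(N) under the universal scalability law."""
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


def fit_usl(loads, throughputs):
    """Estimate (alpha, beta) from load-test measurements.

    Linearizes the model as N/C(N) - 1 = alpha*(N-1) + beta*N*(N-1)
    and solves the 2x2 normal equations of a no-intercept least-squares
    fit. Assumes a measurement at N = 1 is present for normalization.
    """
    x1 = throughputs[loads.index(1)]
    suu = suv = svv = suy = svy = 0.0
    for n, x in zip(loads, throughputs):
        c = x / x1              # relative capacity C(N)
        u = n - 1               # contention regressor
        v = n * (n - 1)         # coherency regressor
        y = n / c - 1.0         # linearized response
        suu += u * u
        suv += u * v
        svv += v * v
        suy += u * y
        svy += v * y
    det = suu * svv - suv * suv
    alpha = (suy * svv - svy * suv) / det
    beta = (svy * suu - suy * suv) / det
    return alpha, beta


def peak_load(alpha, beta):
    """Load at which C(N) is maximal: N* = sqrt((1 - alpha) / beta)."""
    return math.sqrt((1.0 - alpha) / beta)


# Synthetic, noise-free data generated from known (made-up) parameters.
loads = [1, 2, 4, 8, 16, 32]
true_alpha, true_beta, x1 = 0.05, 0.001, 100.0
throughputs = [x1 * usl_capacity(n, true_alpha, true_beta) for n in loads]

alpha, beta = fit_usl(loads, throughputs)
print(f"alpha = {alpha:.4f}, beta = {beta:.5f}")
print(f"throughput peaks near N = {peak_load(alpha, beta):.1f}")
```

With real, noisy measurements the linearized fit weights the points differently from a direct nonlinear regression on throughput, so the two approaches can give somewhat different parameter estimates.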
5.4 GDAT Day 4
- Data Mining Techniques for CaP
- Machine learning algorithms
- Support Vector Machines
- Supervised learning
- The svm package in R
- Detecting performance patterns and defining exceptions
- Wild Not Mild Data Distributions
- Power law data and distributions
- Case studies: SQL access patterns, web traffic, data recovery
- Data validation using qqplots, log-linear plots and log-log plots
5.5 GDAT Day 5
- Taming the Data Torrent
- Principal component analysis
- Reducing the number of monitored metrics
- Case studies: PerfViz, Apdex, Barry
- PDQ-R Queueing Modeling Tool
- The statistics of queues
- Case study: Modeling networked storage
- Case study: Multi-tier e-commerce data and PDQ analysis
- Review and Class Discussion
6 Guest Instructors
From time to time, depending on availability and location, one or more of the following guest instructors may be involved:
- David Lilja:
David is currently Professor of Electrical and Computer Engineering
at the University of Minnesota in Minneapolis. His expertise lies in the
application of statistical analysis of computer simulations and design of
experiments.
- Jim Holtman:
Jim previously worked at Bell Labs when R (then called S) was being developed
and has therefore essentially been using R since its inception. Currently, he is
working with the research group at Kroger's grocery stores doing simulations of
scheduling protocols using R.
- Stephen O'Connell:
Stephen is something of a model Guerrilla student in that he took to R in earnest after attending a GDAT class several years ago.
Since then he has also learnt how to apply machine learning tools in R to computer system performance and capacity planning data.
7 Terms and Conditions
Tuition Fees
Please consult the
Class Schedule
page for current pricing and conditions.
Transportation
Information will be sent upon receipt of enrollment. A packet will include
airport and transportation options.
Reservations
All confirmed reservations must be
accompanied by a purchase order number, a check for the tuition, or credit card
information for billing. Courtesy reservations will be held for up to 30 days in
order for paperwork to be processed, so long as there is sufficient time and
adequate space in the course.
8 Textbooks
Currently, there is no official textbook for this class.
Location
Please consult the
Class Schedule
for hotel location details.
The city of Pleasanton is right next door to Castro Valley.
Meals
Breakfast, lunch, and the morning and afternoon breaks will be catered by the hotel each day. See the
Mini Survival Guide
for directions to the hotel and a list of local restaurants.
File translated from TeX by TTH, version 3.38, on 15 May 2012, 11:50.