Skip to content

Statistical Methods for Performance Experiments

This page defines the statistical methodology for experiments where outcomes vary across runs. It complements the Experiment Framework and Evidence Levels.

Experiment type classification

Every experiment must declare its type before execution begins.

Type Definition Qualifying test Repetition rule
Config Outcome is deterministic once preconditions are fixed Given identical setup, the success/failure result is the same every time Single run sufficient
Performance Outcome varies across runs due to shared infrastructure, noisy neighbors, or timing Same setup can produce different latency, throughput, or error rates across runs Multiple independent runs required
Hybrid Contains both deterministic and variable components Config aspect determines if feature works; Performance aspect measures how well Split: validate Config first (single run), then measure Performance (multiple runs)

Classification examples

Experiment Type Reasoning
Custom DNS resolution drift Config DNS either resolves or it doesn't — deterministic
Target port auto-detection Config Port detection succeeds or fails based on configuration
SNAT exhaustion under load Performance Connection failure rate varies with platform load and timing
Cold start latency breakdown Performance Startup duration varies across invocations
Managed identity RBAC propagation Hybrid RBAC either works (Config), but propagation delay varies (Performance)

Run requirements

Minimum independent runs

Decision impact Minimum runs Recommended Evidence level cap
Informational (blog, knowledge base) 3 5 Directional at 3; Confirmed at 5+
Operational (support playbook) 5 7 Confirmed at 5+ consistent
Critical (architecture decision) 7 10+ Confirmed at 7+ consistent

Evidence ceiling

A performance experiment with fewer than 3 independent runs caps at Correlated evidence level regardless of how clear the signal appears. Single-run performance data is Observed only — never Measured or Strongly Suggested.

What counts as an independent run

Each run must be independent in the statistical sense:

  • New deployment or full container restart between runs (not just request replay)
  • Separated by at least 5 minutes to avoid warm cache or connection pool carryover
  • Same stimulus profile: identical request count, payload, concurrency, and duration
  • Logged independently: each run produces its own raw data file

Cold start experiments

For cold start measurements, each run must include a true cold event (scale from zero or fresh deployment). Warm follow-ups within the same run are part of that run's data, not separate runs.

Metrics to report

Primary metrics table

Every performance experiment must include this table in the Results section:

Metric Config A Config B Unit
Runs (n) count
Median (p50) ms / % / count
p95 ms / % / count
p99 ms / % / count
IQR ms / % / count
Min ms / % / count
Max ms / % / count

Why median over mean

Cloud workload latency distributions are typically right-skewed with heavy tails. The mean is distorted by outliers (a single 30-second timeout inflates the mean of 100 sub-second requests). The median is robust to these outliers and better represents "typical" behavior.

Always report median as the primary central tendency. Report mean only as a supplementary metric, with a note about skew if mean > 1.5× median.

Per-run vs per-request metrics

Granularity Use when Example
Per-run summary Comparing configurations Median of each run's p95 latency
Per-request detail Analyzing distributions All individual request durations from a single run

When comparing configurations, compute per-run summaries first (e.g., each run's median latency), then compare those summaries across runs. This avoids pseudo-replication (treating 1000 requests from one run as 1000 independent observations).

Warm-up exclusion

Protocol

Experiment type Warm-up rule Rationale
Steady-state latency Exclude the longer of: first 2 minutes or first 100 successful requests JIT compilation, connection pool warm-up, DNS cache population
Cold start / scale-to-zero No exclusion — cold period IS the measurement The cold event is the subject under study
Throughput / load test Exclude ramp-up phase until target concurrency is reached Load generator stabilization

Recording warm-up data

Preserve, don't discard

Warm-up data must be recorded and preserved in raw data files. Exclusion happens at analysis time, not collection time. The raw data directory must contain the complete dataset. Mark the warm-up boundary in the analysis output:

warm_up_boundary:
  method: "first 2 minutes"
  excluded_requests: 87
  excluded_until: "2026-04-10T14:02:00Z"

Outlier policy

Decision tree

Is the outlier consistent across multiple runs?
├── YES → It is a real tail behavior, not an outlier. INCLUDE it.
│         Report it in p99 and note its frequency.
└── NO  → Present in only 1 of N runs?
    ├── Identifiable external cause? (platform event, deployment, unrelated alert)
    │   └── YES → EXCLUDE the affected run. Note the exclusion reason.
    │             Replace with an additional run if below minimum run count.
    └── No identifiable cause?
        └── INCLUDE it. Outliers without explanation may be the finding.
            Report with and without the outlier for transparency.

Documenting exclusions

Every excluded data point or run must be documented:

### Excluded runs

| Run | Reason | Evidence |
|-----|--------|----------|
| Run 3 | Platform deployment event during measurement window | Activity log entry at 14:05 UTC |

Comparison methodology

When comparing two configurations (A vs B)

Step 1: Visual comparison

Create a box plot of per-run summaries (median or p95) for each configuration. If the boxes do not overlap, the difference is likely meaningful.

Step 2: Effect size

Calculate the difference in medians between configurations, expressed as a percentage:

Effect size = (Median_B - Median_A) / Median_A × 100%
Effect size Category Interpretation
< 10% Negligible Within normal cloud variance
10–25% Small Noticeable but may not warrant action
25–50% Medium Likely operationally significant
> 50% Large Strong signal, high confidence in real difference

Tail-sensitive metrics

For SLO-relevant metrics (p95, p99), use a lower threshold: >10% difference in p95 is operationally significant even if medians differ by less.

Step 3: Statistical test (when n ≥ 5 per group)

Use the Mann-Whitney U test on per-run medians:

  • Non-parametric — no assumption about distribution shape
  • Appropriate for small sample sizes (5–10 runs per group)
  • Report U statistic and p-value
  • Significance threshold: p < 0.05
from scipy.stats import mannwhitneyu

# per_run_medians_A = [median_latency for each run of config A]
# per_run_medians_B = [median_latency for each run of config B]

stat, p_value = mannwhitneyu(
    per_run_medians_A,
    per_run_medians_B,
    alternative='two-sided'
)

Step 4: Confidence interval (bootstrap)

When formal tests are not applicable (n < 5), compute a bootstrap 95% confidence interval on the difference in medians:

import numpy as np

def bootstrap_ci(a, b, n_bootstrap=10000, ci=0.95):
    diffs = []
    for _ in range(n_bootstrap):
        sample_a = np.random.choice(a, size=len(a), replace=True)
        sample_b = np.random.choice(b, size=len(b), replace=True)
        diffs.append(np.median(sample_b) - np.median(sample_a))
    lower = np.percentile(diffs, (1 - ci) / 2 * 100)
    upper = np.percentile(diffs, (1 + ci) / 2 * 100)
    return lower, upper

If the 95% CI excludes zero, the difference is statistically meaningful.

Reporting comparison results

Use this template in the Interpretation section:

**Comparison: Config A vs Config B**

- Effect size: +42% increase in p50 latency (Medium)
- Mann-Whitney U: U=2.0, p=0.016 (n_A=5, n_B=5)
- Bootstrap 95% CI for difference in medians: [+85ms, +210ms]
- Conclusion: Config B shows statistically significant higher latency [Measured]

Visualization standards

All charts use Vega-Lite. The following chart types are required for performance experiments:

1. Box plot — run-level comparison

Shows distribution of per-run summaries across configurations.

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "Template: per-run metric comparison", "background": "#FFFFFF", "padding": 12, "config": { "view": {"fill": "#FFFFFF", "stroke": "#D1D5DB"}, "axis": {"labelColor": "#111827", "titleColor": "#111827", "gridColor": "#E5E7EB", "domainColor": "#94A3B8", "tickColor": "#94A3B8", "labelFontSize": 12, "titleFontSize": 13}, "legend": {"labelColor": "#111827", "titleColor": "#111827", "orient": "top"}, "title": {"color": "#111827", "anchor": "start", "fontSize": 16, "subtitleColor": "#475569", "subtitleFontSize": 12} }, "width": 500, "height": 300, "data": {"values": []}, "mark": "boxplot", "encoding": { "x": {"field": "configuration", "type": "nominal", "title": "Configuration"}, "y": {"field": "value", "type": "quantitative", "title": "Metric (unit)"}, "color": {"field": "configuration", "type": "nominal"} } }

2. Time series with percentile bands

Shows metric behavior over time with p50 line and p5–p95 shaded band.

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "Template: time series with percentile bands", "background": "#FFFFFF", "padding": 12, "config": { "view": {"fill": "#FFFFFF", "stroke": "#D1D5DB"}, "axis": {"labelColor": "#111827", "titleColor": "#111827", "gridColor": "#E5E7EB", "domainColor": "#94A3B8", "tickColor": "#94A3B8", "labelFontSize": 12, "titleFontSize": 13}, "legend": {"labelColor": "#111827", "titleColor": "#111827", "orient": "top"}, "title": {"color": "#111827", "anchor": "start", "fontSize": 16, "subtitleColor": "#475569", "subtitleFontSize": 12} }, "width": 500, "height": 300, "layer": [ { "mark": {"type": "area", "color": "#93C5FD", "opacity": 0.25}, "encoding": { "x": {"field": "timestamp", "type": "temporal"}, "y": {"field": "p5", "type": "quantitative"}, "y2": {"field": "p95"} } }, { "mark": {"type": "line", "color": "#2563EB", "strokeWidth": 3}, "encoding": { "x": {"field": "timestamp", "type": "temporal"}, "y": {"field": "p50", "type": "quantitative", "title": "Metric (unit)"} } } ], "data": {"values": []} }

3. Scatter plot — individual runs

Shows each run as a point, useful for detecting clusters or outliers.

Use for: run-to-run variability, identifying if one run is clearly anomalous.

Raw data preservation

Directory structure

All raw data is stored in the repository under data/:

data/
├── app-service/
│   └── {experiment-slug}/
│       ├── run-001/
│       │   ├── requests.csv          # Per-request data
│       │   ├── metrics.json          # Azure Monitor metrics export
│       │   ├── traces.json           # Application Insights traces
│       │   └── metadata.yaml         # Run metadata
│       ├── run-002/
│       └── analysis/
│           ├── summary.csv           # Aggregated per-run summaries
│           └── comparison.json       # Statistical test results
├── functions/
│   └── {experiment-slug}/
│       └── ...
├── container-apps/
│   └── {experiment-slug}/
│       └── ...
└── cross-cutting/
    └── {experiment-slug}/
        └── ...

Run metadata schema

Each run directory must contain a metadata.yaml file:

experiment: snat-exhaustion
service: app-service
run_number: 1
date: 2026-04-10
start_time: "14:00:00Z"
end_time: "14:45:00Z"
configuration:
  sku: P1v3
  region: koreacentral
  runtime: python-3.11
  custom: {}                  # experiment-specific settings
warm_up:
  method: "first 2 minutes"
  excluded_until: "14:02:00Z"
  excluded_requests: 87
environment:
  az_cli_version: "2.83.0"
  core_tools_version: "4.8.0"
  os: linux
notes: ""

File format conventions

Data type Format Reason
Per-request timings CSV Easy to load in pandas, Excel, R
Azure Monitor metrics JSON Native export format
Application Insights traces JSON Native KQL export format
Run metadata YAML Human-readable, git-friendly
Analysis summaries CSV Tabular, easy to aggregate
Statistical test results JSON Structured, machine-readable

Evidence level mapping

Performance experiment evidence levels depend on run count and consistency:

Runs Result consistency Maximum evidence level
1 N/A Observed — single data point, no statistical power
2 Both agree Correlated — suggestive but insufficient
3 All agree Inferred — directional evidence
3 2 of 3 agree Correlated — inconsistent signal
5+ All agree, effect size ≥ Medium Measured — quantitatively confirmed
5+ Statistical test p < 0.05 Strongly Suggested — strong evidence
5+ Inconsistent or small effect Correlated — signal exists but weak

Practical implication

Most experiments in this repository target 5 independent runs per configuration. This is the minimum threshold for using formal statistical tests and achieving Measured or Strongly Suggested evidence levels.

Reporting template for performance experiments

The Results (section 10) and Interpretation (section 11) of performance experiments must include the following structure:

Section 10: Results — performance experiment additions

### Experiment type

**Performance** — results vary across runs due to [specific variance source].

### Run summary

| Run | Date | Duration | Requests (total) | Requests (after warm-up) | Notes |
|-----|------|----------|-------------------|--------------------------|-------|
| 1   | 2026-04-10 | 30 min | 1200 | 1113 | — |
| 2   | 2026-04-10 | 30 min | 1200 | 1108 | — |
| ...   | | | | | |

### Primary metrics

| Metric | Config A | Config B | Unit |
|--------|----------|----------|------|
| Runs (n) | 5 | 5 | count |
| Median (p50) | 142 | 203 | ms |
| p95 | 310 | 890 | ms |
| p99 | 580 | 2100 | ms |
| IQR | 85 | 340 | ms |

### Raw data

Raw data for all runs is available in [`data/{service}/{experiment}/`](link).

Section 11: Interpretation — performance experiment additions

### Statistical analysis

- **Effect size**: +43% increase in median latency (Medium)
- **Mann-Whitney U test**: U=2.0, p=0.016 (n_A=5, n_B=5)
- **Bootstrap 95% CI**: [+45ms, +78ms] for difference in medians
- **Conclusion**: The difference is statistically significant and operationally meaningful.

### Confidence statement

This finding is based on N=5 independent runs per configuration with consistent
results across all runs. Evidence level: [Measured].

Retrofitting existing experiments

Published experiments that report single-run performance data should be annotated:

!!! warning "Statistical limitation"
    This experiment reports results from a single execution run. Performance
    metrics are **Observed** (not **Measured**) and should be treated as
    directional only. Re-execution with multiple independent runs is planned
    to achieve statistical confidence.

This annotation does not invalidate existing results — it calibrates reader expectations.

See Also