Statistical Methodology¶
BenchForge's statistical methods are designed for research-grade rigor. This document describes the measurement, recording, and analysis pipeline so that results can be independently verified and correctly interpreted.
Latency Measurement¶
Clock Source¶
All latency measurements use time.perf_counter_ns() — a monotonic,
nanosecond-resolution clock. This is the highest-resolution timer available
in Python and is not affected by system clock adjustments.
Measurement Point¶
Latency is measured externally by the runner, not by the worker:
t0 = time.perf_counter_ns()
worker.execute(step)
t1 = time.perf_counter_ns()
latency = t1 - t0 # nanoseconds
This captures the full round-trip cost including: - Driver-level query preparation - Network round-trip to the database - Query execution on the server - Result deserialization
GC Control¶
Garbage collection is disabled during the measurement window
(gc.disable() / gc.enable()) to prevent GC pauses from contaminating
latency measurements. This is standard practice in microbenchmarking.
HDR Histogram¶
BenchForge records latencies using an in-house HDR (High Dynamic Range) histogram — a log-linear data structure that provides:
- O(1) record — constant-time insertion regardless of value
- O(1) merge — constant-time merge across threads and iterations
- Configurable precision — relative error bounded by
1 / 2^sub_bucket_bits - Zero external dependencies — implemented in pure Python
How It Works¶
The histogram divides the value range into major buckets (powers of 2),
each subdivided into 2^sub_bucket_bits linear sub-buckets:
Major bucket 0: [1, 2) → 2048 sub-buckets
Major bucket 1: [2, 4) → 1024 sub-buckets (lower half shared)
Major bucket 2: [4, 8) → 1024 sub-buckets
...
Major bucket k: [2^k, 2^(k+1)) → 1024 sub-buckets
With the default 3 significant digits, sub-bucket count is 2048, giving a worst-case relative error of ~0.1%. For benchmark latencies (typically microseconds to seconds), this precision far exceeds measurement noise.
Percentile Computation¶
Percentiles are computed by linear scan of the flat counts array:
target_count = ceil(q / 100 * total_count)
running = 0
for idx in range(counts_len):
running += counts[idx]
if running >= target_count:
return value_from_linear_index(idx)
This is O(buckets) but only runs at report time, not during measurement.
Mean and Standard Deviation¶
For mean and stdev, each bucket's contribution uses the midpoint of its value range (value to highest_equivalent) for better accuracy:
Reservoir Sampling¶
For ECDF (Empirical Cumulative Distribution Function) plots, BenchForge maintains a bounded reservoir sample (max 10,000 values) using Vitter's Algorithm R:
- First 10,000 values are stored directly
- For value n (where n > 10,000): replace a random element with
probability
10000 / n
This produces a statistically representative sample regardless of total operation count, with bounded memory usage.
Time-Series Collection¶
Latencies are also bucketed into 1-second windows for time-series analysis:
elapsed_ns = t1 - start_ns
second = int(elapsed_ns // 1_000_000_000)
time_buckets[step_name][second].append(latency)
Each window records:
- ops — number of operations completed
- errors — number of errors
- p50_ns, p95_ns, p99_ns — latency percentiles within the window
Time-series data reveals temporal patterns (warmup effects, periodic GC, connection pool behavior) that aggregate statistics obscure.
Cross-Iteration Statistics¶
When running multiple iterations (experiment.iterations > 1), BenchForge
computes per-step aggregate statistics across iterations:
Metrics¶
For each step metric (ops, throughput, p50, p95, p99, p999):
| Statistic | Formula |
|---|---|
| Mean | Arithmetic mean of per-iteration values |
| Stdev | Sample standard deviation (ddof=1) |
| CV | Coefficient of variation: stdev / mean |
| 95% CI | Bootstrap confidence interval (see below) |
Why Iteration-Level Aggregation?¶
Cross-run statistics are computed over iteration-level metrics (e.g., throughput per iteration), not by pooling all individual operations across iterations. This is because:
- Each iteration is an independent sample
- Pooling operations ignores between-iteration variance
- Iteration-level aggregation correctly captures run-to-run variability
This matches the methodology recommended by database benchmarking literature.
Bootstrap Confidence Intervals¶
BenchForge uses the bootstrap percentile method for confidence intervals:
Algorithm¶
- Given n iteration-level values: [v₁, v₂, ..., vₙ]
- Draw 10,000 bootstrap resamples (sample with replacement, size n)
- Compute the mean of each resample
- Sort the 10,000 bootstrap means
- The 95% CI is the [2.5th percentile, 97.5th percentile] of bootstrap means
Why Bootstrap?¶
- Distribution-free — makes no assumption about the underlying distribution
- Works with small samples — valid even with 3-5 iterations
- Captures asymmetry — unlike t-intervals, the CI can be asymmetric
Reproducibility¶
When a seed is provided, the bootstrap RNG is seeded identically, producing the same confidence intervals across runs.
Single-Iteration Behavior¶
With only 1 iteration, the CI degenerates to a point (low = high = value). For meaningful confidence intervals, use at least 3 iterations; 5+ is recommended for publication-quality results.
Comparison Analysis¶
bench compare computes ratios between baseline and contender:
| Metric | Formula | Interpretation |
|---|---|---|
| p50 ratio | contender.p50 / baseline.p50 | < 1.0 = faster |
| p95 ratio | contender.p95 / baseline.p95 | < 1.0 = faster |
| p99 ratio | contender.p99 / baseline.p99 | < 1.0 = faster |
| Throughput ratio | contender.throughput / baseline.throughput | > 1.0 = faster |
Verdict Thresholds¶
| p95 Ratio | Verdict |
|---|---|
| ≤ 0.95 | Faster — contender is at least 5% faster |
| ≥ 1.05 | Slower — contender is at least 5% slower |
| 0.95 < ratio < 1.05 | Same — within noise margin |
Limitations and Caveats¶
-
Closed-loop model: BenchForge drives queries back-to-back (no think time). This measures peak throughput, not typical application behavior.
-
Python GIL: Concurrent worker threads share the GIL. CPU-bound operations may not scale linearly with concurrency. For I/O-bound database queries, this is typically not a bottleneck.
-
Coordinated omission: Because BenchForge does not inject artificial delays, it does not suffer from coordinated omission in the classic sense. However, results represent "service time," not "response time" under load.
-
Histogram quantization: Latency values are quantized to histogram bucket boundaries. With 3 significant digits, the worst-case relative error is ~0.1%, which is negligible compared to typical measurement noise.