Skip to content

Evidence Levels

A core differentiator of this repository is the explicit tagging of evidence levels. Every claim in an experiment is categorized by the strength of its supporting evidence.

Tag definitions

Tag Definition Usage
Observed Directly seen in logs, metrics, or system behavior during the experiment Facts that any observer would agree on
Measured Quantitatively confirmed with specific numerical values Precise data points with units and timestamps
Correlated Two or more signals changed together, but a causal link is not established Temporal or statistical association only
Inferred A reasonable conclusion drawn from observations, using domain knowledge Logical deduction, not direct observation
Strongly Suggested Multiple evidence sources point to the same conclusion, but definitive proof is missing High confidence, but alternative explanations exist
Not Proven The hypothesis was tested but the evidence did not confirm it Negative or inconclusive result
Unknown The available data is insufficient to make any determination Honest acknowledgment of a gap

Examples

Observed:

"During the test window, the container restart count incremented from 0 to 3."

Measured:

"p99 response time increased from 120 ms to 3,400 ms between 14:02 and 14:08 UTC."

Correlated:

"The memory increase from 60% to 92% coincided with the response time degradation, but both could be caused by an independent third factor."

Inferred:

"Given that the OOM killer log entry appeared 2 seconds before the container restart, the restart was most likely triggered by the OOM kill."

Strongly Suggested:

"All observed indicators — swap usage increase, page fault rate spike, and CPU system time rise — are consistent with swap thrashing as the root cause. However, we did not directly measure disk I/O to the swap device."

Not Proven:

"We hypothesized that the timeout was platform-side, but the experiment showed the worker was still processing the request at the time of the 504 response."

Unknown:

"The root cause of the intermittent 10-second delay between request receipt and handler invocation could not be determined from the available telemetry."

Why this matters

In support scenarios, the distinction between what is observed and what is inferred has direct consequences:

  • Over-claiming erodes trust. Telling a customer "the platform caused this" when the evidence only suggests correlation can damage credibility and lead to incorrect remediation.
  • Under-claiming wastes time. Refusing to draw any conclusion when the evidence strongly suggests a cause forces unnecessary data collection cycles.
  • Calibrated confidence builds trust. Stating "the data is consistent with X, and we recommend investigating Y to confirm" gives customers a clear next step without overcommitting.

Usage in experiments

In the Interpretation section of each experiment, tag significant claims:

The response time increase [Measured] coincided with the memory pressure event [Correlated].
Based on the OOM kill log entry and subsequent restart timing, the restart was most likely
triggered by memory exhaustion [Inferred].

In the What this proves / What this does NOT prove sections, evidence tags help enforce discipline:

  • "What this proves" should contain only Observed, Measured, or Strongly Suggested claims
  • "What this does NOT prove" should reference Correlated, Not Proven, or Unknown items