
Common Metric Misreads

Specific cases where Azure platform metrics are misinterpreted, leading to incorrect conclusions or misdirected troubleshooting.

1. Plan-level vs. instance-level metrics

Metric: App Service plan CPU or memory percentage.

Common misread: "The plan is at 85% CPU, so all apps are under pressure."

Correct interpretation: Plan-level metrics are aggregated across all instances. A single hot instance at 100% CPU and three idle instances at 0% will show 25% at the plan level — hiding the problem. Conversely, plan-level 85% may mean all instances are evenly loaded at 85%, or one instance is saturated while others are moderate.

Action: Always check per-instance metrics. In Azure Monitor, apply splitting on the instance dimension to see the per-instance distribution.
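The masking effect is just averaging. A minimal sketch with hypothetical per-instance CPU samples (instance names are illustrative):

```python
# Hypothetical per-instance CPU samples (%); names are illustrative.
instances = {"RD001": 100.0, "RD002": 0.0, "RD003": 0.0, "RD004": 0.0}

# Plan-level metric = mean across instances.
plan_level = sum(instances.values()) / len(instances)
hottest = max(instances.values())

print(f"plan-level CPU: {plan_level:.0f}%")  # 25% -- looks healthy
print(f"hottest instance: {hottest:.0f}%")   # 100% -- saturated
```

One saturated instance among four idle-looking ones reports a comfortable 25% at the plan level.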

2. Average vs. percentile response time

Metric: Average response time in Application Insights or Azure Monitor.

Common misread: "Average response time is 200ms, so performance is fine."

Correct interpretation: Average hides tail latency. If 99% of requests complete in 50ms but 1% take 15 seconds, the average may look acceptable while 1% of users experience severe degradation. At scale, 1% can mean thousands of affected requests per hour.

Action: Always check p95 and p99 latency. Use Application Insights percentile queries or KQL percentile() functions.
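To see how the average hides the tail, here is a synthetic sample matching the numbers above (990 fast requests, 10 pathological ones):

```python
import statistics

# Synthetic latency sample: 99% complete in 50ms, 1% take 15 seconds.
latencies_ms = [50.0] * 990 + [15000.0] * 10

avg = statistics.mean(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile cut point

print(f"average: {avg:.0f}ms")  # ~200ms, looks fine
print(f"p99:     {p99:.0f}ms")  # orders of magnitude above the average
```

The same distribution viewed through `percentile()` in a KQL query would show the identical gap between the mean and p99.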

3. CPU percentage on multi-core instances

Metric: CPU percentage on an instance with multiple cores.

Common misread: "CPU is at 50%, so there's plenty of headroom."

Correct interpretation: On a 2-core instance, 50% CPU can mean one core is fully saturated while the other is idle. If the application is single-threaded (common in Node.js, Python without multiprocessing), the saturated core is the bottleneck. The aggregate metric hides core-level saturation.

Action: Check per-core utilization where available (e.g. from /proc/stat in procfs). For single-threaded runtimes, treat 100% / core_count as the practical ceiling on the aggregate metric.
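A sketch of the per-core check, parsing /proc/stat-format text. A real check needs two samples and deltas between them; for illustration this treats the cumulative counters as the measurement window:

```python
def per_core_busy(stat_text: str) -> dict:
    """Return busy fraction per core from /proc/stat-style content.
    Real use requires deltas between two samples over an interval."""
    busy = {}
    for line in stat_text.splitlines():
        # Per-core lines look like "cpu0 ...", "cpu1 ..."; skip aggregate "cpu ".
        if line.startswith("cpu") and line[3:4].isdigit():
            name, *fields = line.split()
            vals = [int(v) for v in fields]
            idle = vals[3] + vals[4]  # idle + iowait columns
            busy[name] = 1 - idle / sum(vals)
    return busy

# Hypothetical snapshot: cpu0 saturated, cpu1 idle -- aggregate reads "50%".
sample = """\
cpu  1000 0 1000 2000 0 0 0 0 0 0
cpu0 1000 0 1000 0 0 0 0 0 0 0
cpu1 0 0 0 2000 0 0 0 0 0 0
"""
print(per_core_busy(sample))  # {'cpu0': 1.0, 'cpu1': 0.0}
```

Here the aggregate 50% is exactly the single-threaded-bottleneck case: one core pinned, one core idle.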

4. Request count including platform probes

Metric: Total request count in App Service or Container Apps metrics.

Common misread: "We're getting 10,000 requests/minute" (used to size infrastructure or calculate per-request cost).

Correct interpretation: The request count may include health check probes, platform keep-alive pings, and ARR affinity checks. These are not customer-initiated requests. On a plan with frequent health probes, platform traffic can be a significant fraction of total request count.

Action: Filter by URL path or user agent to exclude health probe requests. Check if health check frequency multiplied by instance count accounts for the unexpected volume.
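A minimal sketch of the filtering step, on hypothetical parsed log entries (field names and the probe path list are illustrative; App Service Linux containers commonly also see a platform warm-up ping at /robots933456.txt):

```python
# Hypothetical parsed request log entries; field names are illustrative.
requests = [
    {"path": "/api/orders", "user_agent": "Mozilla/5.0"},
    {"path": "/healthz",    "user_agent": "HealthCheck/1.0"},
    {"path": "/healthz",    "user_agent": "HealthCheck/1.0"},
    {"path": "/api/orders", "user_agent": "Mozilla/5.0"},
]

# Paths to exclude; adjust to your configured health check path.
PROBE_PATHS = {"/healthz", "/robots933456.txt"}

customer = [r for r in requests if r["path"] not in PROBE_PATHS]
print(f"total={len(requests)} customer={len(customer)}")  # total=4 customer=2
```

For the volume check: expected probe requests per minute is roughly instance_count × (60 / probe_interval_seconds), summed over each configured probe.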

5. Memory "available" vs. "committed"

Metric: Memory available or memory percentage on App Service.

Common misread: "Only 200MB available out of 1.75GB — we're almost out of memory."

Correct interpretation: "Available" memory on Linux includes buffer and page cache that the kernel can reclaim under pressure, so a low "available" figure does not necessarily mean the application is short of memory. The application's resident set size (RSS) is a better indicator of actual application memory consumption.

Action: Check MemAvailable, Buffers, Cached, and application RSS via procfs. Compare with cgroup memory usage (memory.usage_in_bytes minus total_inactive_file in cgroup v1).
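A sketch of reading these fields from /proc/meminfo-format text (the snapshot values are hypothetical, sized to match the 1.75GB example above):

```python
def meminfo_kib(text: str) -> dict:
    """Parse /proc/meminfo-style 'Key: value kB' lines into {key: KiB}."""
    out = {}
    for line in text.splitlines():
        key, rest = line.split(":", 1)
        out[key] = int(rest.split()[0])
    return out

# Hypothetical snapshot: MemFree looks alarming, but most memory in use
# is reclaimable cache, and MemAvailable reflects that.
sample = """\
MemTotal:       1835008 kB
MemFree:         204800 kB
MemAvailable:   1126400 kB
Buffers:         102400 kB
Cached:          921600 kB
"""
info = meminfo_kib(sample)
reclaimable = info["Buffers"] + info["Cached"]
print(f"MemFree={info['MemFree']} KiB, reclaimable cache ~{reclaimable} KiB, "
      f"MemAvailable={info['MemAvailable']} KiB")
```

The same parser pattern works for cgroup files, e.g. subtracting total_inactive_file from memory.usage_in_bytes in cgroup v1 to approximate the working set.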

6. Time granularity masking spikes

Metric: Any metric viewed at 5-minute or 1-hour aggregation.

Common misread: "CPU never exceeded 60% during the incident window."

Correct interpretation: A 10-second CPU spike to 100% that causes request timeouts will be averaged down to a much lower value at 5-minute granularity. The spike is real and caused user impact, but it is invisible at coarse time resolution.

Action: Use the finest available granularity (1-minute in Azure Monitor, or sub-minute via custom metrics / procfs polling). For incident investigation, collect high-resolution data during reproduction.
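The arithmetic of the averaging effect, for a hypothetical 10-second spike sampled once per second inside a 5-minute window:

```python
# 300 one-second samples: baseline 20% with a 10-second spike to 100%.
samples = [20.0] * 290 + [100.0] * 10

five_min_avg = sum(samples) / len(samples)

print(f"5-minute average: {five_min_avg:.1f}%")  # ~22.7% -- spike invisible
print(f"true peak:        {max(samples):.0f}%")  # 100%
```

The spike that caused the timeouts contributes less than three percentage points to the 5-minute average.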

7. Flat memory percentage interpreted as stable risk

Metric: App Service MemoryPercentage stays near a flat plateau (for example, ~85%).

Common misread: "Memory stopped rising, so pressure is not worsening."

Correct interpretation: The memory-pressure experiment observed plateau behavior while reclaim/swap pressure increased and startup reliability degraded. Flat plan memory can coexist with worsening risk.

Action: Pair plan memory with startup latency trend, restart events, and reclaim/swap evidence before concluding stability.

Experiment: Memory Pressure
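One way to gather the reclaim/swap evidence is from /proc/vmstat counters; rising deltas between samples indicate pressure even while RSS and plan memory stay flat. A sketch, parsing vmstat-format text (sample values are hypothetical):

```python
def swap_activity(vmstat_text: str) -> dict:
    """Pull swap and major-fault counters from /proc/vmstat-style text.
    Compare two samples over time: rising deltas mean reclaim/swap pressure."""
    wanted = {"pswpin", "pswpout", "pgmajfault"}
    return {k: int(v)
            for k, v in (line.split() for line in vmstat_text.splitlines())
            if k in wanted}

sample = "pswpin 1200\npswpout 53000\npgmajfault 8100\n"
print(swap_activity(sample))  # {'pswpin': 1200, 'pswpout': 53000, 'pgmajfault': 8100}
```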

8. RestartCount interpreted as complete OOM detector

Metric: RestartCount or replica restart state in Container Apps.

Common misread: "RestartCount is 0, so no OOM event happened."

Correct interpretation: In worker-level OOM scenarios, PID 1 can survive while workers are repeatedly SIGKILLed. Restart metrics stay flat even though requests fail and workers churn.

Action: Use console logs as primary evidence for OOM (SIGKILL, worker boot churn) and treat restart metrics as container-lifecycle only.

Experiment: OOM Visibility Gap
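A sketch of treating console logs as the primary OOM evidence: count SIGKILL and worker-boot events. The log format here is a hypothetical gunicorn-style example; adjust patterns to your runtime:

```python
import re

# Hypothetical console log excerpt (gunicorn-style worker manager).
log = """\
[2024-05-01 12:00:01] [1] [WARNING] Worker with pid 42 was terminated due to signal 9
[2024-05-01 12:00:02] [1] [INFO] Booting worker with pid: 57
[2024-05-01 12:03:11] [1] [WARNING] Worker with pid 57 was terminated due to signal 9
[2024-05-01 12:03:12] [1] [INFO] Booting worker with pid: 71
"""

kills = len(re.findall(r"terminated due to signal 9", log))
boots = len(re.findall(r"Booting worker with pid", log))

# Worker churn is visible here even when RestartCount stays at 0,
# because PID 1 (the worker manager) never exits.
print(f"SIGKILLs={kills} worker boots={boots}")
```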

9. 1-minute WorkingSet averages interpreted as true peak memory

Metric: WorkingSetBytes at PT1M resolution.

Common misread: "Peak usage was only ~200MB, well below a 0.5Gi limit."

Correct interpretation: One-minute averaging can hide short OOM-adjacent peaks. The OOM experiment captured near-limit RSS in app logs while platform metrics reported much lower averaged values.

Action: Correlate metric points with high-frequency app memory logs around incident windows.

Experiment: OOM Visibility Gap
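The same averaging arithmetic as in misread 6 applies to memory. A hypothetical minute of per-second RSS samples with one short near-limit peak:

```python
# 60 one-second RSS samples (MB) against a 512 MB (0.5Gi) limit:
# a 5-second peak near the limit inside an otherwise quiet minute.
rss_mb = [180.0] * 55 + [500.0] * 5

one_min_avg = sum(rss_mb) / len(rss_mb)

print(f"PT1M average: {one_min_avg:.0f} MB")  # ~207 MB, "well below limit"
print(f"true peak:    {max(rss_mb):.0f} MB")  # near the 512 MB limit
```

The metric point reports ~207 MB; the app's own high-frequency memory log is what shows the near-limit excursion.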

10. Health state interpreted as immediate traffic routing truth

Metric: Instance/revision health state shown in control-plane APIs.

Common misread: "State says UNKNOWN/STOPPED, so it cannot be serving traffic."

Correct interpretation: Health-check eviction tests showed that state transitions can lag behind actual routing changes, and that the reported state encodes the control plane's own interpretation, which can differ from observed routing behavior.

Action: Validate with live traffic distribution sampling and request logs, not state alone.

Experiment: Health Check Eviction
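A sketch of the sampling side: collect an instance identifier from repeated live requests (for example from an echo endpoint or a response header that carries the instance ID) and tally the distribution. The observed IDs below are hypothetical stand-ins for real samples:

```python
from collections import Counter

# Hypothetical instance IDs observed across repeated live requests;
# in practice, sample an endpoint that echoes the serving instance.
observed = ["inst-a", "inst-b", "inst-a", "inst-c", "inst-a", "inst-b"]

distribution = Counter(observed)
print(distribution)
# If an instance reported as UNKNOWN/STOPPED appears here, it is still
# receiving traffic regardless of the control-plane state.
```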

11. Cold-start latency attributed mainly to image pull size

Metric: Image size and pull duration during scale-up.

Common misread: "If we shrink image size, cold starts will mostly disappear."

Correct interpretation: Scale-to-zero results showed scheduling and container initialization dominated total cold request latency, with image pull as a smaller component in same-region ACR tests.

Action: Measure full timeline (schedule, pull, start, first response) before prioritizing image-only optimization.

Experiment: Scale-to-Zero 503
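A sketch of the full-timeline measurement, using hypothetical event timestamps (phase names are illustrative; populate them from platform and app logs):

```python
from datetime import datetime

# Hypothetical cold-start event timestamps from platform/app logs.
events = {
    "scale_decision":  "12:00:00.0",
    "scheduled":       "12:00:18.5",
    "image_pulled":    "12:00:24.0",
    "container_start": "12:00:25.0",
    "first_response":  "12:00:41.0",
}
t = {k: datetime.strptime(v, "%H:%M:%S.%f") for k, v in events.items()}

# Print the duration of each phase in order.
phases = list(t)
for a, b in zip(phases, phases[1:]):
    print(f"{a} -> {b}: {(t[b] - t[a]).total_seconds():.1f}s")
```

In this illustrative timeline the image pull is 5.5s of a ~41s cold request; shrinking the image alone would leave most of the latency in place.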

12. Probe failure counts interpreted without startup budget math

Metric: Number of startup/readiness/liveness probe failures.

Common misread: "Probe failures are random platform noise."

Correct interpretation: Startup-probe experiment showed failures were deterministic when effective budget was below true startup duration.

Action: Compute startup probe budget and compare it to measured app initialization time before escalating platform instability.

Experiment: Startup Probes
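The budget math is simple enough to do inline. A sketch with hypothetical probe settings (Kubernetes-style semantics: the app must start before initial delay plus failureThreshold × period elapses):

```python
# Hypothetical startup-probe settings.
initial_delay_s = 5
period_s = 10
failure_threshold = 3

# Worst-case time before the platform declares startup failed.
budget_s = initial_delay_s + failure_threshold * period_s

measured_startup_s = 42  # hypothetical: from app logs, time until listening

print(f"probe budget: {budget_s}s, measured startup: {measured_startup_s}s")
if measured_startup_s > budget_s:
    print("deterministic failure: budget is below the true startup duration")
```

When the measured startup time exceeds the budget, every cold start fails the probe; the failures are arithmetic, not platform noise.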

Recommendation

Note

When possible, cross-reference Azure Monitor metrics with procfs/cgroup data and Application Insights traces. Each data source has its own aggregation, sampling, and scope characteristics. Disagreement between sources is a diagnostic signal, not an error.

See also: