AKS Observability¶
Monitoring Azure Kubernetes Service (AKS) requires a multi-layered approach covering the control plane, nodes, and containerized workloads. This is achieved through Container Insights, Azure Monitor Managed Service for Prometheus, and Azure Managed Grafana.
Data Flow Diagram¶
graph TD
subgraph Cluster[AKS Cluster]
Pod[Pod/Container] -->|Stdout/Stderr| Kubelet[Kubelet]
Pod -->|Metrics| PromAgent[Prometheus Collector]
Node[Node] -->|System Metrics| Kubelet
end
Kubelet -->|Logs| CI[Container Insights Agent]
CI -->|HTTPS| LAW[Log Analytics Workspace]
PromAgent -->|Remote Write| AMW[Azure Monitor Workspace]
AMW -->|Visualization| Grafana[Azure Managed Grafana]
LAW -->|Queries| KQL[KQL Queries]
Monitoring Components¶
- Container Insights: Collects memory and processor metrics from controllers, nodes, and containers. It also captures container stdout/stderr logs and Kubernetes events.
- Managed Prometheus: A fully managed service compatible with the CNCF Prometheus project. It scrapes metrics from your nodes and pods and stores them in an Azure Monitor workspace.
- Azure Managed Grafana: Provides pre-built dashboards for visualizing Prometheus metrics and Container Insights data.
Configuration Examples¶
Enabling Container Insights via CLI¶
To enable the monitoring addon for an existing AKS cluster and link it to a Log Analytics workspace:
az aks enable-addons \
--resource-group "my-resource-group" \
--name "my-aks-cluster" \
--addons monitoring \
--workspace-resource-id "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.OperationalInsights/workspaces/{workspaceName}"
Enabling Managed Prometheus via CLI¶
az aks update \
--resource-group "my-resource-group" \
--name "my-aks-cluster" \
--enable-azure-monitor-metrics
KQL Query Examples¶
Search Pod Logs¶
Retrieve stdout/stderr logs for a specific pod.
ContainerLogV2
| where TimeGenerated > ago(1h)
| where PodName == "my-pod-name"
| project TimeGenerated, LogSource, LogMessage
| order by TimeGenerated desc
Monitor Node CPU Usage¶
Analyze average CPU utilization across all nodes in the cluster. Container Insights stores node performance counters in the Perf table, not in KubeNodeInventory.
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SNode" and CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| render timechart
Find Failed Pods¶
List pods that are not in a 'Running' state. 'Succeeded' is also excluded so that completed jobs are not flagged as failures.
KubePodInventory
| where TimeGenerated > ago(1h)
| where PodStatus !in ("Running", "Succeeded")
| summarize count() by PodStatus, Name
Correlate Restarts With Namespace¶
Use KubePodInventory to find pods that restart frequently even if they eventually return to Running.
KubePodInventory
| where TimeGenerated > ago(24h)
| summarize RestartCount = max(ContainerRestartCount) by Namespace, Name, ClusterName
| where RestartCount > 3
| order by RestartCount desc
Surface Kubernetes Events That Need Immediate Review¶
This query helps operators focus on warning events such as failed scheduling, image pull problems, or probe failures.
KubeEvents
| where TimeGenerated > ago(6h)
| where KubeEventType == "Warning"
| project TimeGenerated, Namespace, ObjectKind, Name, Reason, Message
| order by TimeGenerated desc
Sample Output¶
TimeGenerated Namespace ObjectKind Name Reason Message
------------------------- ------------ ----------- -------------------------------- ----------------- ------------------------------------------
2026-04-06T01:15:00Z payments Pod payments-api-7df5b7c9b6-k2n4q Unhealthy Readiness probe failed: HTTP 500
2026-04-06T01:12:00Z kube-system Pod coredns-6d4b75cb6d-8m2zt BackOff Back-off restarting failed container
2026-04-06T01:08:00Z orders Pod orders-worker-5f8f9d9758-2s9hm FailedScheduling 0/3 nodes are available: 3 Insufficient cpu
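When the warning list is long, it can help to aggregate by reason before drilling into individual events. A sketch, assuming the default Container Insights KubeEvents schema:

```kusto
// Rank warning reasons by volume and namespace to find the dominant failure mode
KubeEvents
| where TimeGenerated > ago(6h)
| where KubeEventType == "Warning"
| summarize EventCount = count() by Reason, Namespace
| order by EventCount desc
```

A spike in a single reason (for example FailedScheduling in one namespace) usually points at a capacity or quota problem rather than an application bug.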
Practical Monitoring Baseline¶
For production AKS, treat observability as three separate but connected layers:
- Platform health
- Control plane availability
- Node readiness and kubelet health
- Cluster autoscaler behavior
- Workload health
- Pod restarts
- Deployment rollout failures
- Namespace-specific error rates
- Application health
- Container logs
- Service latency
- SLO/SLA metrics exported through Prometheus
If only one layer is enabled, incidents usually take longer to triage. For example, a CPU spike on nodes may look like an application issue until KubeEvents and pod restart data are reviewed together.
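That correlation can be expressed directly in KQL. This sketch (assuming the default Container Insights schemas; the restart threshold of 3 is an illustrative choice) joins frequently restarting pods with recent warning events:

```kusto
// Pods restarting more than 3 times in the last hour
let RestartingPods = KubePodInventory
    | where TimeGenerated > ago(1h)
    | summarize Restarts = max(ContainerRestartCount) by Name, Namespace
    | where Restarts > 3;
// Warning events that involve those same pods
KubeEvents
| where TimeGenerated > ago(1h)
| where KubeEventType == "Warning"
| join kind=inner (RestartingPods) on Name
| project TimeGenerated, Namespace, Name, Restarts, Reason, Message
| order by TimeGenerated desc
```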
Recommended Setup Sequence¶
1. Verify addon and workspace connectivity¶
az aks show \
--resource-group "my-resource-group" \
--name "my-aks-cluster" \
--query "addonProfiles.omsagent"
Sample output:
{
"config": {
"logAnalyticsWorkspaceResourceID": "/subscriptions/<subscription-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod"
},
"enabled": true,
"identity": {
"clientId": "system-assigned"
}
}
2. Confirm managed Prometheus is enabled¶
az aks show \
--resource-group "my-resource-group" \
--name "my-aks-cluster" \
--query "azureMonitorProfile.metrics"
Sample output:
{
  "enabled": true,
  "kubeStateMetrics": {
    "metricAnnotationsAllowList": "",
    "metricLabelsAllowlist": ""
  }
}
3. Review data collection before an incident¶
az monitor log-analytics query \
--workspace "<workspace-customer-id>" \
--analytics-query "KubeNodeInventory | summarize Nodes=dcount(Computer)" \
--output table
Sample output:
Nodes
-------
3
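Beyond node inventory, it is worth confirming that container logs themselves are still arriving from every node. A sketch against the default ContainerLogV2 schema:

```kusto
// Last log timestamp and volume per node; a stale LastLog indicates a collection gap
ContainerLogV2
| where TimeGenerated > ago(15m)
| summarize LastLog = max(TimeGenerated), LogLines = count() by Computer
| order by LastLog desc
```

Running this during a calm period establishes a baseline, so a missing node is obvious during an incident.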
Alerting Patterns¶
Metric alerts are useful for rapid detection, while log alerts are better for Kubernetes-specific state and event analysis.
Metric alert for node CPU saturation¶
az monitor metrics alert create \
--name "aks-node-cpu-high" \
--resource-group "my-resource-group" \
--scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.ContainerService/managedClusters/my-aks-cluster" \
--condition "avg node_cpu_usage_percentage > 85" \
--description "AKS node CPU usage is above 85 percent" \
--window-size "5m" \
--evaluation-frequency "1m" \
--severity 2 \
--action "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
Scheduled query alert for failed pods¶
az monitor scheduled-query create \
--name "aks-failed-pods" \
--resource-group "my-resource-group" \
--scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
--condition "count 'KubePodInventory | where TimeGenerated > ago(5m) | where PodStatus in (\"Failed\", \"Unknown\")' > 0" \
--description "One or more AKS pods entered Failed or Unknown state" \
--evaluation-frequency "5m" \
--window-size "5m" \
--severity 2 \
--action-groups "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
What to alert on first¶
- High priority
- Node NotReady
- Repeated pod restarts
- Image pull failures
- Persistent volume mount failures
- Medium priority
- Namespace-specific latency increase
- HPA scaling limits reached
- Elevated warning event volume
- Low priority / dashboard only
- Temporary pod rescheduling during rollouts
- Short-lived cluster autoscaler adjustments
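Several of the high-priority signals above surface as Kubernetes warning events. A sketch that counts them; the Reason values listed are common ones but vary by Kubernetes version, so adjust the list to match your environment:

```kusto
// Count high-priority failure events (image pulls, mounts, scheduling) per object
KubeEvents
| where TimeGenerated > ago(15m)
| where KubeEventType == "Warning"
| where Reason in ("Failed", "BackOff", "FailedScheduling", "FailedMount", "FailedAttachVolume")
| summarize Occurrences = count() by Reason, Namespace, Name
| order by Occurrences desc
```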
Operational Queries for Triage¶
Namespace error hunt from container logs¶
ContainerLogV2
| where TimeGenerated > ago(30m)
| where PodNamespace == "payments"
| where LogMessage has_any ("ERROR", "Exception", "timeout", "connection refused")
| project TimeGenerated, PodName, ContainerName, LogMessage
| order by TimeGenerated desc
Detect nodes with pressure conditions¶
KubeNodeInventory
| where TimeGenerated > ago(30m)
| summarize arg_max(TimeGenerated, Status) by ClusterName, Computer
| where Status has_any ("DiskPressure", "MemoryPressure", "PIDPressure")
| project ClusterName, Computer, Status
Identify noisy namespaces by log volume¶
ContainerLogV2
| where TimeGenerated > ago(1h)
| summarize LogLines=count() by PodNamespace
| order by LogLines desc
Sample output:
PodNamespace  LogLines
------------- --------
payments      18234
orders        9411
ingress-nginx 5228
monitoring    1950
Dashboard and Workbook Ideas¶
Build one AKS workbook per cluster or environment with the following sections:
- Cluster overview
- Node count
- Ready vs NotReady nodes
- Total running pods
- Workload stability
- Top restarting pods
- Warning events by namespace
- Failed deployments in the last 24 hours
- Resource pressure
- CPU and memory utilization by node pool
- Pending pods caused by capacity constraints
- Application symptoms
- Error log trend by namespace
- Latency histograms from Prometheus or Application Insights
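For the cluster-overview tiles, a Ready vs NotReady count can be derived from the latest KubeNodeInventory record per node. A sketch:

```kusto
// Take the most recent status per node, then count nodes in each state
KubeNodeInventory
| where TimeGenerated > ago(10m)
| summarize arg_max(TimeGenerated, Status) by Computer
| summarize Nodes = count() by Status
```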
Troubleshooting Checklist¶
When alerts fire, review evidence in this order:
- Cluster-wide state
- Are nodes healthy?
- Is a node pool unavailable?
- Kubernetes events
- Are there scheduling or image pull warnings?
- Pod stability
- Are restarts increasing?
- Is a rollout stuck?
- Container logs
- Is the problem isolated to one namespace or deployment?
- Prometheus service metrics
- Do request rate, error rate, and latency confirm customer impact?
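For the "is a rollout stuck" step, counting pods held in Pending per namespace is a quick signal. A sketch using the latest inventory record per pod:

```kusto
// Latest status per pod; persistent Pending usually means capacity or scheduling constraints
KubePodInventory
| where TimeGenerated > ago(30m)
| summarize arg_max(TimeGenerated, PodStatus) by Name, Namespace
| where PodStatus == "Pending"
| summarize PendingPods = count() by Namespace
```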
Cost and Retention Notes¶
- Container logs can become the largest AKS monitoring cost driver.
- Retain verbose stdout/stderr only as long as needed for incident response.
- Use Prometheus metrics for high-frequency infrastructure trends instead of exporting every event as logs.
- Keep log alerts focused on state changes and failure signals rather than every warning line.
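To verify which data types actually drive ingestion cost, the workspace's own Usage table can be summarized (Quantity is reported in MB):

```kusto
// Billable ingestion per data type over the last day, in GB
Usage
| where TimeGenerated > ago(1d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024.0 by DataType
| order by IngestedGB desc
```

If ContainerLogV2 dominates this list, tightening stdout/stderr retention or filtering noisy namespaces is usually the first lever to pull.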