AKS Observability¶
Monitoring Azure Kubernetes Service (AKS) requires a multi-layered approach covering the control plane, nodes, and containerized workloads. This is achieved through Container Insights, Azure Monitor Managed Service for Prometheus, and Azure Managed Grafana.
Data Flow Diagram¶
graph TD
subgraph Cluster[AKS Cluster]
Pod[Pod/Container] -->|Stdout/Stderr| Kubelet[Kubelet]
Pod -->|Metrics| PromAgent[Prometheus Collector]
Node[Node] -->|System Metrics| Kubelet
end
Kubelet -->|Logs| CI[Container Insights Agent]
CI -->|HTTPS| LAW[Log Analytics Workspace]
PromAgent -->|Remote Write| AMW[Azure Monitor Workspace]
AMW -->|Visualization| Grafana[Azure Managed Grafana]
LAW -->|Queries| KQL[KQL Queries]
Monitoring Components¶
- Container Insights: Collects memory and processor metrics from controllers, nodes, and containers. It also captures container stdout/stderr logs and Kubernetes events.
- Managed Prometheus: A fully managed service compatible with the CNCF Prometheus project. It scrapes metrics from your nodes and pods and stores them in an Azure Monitor workspace.
- Azure Managed Grafana: Provides pre-built dashboards for visualizing Prometheus metrics and Container Insights data.
Configuration Examples¶
Enabling Container Insights via CLI¶
To enable the monitoring addon for an existing AKS cluster and link it to a Log Analytics workspace:
az aks enable-addons \
--resource-group "my-resource-group" \
--name "my-aks-cluster" \
--addons monitoring \
--workspace-resource-id "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.OperationalInsights/workspaces/{workspaceName}"
Enabling Managed Prometheus via CLI¶
az aks update \
--resource-group "my-resource-group" \
--name "my-aks-cluster" \
--enable-azure-monitor-metrics
KQL Query Examples¶
Search Pod Logs¶
Retrieve stdout/stderr logs for a specific pod.
ContainerLogV2
| where TimeGenerated > ago(1h)
| where PodName == "my-pod-name"
| project TimeGenerated, LogSource, LogMessage
| order by TimeGenerated desc
Monitor Node CPU Usage¶
Analyze average CPU utilization across all nodes in the cluster. Container Insights stores node performance counters in the Perf table, not in KubeNodeInventory.
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SNode" and CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| render timechart
Find Failed Pods¶
List pods that are not in a 'Running' state. 'Succeeded' is also excluded so that completed jobs are not flagged as failures.
KubePodInventory
| where TimeGenerated > ago(1h)
| where PodStatus !in ("Running", "Succeeded")
| summarize count() by PodStatus, Name
Correlate Restarts With Namespace¶
Use KubePodInventory to find pods that restart frequently even if they eventually return to Running.
KubePodInventory
| where TimeGenerated > ago(24h)
| summarize RestartCount = max(ContainerRestartCount) by Namespace, Name, ClusterName
| where RestartCount > 3
| order by RestartCount desc
Surface Kubernetes Events That Need Immediate Review¶
This query helps operators focus on warning events such as failed scheduling, image pull problems, or probe failures.
KubeEvents
| where TimeGenerated > ago(6h)
| where KubeEventType == "Warning"
| project TimeGenerated, Namespace, ObjectKind, Name, Reason, Message
| order by TimeGenerated desc
Sample Output¶
TimeGenerated Namespace ObjectKind Name Reason Message
------------------------- ------------ ----------- -------------------------------- ----------------- ------------------------------------------
2026-04-06T01:15:00Z payments Pod payments-api-7df5b7c9b6-k2n4q Unhealthy Readiness probe failed: HTTP 500
2026-04-06T01:12:00Z kube-system Pod coredns-6d4b75cb6d-8m2zt BackOff Back-off restarting failed container
2026-04-06T01:08:00Z orders Pod orders-worker-5f8f9d9758-2s9hm FailedScheduling 0/3 nodes are available: 3 Insufficient cpu
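When the warning list is long, it can help to aggregate by reason before drilling into individual events. A sketch, assuming the default Container Insights KubeEvents schema:

```kusto
// Rank warning reasons by volume and namespace to find the dominant failure mode
KubeEvents
| where TimeGenerated > ago(6h)
| where KubeEventType == "Warning"
| summarize EventCount = count() by Reason, Namespace
| order by EventCount desc
```

A spike in a single reason (for example FailedScheduling in one namespace) usually points at a capacity or quota problem rather than an application bug.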
Practical Monitoring Baseline¶
For production AKS, treat observability as three separate but connected layers:
- Platform health
- Control plane availability
- Node readiness and kubelet health
- Cluster autoscaler behavior
- Workload health
- Pod restarts
- Deployment rollout failures
- Namespace-specific error rates
- Application health
- Container logs
- Service latency
- SLO/SLA metrics exported through Prometheus
If only one layer is enabled, incidents usually take longer to triage. For example, a CPU spike on nodes may look like an application issue until KubeEvents and pod restart data are reviewed together.
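That correlation can be expressed directly in KQL. This sketch (assuming the default Container Insights schemas; the restart threshold of 3 is an illustrative choice) joins frequently restarting pods with recent warning events:

```kusto
// Pods restarting more than 3 times in the last hour
let RestartingPods = KubePodInventory
    | where TimeGenerated > ago(1h)
    | summarize Restarts = max(ContainerRestartCount) by Name, Namespace
    | where Restarts > 3;
// Warning events that involve those same pods
KubeEvents
| where TimeGenerated > ago(1h)
| where KubeEventType == "Warning"
| join kind=inner (RestartingPods) on Name
| project TimeGenerated, Namespace, Name, Restarts, Reason, Message
| order by TimeGenerated desc
```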
Recommended Setup Sequence¶
1. Verify addon and workspace connectivity¶
az aks show \
--resource-group "my-resource-group" \
--name "my-aks-cluster" \
--query "addonProfiles.omsagent"
Sample output:
{
"config": {
"logAnalyticsWorkspaceResourceID": "/subscriptions/<subscription-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod"
},
"enabled": true,
"identity": {
"clientId": "system-assigned"
}
}
2. Confirm managed Prometheus is enabled¶
az aks show \
--resource-group "my-resource-group" \
--name "my-aks-cluster" \
--query "azureMonitorProfile.metrics"
Sample output:
{
  "enabled": true,
  "kubeStateMetrics": {
    "metricAnnotationsAllowList": "",
    "metricLabelsAllowlist": ""
  }
}
3. Review data collection before an incident¶
az monitor log-analytics query \
--workspace "<workspace-customer-id>" \
--analytics-query "KubeNodeInventory | summarize Nodes=dcount(Computer)" \
--output table
Sample output:
Nodes
-------
3
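Beyond node inventory, it is worth confirming that container logs themselves are still arriving from every node. A sketch against the default ContainerLogV2 schema:

```kusto
// Last log timestamp and volume per node; a stale LastLog indicates a collection gap
ContainerLogV2
| where TimeGenerated > ago(15m)
| summarize LastLog = max(TimeGenerated), LogLines = count() by Computer
| order by LastLog desc
```

Running this during a calm period establishes a baseline, so a missing node is obvious during an incident.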
Alerting Patterns¶
Metric alerts are useful for rapid detection, while log alerts are better for Kubernetes-specific state and event analysis.
Metric alert for node CPU saturation¶
az monitor metrics alert create \
--name "aks-node-cpu-high" \
--resource-group "my-resource-group" \
--scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.ContainerService/managedClusters/my-aks-cluster" \
--condition "avg node_cpu_usage_percentage > 85" \
--description "AKS node CPU usage is above 85 percent" \
--window-size "5m" \
--evaluation-frequency "1m" \
--severity 2 \
--action "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
Scheduled query alert for failed pods¶
az monitor scheduled-query create \
--name "aks-failed-pods" \
--resource-group "my-resource-group" \
--scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
--condition "count 'KubePodInventory | where TimeGenerated > ago(5m) | where PodStatus in (\"Failed\", \"Unknown\")' > 0" \
--description "One or more AKS pods entered Failed or Unknown state" \
--evaluation-frequency "5m" \
--window-size "5m" \
--severity 2 \
--action-groups "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
What to alert on first¶
- High priority
- Node NotReady
- Repeated pod restarts
- Image pull failures
- Persistent volume mount failures
- Medium priority
- Namespace-specific latency increase
- HPA scaling limits reached
- Elevated warning event volume
- Low priority / dashboard only
- Temporary pod rescheduling during rollouts
- Short-lived cluster autoscaler adjustments
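Several of the high-priority signals above surface as Kubernetes warning events. A sketch that counts them; the Reason values listed are common ones but vary by Kubernetes version, so adjust the list to match your environment:

```kusto
// Count high-priority failure events (image pulls, mounts, scheduling) per object
KubeEvents
| where TimeGenerated > ago(15m)
| where KubeEventType == "Warning"
| where Reason in ("Failed", "BackOff", "FailedScheduling", "FailedMount", "FailedAttachVolume")
| summarize Occurrences = count() by Reason, Namespace, Name
| order by Occurrences desc
```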
Operational Queries for Triage¶
Namespace error hunt from container logs¶
ContainerLogV2
| where TimeGenerated > ago(30m)
| where PodNamespace == "payments"
| where LogMessage has_any ("ERROR", "Exception", "timeout", "connection refused")
| project TimeGenerated, PodName, ContainerName, LogMessage
| order by TimeGenerated desc
Detect nodes with pressure conditions¶
KubeNodeInventory
| where TimeGenerated > ago(30m)
| summarize arg_max(TimeGenerated, Status) by ClusterName, Computer
| where Status has_any ("DiskPressure", "MemoryPressure", "PIDPressure")
| project ClusterName, Computer, Status
Identify noisy namespaces by log volume¶
ContainerLogV2
| where TimeGenerated > ago(1h)
| summarize LogLines=count() by PodNamespace
| order by LogLines desc
Sample output:
PodNamespace  LogLines
------------- --------
payments      18234
orders        9411
ingress-nginx 5228
monitoring    1950
Dashboard and Workbook Ideas¶
Build one AKS workbook per cluster or environment with the following sections:
- Cluster overview
- Node count
- Ready vs NotReady nodes
- Total running pods
- Workload stability
- Top restarting pods
- Warning events by namespace
- Failed deployments in the last 24 hours
- Resource pressure
- CPU and memory utilization by node pool
- Pending pods caused by capacity constraints
- Application symptoms
- Error log trend by namespace
- Latency histograms from Prometheus or Application Insights
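For the cluster-overview tiles, a Ready vs NotReady count can be derived from the latest KubeNodeInventory record per node. A sketch:

```kusto
// Take the most recent status per node, then count nodes in each state
KubeNodeInventory
| where TimeGenerated > ago(10m)
| summarize arg_max(TimeGenerated, Status) by Computer
| summarize Nodes = count() by Status
```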
Troubleshooting Checklist¶
When alerts fire, review evidence in this order:
- Cluster-wide state
- Are nodes healthy?
- Is a node pool unavailable?
- Kubernetes events
- Are there scheduling or image pull warnings?
- Pod stability
- Are restarts increasing?
- Is a rollout stuck?
- Container logs
- Is the problem isolated to one namespace or deployment?
- Prometheus service metrics
- Do request rate, error rate, and latency confirm customer impact?
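For the "is a rollout stuck" step, counting pods held in Pending per namespace is a quick signal. A sketch using the latest inventory record per pod:

```kusto
// Latest status per pod; persistent Pending usually means capacity or scheduling constraints
KubePodInventory
| where TimeGenerated > ago(30m)
| summarize arg_max(TimeGenerated, PodStatus) by Name, Namespace
| where PodStatus == "Pending"
| summarize PendingPods = count() by Namespace
```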
Cost and Retention Notes¶
- Container logs can become the largest AKS monitoring cost driver.
- Retain verbose stdout/stderr only as long as needed for incident response.
- Use Prometheus metrics for high-frequency infrastructure trends instead of exporting every event as logs.
- Keep log alerts focused on state changes and failure signals rather than every warning line.
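To verify which data types actually drive ingestion cost, the workspace's own Usage table can be summarized (Quantity is reported in MB):

```kusto
// Billable ingestion per data type over the last day, in GB
Usage
| where TimeGenerated > ago(1d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024.0 by DataType
| order by IngestedGB desc
```

If ContainerLogV2 dominates this list, tightening stdout/stderr retention or filtering noisy namespaces is usually the first lever to pull.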