VM Observability¶
Monitoring Azure Virtual Machines involves collecting data from the host, the guest operating system (OS), and the workloads running within. This is primarily achieved using the Azure Monitor Agent (AMA) and VM Insights.
Data Flow Diagram¶
graph TD
subgraph VM[Virtual Machine]
App[Application] -->|Logs/Metrics| AMA[Azure Monitor Agent]
OS[Guest OS] -->|Events/Performance| AMA
end
AMA -->|DCR| AM[Azure Monitor]
AM -->|Logs| LAW[Log Analytics Workspace]
AM -->|Metrics| AMMetrics[Azure Monitor Metrics]
LAW -->|Visualization| VMInsights[VM Insights]
Core Components¶
- Azure Monitor Agent (AMA): The primary agent for collecting guest OS telemetry. It replaces legacy agents like the Log Analytics agent and Diagnostics extension.
- Data Collection Rules (DCR): Define what data to collect from the agent and where to send it. DCRs provide granular control over data ingestion.
- VM Insights: A feature that provides a simplified onboarding experience and pre-defined visualizations for performance, health, and dependencies (Map).
For production operations, think of the stack as three layers:
- Platform layer
- Azure Monitor metrics from the VM resource
- Activity Log and resource health events
- Guest OS layer
- Heartbeat
- Windows Event logs or Linux Syslog
- Guest performance counters through AMA and DCRs
- Experience layer
- VM Insights workbooks
- Fleet dashboards
- Alert rules and log queries
If only the platform layer is enabled, CPU or availability issues are visible but root-cause evidence inside the guest remains missing. If only the guest layer is enabled, Azure-side maintenance or resource health signals can be overlooked.
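The blind spots described above can be made explicit with a small lookup. This is a minimal Python sketch; the layer names, the evidence labels, and the `blind_spots` helper are illustrative assumptions rather than part of any Azure SDK.

```python
# Hypothetical mapping from a monitoring layer to the evidence it contributes.
LAYER_EVIDENCE = {
    "platform": {"platform metrics", "activity log", "resource health"},
    "guest": {"heartbeat", "os logs", "guest performance counters"},
}


def blind_spots(enabled_layers):
    """Return the evidence types that are unavailable for the enabled layers."""
    missing = set()
    for layer, evidence in LAYER_EVIDENCE.items():
        if layer not in enabled_layers:
            missing |= evidence
    return sorted(missing)


# With only the platform layer enabled, all guest-side evidence is missing.
print(blind_spots({"platform"}))
```

A check like this could run against an inventory export to flag VMs that have only one of the two data-producing layers enabled.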
Configuration Examples¶
Installing Azure Monitor Agent via CLI¶
To install the AMA extension on a Linux VM:
az vm extension set \
--name "AzureMonitorLinuxAgent" \
--publisher "Microsoft.Azure.Monitor" \
--resource-group "my-resource-group" \
--vm-name "my-linux-vm" \
--enable-auto-upgrade true
Associating a DCR via CLI¶
After creating a Data Collection Rule, associate it with a VM:
az monitor data-collection rule association create \
--name "my-vm-dcr-association" \
--resource "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachines/{vmName}" \
--rule-id "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Insights/dataCollectionRules/{dcrName}"
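When associating many VMs, the two resource IDs are usually built from the same few values. The sketch below assembles them and the argument list for the command above; the helper names and the sample values are hypothetical, and the script only builds the command, it does not run it.

```python
def vm_resource_id(subscription_id, resource_group, vm_name):
    """Build the ARM resource ID of a virtual machine."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
    )


def dcr_resource_id(subscription_id, resource_group, dcr_name):
    """Build the ARM resource ID of a Data Collection Rule."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Insights/dataCollectionRules/{dcr_name}"
    )


def association_command(name, vm_id, dcr_id):
    """Return the CLI invocation as an argument list (suitable for subprocess.run)."""
    return [
        "az", "monitor", "data-collection", "rule", "association", "create",
        "--name", name,
        "--resource", vm_id,
        "--rule-id", dcr_id,
    ]


# Placeholder values for illustration only.
sub = "00000000-0000-0000-0000-000000000000"
vm_id = vm_resource_id(sub, "my-resource-group", "my-linux-vm")
dcr_id = dcr_resource_id(sub, "my-resource-group", "dcr-vm-perf")
print(" ".join(association_command("my-vm-dcr-association", vm_id, dcr_id)))
```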
AMA vs Legacy Agent Comparison¶
Microsoft recommends Azure Monitor Agent for new deployments because it separates collection policy from agent installation and aligns with Data Collection Rules.
| Capability | Azure Monitor Agent (AMA) | Legacy Log Analytics / Diagnostics agents |
|---|---|---|
| Collection control | Uses DCRs for centralized policy | Configuration is tied more directly to each VM or extension |
| Destinations | Supports modern Azure Monitor routing patterns | Older, less flexible collection model |
| New feature investment | Current strategic agent | Legacy path; not where new monitoring features land |
| VM Insights alignment | Native onboarding path | Transitional or legacy approach |
| Fleet governance | Better for standardized policy at scale | Harder to keep consistent across large estates |
Operationally, this means you should standardize on AMA for new VM onboarding and use DCRs as the source of truth for guest telemetry collection.
KQL Query Examples¶
Monitor VM Heartbeat¶
Verify that your virtual machines are actively reporting to the workspace.
Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| order by LastHeartbeat desc
Analyze CPU Performance Counters¶
Retrieve CPU utilization trends for all monitored VMs.
InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize AvgCPU = avg(Val) by Computer, bin(TimeGenerated, 15m)
| render timechart
Search System Event Logs (Windows)¶
Count error-level events in the Windows System event log by source and event ID.
Event
| where EventLog == "System" and EventLevelName == "Error"
| summarize count() by Source, EventID
| order by count_ desc
Detect Missing Heartbeats¶
Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer, OSType
| extend MinutesSinceHeartbeat = datetime_diff('minute', now(), LastHeartbeat)
| where MinutesSinceHeartbeat > 10
| order by MinutesSinceHeartbeat desc
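The staleness logic can be checked offline before it goes into an alert. In this minimal Python sketch, elapsed time is computed as current time minus last heartbeat, so it is positive for heartbeats in the past. The `stale_computers` helper and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_computers(last_heartbeats, now, threshold_minutes=10):
    """Return {computer: minutes_since_heartbeat} for computers past the threshold."""
    stale = {}
    for computer, last in last_heartbeats.items():
        minutes = int((now - last).total_seconds() // 60)
        if minutes > threshold_minutes:
            stale[computer] = minutes
    return stale


# Hypothetical sample: one VM went quiet 17 minutes ago, one is healthy.
now = datetime(2026, 4, 6, 1, 10, tzinfo=timezone.utc)
sample = {
    "vm-linux-01": now - timedelta(minutes=17),
    "vm-win-01": now - timedelta(minutes=2),
}
print(stale_computers(sample, now))  # {'vm-linux-01': 17}
```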
Review Linux Syslog Errors¶
Syslog
| where TimeGenerated > ago(4h)
| where SeverityLevel in ("err", "crit", "alert", "emerg")
| project TimeGenerated, Computer, ProcessName, SyslogMessage
| order by TimeGenerated desc
Sample output:
TimeGenerated Computer ProcessName SyslogMessage
------------------------- ------------- ------------ ---------------------------------------------
2026-04-06T00:52:00Z vm-linux-01 systemd Failed to start contoso-agent.service
2026-04-06T00:51:00Z vm-linux-01 kernel Out of memory: Killed process 4217 (python)
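The severity filter in the query is an explicit list, but it amounts to "err or worse". The following sketch expresses that ordering directly. The record shape is a simplified assumption that borrows only the `SeverityLevel` and `SyslogMessage` column names, and the `info` record is invented for contrast.

```python
# Conventional syslog severity keywords, most severe first.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def at_or_above(records, minimum="err"):
    """Keep records whose severity is at least as severe as `minimum`."""
    cutoff = SEVERITIES.index(minimum)
    return [r for r in records if SEVERITIES.index(r["SeverityLevel"]) <= cutoff]


records = [
    {"SeverityLevel": "err", "SyslogMessage": "Failed to start contoso-agent.service"},
    {"SeverityLevel": "info", "SyslogMessage": "Started Session 12 of user azureuser."},
    {"SeverityLevel": "crit", "SyslogMessage": "Out of memory: Killed process 4217 (python)"},
]
for record in at_or_above(records):
    print(record["SeverityLevel"], record["SyslogMessage"])
```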
Monitoring Baseline¶
For Azure Virtual Machines, build your baseline around these four evidence streams:
- Reachability and heartbeat
- Heartbeat freshness
- Agent health
- Performance
- CPU, memory, disk, and network saturation
- Process-level anomalies if collected
- Operating system logs
- Windows Event logs
- Linux Syslog
- Change visibility
- Extension changes
- DCR association changes
- Planned maintenance or reboots
CLI Workflow¶
Verify Azure Monitor Agent extension¶
az vm extension list \
--resource-group "my-resource-group" \
--vm-name "my-linux-vm" \
--output table
Sample output:
Name Publisher ProvisioningState
----------------------- ------------------------ -----------------
AzureMonitorLinuxAgent Microsoft.Azure.Monitor Succeeded
Review DCR associations¶
az monitor data-collection rule association list \
--resource "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/my-linux-vm"
Sample output:
[
{
"name": "my-vm-dcr-association",
"dataCollectionRuleId": "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/dataCollectionRules/dcr-vm-perf"
}
]
Query recent heartbeats¶
az monitor log-analytics query \
--workspace "law-monitoring-prod" \
--analytics-query "Heartbeat | where TimeGenerated > ago(30m) | summarize LastHeartbeat=max(TimeGenerated) by Computer" \
--output table
Sample output:
Computer LastHeartbeat
------------- -------------------------
vm-linux-01 2026-04-06T01:02:10.000Z
vm-win-01 2026-04-06T01:02:03.000Z
Diagnostic Settings and Collection Strategy¶
VM monitoring uses two different configuration paths that are often confused:
- Diagnostic settings export platform-level signals such as VM metrics and subscription or resource-level Azure events.
- AMA + DCR collect guest operating system logs and performance counters from inside the VM.
Use both. Diagnostic settings alone do not replace guest telemetry, and DCRs alone do not capture Azure-side platform events.
VM resource diagnostic settings baseline¶
For the VM resource, enable metrics export so platform metrics are available centrally.
az monitor diagnostic-settings create \
--name "diag-vm-platform-metrics" \
--resource "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/my-linux-vm" \
--workspace "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
--metrics '[
{
"category": "AllMetrics",
"enabled": true
}
]'
Subscription Activity Log categories to correlate with VM incidents¶
For platform-side change and outage visibility, route these Activity Log categories to the same workspace used for VM investigations:
- Administrative
- ResourceHealth
- ServiceHealth
- Alert
az monitor diagnostic-settings subscription create \
--name "diag-subscription-platform-events" \
--subscription "<subscription-id>" \
--workspace "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
--logs '[
{
"category": "Administrative",
"enabled": true
},
{
"category": "ResourceHealth",
"enabled": true
},
{
"category": "ServiceHealth",
"enabled": true
},
{
"category": "Alert",
"enabled": true
}
]'
Why this matters¶
When a VM restarts or becomes unreachable, guest logs may stop abruptly. Subscription-level ResourceHealth and ServiceHealth events help you determine whether the interruption was caused by Azure platform maintenance, a host issue, or a guest OS problem.
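As a rough illustration of that triage decision, the sketch below labels a heartbeat gap by checking whether any platform health event occurred close to the last heartbeat. The `classify_gap` helper, the 15-minute proximity window, and the labels are assumptions chosen for the example. Real incidents need the event details, not just the timestamps.

```python
from datetime import datetime, timedelta, timezone


def classify_gap(last_heartbeat, platform_event_times, window=timedelta(minutes=15)):
    """Return 'platform-suspected' if a platform event is near the last heartbeat."""
    for event_time in platform_event_times:
        if abs(event_time - last_heartbeat) <= window:
            return "platform-suspected"
    return "guest-suspected"


# Hypothetical timestamps for illustration.
last_seen = datetime(2026, 4, 6, 0, 45, tzinfo=timezone.utc)
health_events = [last_seen + timedelta(minutes=13)]
print(classify_gap(last_seen, health_events))
```

A "guest-suspected" result only means no nearby platform event was found in the workspace. It does not prove the guest OS was at fault.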
Performance Counter Collection Configuration¶
Performance counters are where cost and diagnostic usefulness must be balanced carefully.
Recommended guest counter baseline¶
- Windows
- \Processor(_Total)\% Processor Time
- \Memory\Available MBytes
- \LogicalDisk(_Total)\% Free Space
- \LogicalDisk(_Total)\Disk Transfers/sec
- Linux
- Processor utilization
- Available memory
- Filesystem usage
- Network throughput
Create a DCR with performance counters¶
az monitor data-collection rule create \
--resource-group "my-resource-group" \
--name "dcr-vm-perf" \
--location "koreacentral" \
--data-flows '[
{
"streams": ["Microsoft-InsightsMetrics"],
"destinations": ["la-workspace"]
}
]' \
--destinations '{
"logAnalytics": [
{
"workspaceResourceId": "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod",
"name": "la-workspace"
}
]
}' \
--data-sources '{
"performanceCounters": [
{
"name": "vmPerfCounters",
"streams": ["Microsoft-InsightsMetrics"],
"samplingFrequencyInSeconds": 60,
"counterSpecifiers": [
"\\Processor(_Total)\\% Processor Time",
"\\Memory\\Available MBytes",
"\\LogicalDisk(_Total)\\% Free Space"
]
}
]
}'
The exact counter set can differ by operating system, but the pattern stays the same: keep a small high-value baseline at 60-second frequency and add specialized counters only when the workload justifies them.
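Hand-escaping backslashes in counter paths is error-prone, so one option is to generate the `--data-sources` payload instead. This sketch builds the same structure with Python's `json` module, which escapes each backslash for you. The `perf_data_sources` helper is hypothetical; the field names mirror the command above.

```python
import json


def perf_data_sources(counters, frequency_seconds=60, stream="Microsoft-InsightsMetrics"):
    """Build the performance-counter data source section of a DCR."""
    return {
        "performanceCounters": [
            {
                "name": "vmPerfCounters",
                "streams": [stream],
                "samplingFrequencyInSeconds": frequency_seconds,
                "counterSpecifiers": list(counters),
            }
        ]
    }


# Raw strings hold the counter paths exactly as Windows displays them.
counters = [
    r"\Processor(_Total)\% Processor Time",
    r"\Memory\Available MBytes",
    r"\LogicalDisk(_Total)\% Free Space",
]
payload = json.dumps(perf_data_sources(counters), indent=2)
print(payload)
```

The printed JSON can be saved to a file or passed to the CLI. Each single backslash in a counter path appears as `\\` in the serialized text.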
Additional KQL for Guest OS and VM Insights Analysis¶
Find memory pressure before heartbeat loss¶
InsightsMetrics
| where TimeGenerated > ago(6h)
| where Origin == "vm.azm.ms"
| where Namespace == "Memory" and Name in ("AvailableMB", "AvailableMBs")
| summarize MinAvailableMemory=min(Val), AvgAvailableMemory=avg(Val) by Computer, bin(TimeGenerated, 15m)
| order by TimeGenerated desc
Sample output:
| Computer | TimeGenerated | MinAvailableMemory | AvgAvailableMemory | Interpretation |
|---|---|---|---|---|
| vm-linux-01 | 2026-04-06T00:45:00Z | 182 | 240 | Memory pressure likely contributed to instability; correlate with Syslog or kernel OOM events. |
| vm-win-01 | 2026-04-06T00:45:00Z | 3240 | 3395 | Memory is healthy; investigate CPU, disk, or application-specific causes instead. |
Correlate heartbeat gaps with platform health events¶
let MissingHeartbeat =
Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId
| extend MinutesSinceHeartbeat = datetime_diff('minute', now(), LastHeartbeat)
| where MinutesSinceHeartbeat > 10;
AzureActivity
| where TimeGenerated > ago(24h)
| where CategoryValue in ("Administrative", "ResourceHealth", "ServiceHealth")
| project TimeGenerated, ResourceId, OperationNameValue, ActivityStatusValue
| join kind=leftouter MissingHeartbeat on $left.ResourceId == $right._ResourceId
Sample output:
| TimeGenerated | ResourceId | OperationNameValue | ActivityStatusValue | Computer | MinutesSinceHeartbeat | Interpretation |
|---|---|---|---|---|---|---|
| 2026-04-06T00:58:00Z | /subscriptions/ | Microsoft.ResourceHealth/healthevent/Activated/action | Active | vm-linux-01 | 17 | Missing heartbeat may align with Azure-side platform health activity. |
Surface noisy event log sources¶
Event
| where TimeGenerated > ago(12h)
| summarize EventCount=count() by Computer, Source, EventLevelName
| order by EventCount desc
Sample output:
| Computer | Source | EventLevelName | EventCount | Interpretation |
|---|---|---|---|---|
| vm-win-01 | Service Control Manager | Error | 46 | Repeated service restarts are a likely primary symptom. |
| vm-win-01 | Disk | Warning | 18 | Storage subsystem issues may be contributing to degraded performance. |
VM Insights Setup and Capabilities¶
VM Insights provides a quicker operator experience than raw queries alone because it layers fleet visualizations and dependency views on top of AMA-collected telemetry.
What VM Insights is best for¶
- Fleet-wide performance comparison
- Fast identification of unhealthy machines
- Dependency map review for connected processes and endpoints
- Out-of-box charts for CPU, memory, disk, and network trends
What it does not replace¶
- DCR design
- KQL-based incident-specific investigations
- Workload-specific application telemetry
Use VM Insights as the first stop for posture and trend review, then move to Logs when you need detailed OS evidence or cross-resource correlation.
Guest OS Metrics Collection Guidance¶
Guest OS metrics are more precise than Azure resource metrics for many operating system investigations because they reflect the VM interior rather than only the hypervisor view.
Use guest metrics for¶
- Available memory and swap pressure
- Filesystem free space inside the guest
- Per-disk queue or transfer rates
- Process and service troubleshooting when combined with Event or Syslog data
Use platform metrics for¶
- High-level CPU trend monitoring
- Fast alerting with minimal ingestion cost
- Azure resource-centric dashboards shared across teams
The strongest operating model is to alert first on platform metrics, then diagnose with guest metrics and logs.
Practical Alert Examples¶
Alert on missing heartbeat¶
az monitor scheduled-query create \
--name "vm-missing-heartbeat" \
--resource-group "my-resource-group" \
--scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
--condition "count 'Heartbeat | summarize LastHeartbeat=max(TimeGenerated) by Computer | extend MinutesSinceHeartbeat = datetime_diff(\"minute\", now(), LastHeartbeat) | where MinutesSinceHeartbeat > 10' > 0" \
--description "One or more virtual machines stopped sending heartbeats" \
--evaluation-frequency "5m" \
--window-size "1h" \
--severity 1 \
--action-groups "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
Alert on sustained CPU saturation¶
az monitor metrics alert create \
--name "vm-cpu-high" \
--resource-group "my-resource-group" \
--scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/my-linux-vm" \
--condition "avg Percentage CPU > 85" \
--window-size "15m" \
--evaluation-frequency "5m" \
--severity 2 \
--description "Virtual machine CPU usage is above 85 percent" \
--action "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
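To see why an average over a 15-minute window reacts to sustained load but not to a brief spike, consider this simplified sketch. It assumes one CPU sample per minute and is not a model of how Azure Monitor evaluates metric alerts internally; `should_fire` and the sample series are hypothetical.

```python
def should_fire(cpu_samples, threshold=85.0, window_size=15):
    """Return True if the mean of the most recent `window_size` samples exceeds the threshold."""
    window = cpu_samples[-window_size:]
    if not window:
        return False
    return sum(window) / len(window) > threshold


sustained = [90.0] * 15            # 15 minutes at 90 percent
brief_spike = [50.0] * 14 + [100.0]  # one minute at 100 percent

print(should_fire(sustained))    # True
print(should_fire(brief_spike))  # False
```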
Investigation Workflow¶
- Heartbeat
- Did the VM stop sending data entirely?
- Performance counters / InsightsMetrics
- Was CPU, memory, or disk already degrading before the symptom?
- OS logs
- Are there service failures, kernel issues, or driver errors?
- Recent changes
- Was the AMA extension updated?
- Did the DCR change?
- Workload context
- Is the issue application-specific or host-wide?
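The workflow above is ordered, so a runbook can encode it as a list and report the earliest step that shows something abnormal. This is a minimal sketch; the step keys and the `first_finding` helper are assumptions for illustration.

```python
# Steps in the order given by the investigation workflow.
TRIAGE_STEPS = [
    ("heartbeat", "Did the VM stop sending data entirely?"),
    ("performance", "Was CPU, memory, or disk already degrading before the symptom?"),
    ("os_logs", "Are there service failures, kernel issues, or driver errors?"),
    ("recent_changes", "Was the AMA extension updated, or did the DCR change?"),
    ("workload", "Is the issue application-specific or host-wide?"),
]


def first_finding(findings):
    """Return the first (step, question) whose finding is abnormal, else None.

    `findings` maps a step key to True when that check looked abnormal.
    """
    for step, question in TRIAGE_STEPS:
        if findings.get(step):
            return step, question
    return None


print(first_finding({"heartbeat": False, "performance": True, "os_logs": True}))
```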
Windows and Linux Coverage Guidance¶
- Windows VMs
- Collect System and Application event logs for baseline operations
- Add Security logs only when required and sized appropriately
- Linux VMs
- Collect Syslog with severity filters to control ingestion
- Include performance counters for CPU, memory, filesystem, and network activity
- Both
- Use the same workspace naming and DCR pattern across environments for easier fleet queries
Workbook Suggestions¶
- Fleet heartbeat status
- Top VMs by CPU and memory utilization
- Windows error events by source and event ID
- Linux Syslog error trend by host
- DCR coverage view to find VMs missing associations
Dashboard and Workbook Recommendations¶
Built-in views to rely on first¶
- VM Insights performance workbook
- CPU, memory, disk, and network trend views
- Fast drill-down from fleet to individual VM
- Metrics explorer
- Best for low-latency metric alert tuning
- Log Analytics workbooks
- Best for combining Heartbeat, InsightsMetrics, Event, Syslog, and AzureActivity data
Recommended workbook tabs¶
- Fleet health
- Last heartbeat by VM
- VMs missing AMA extension
- VMs without DCR association
- Performance
- CPU, available memory, disk free space
- Top resource-saturated VMs
- Operating system evidence
- Windows Event errors by source
- Linux Syslog critical entries by process
- Platform correlation
- Resource health events
- Administrative changes
- Alert firing history by VM
Dashboard pins¶
- Pin Last heartbeat by VM.
- Pin Top VMs by CPU.
- Pin Top VMs by low memory.
- Pin a log tile for Windows Event errors and another for Linux Syslog critical messages.
Common Pitfalls¶
Mistake 1: Assuming AMA installation alone means monitoring is complete¶
What happens: The extension exists, but no DCR is attached, so expected guest logs and counters never arrive.
Correction: Validate both extension health and DCR association for every monitored VM.
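One way to automate that validation is to compare the JSON output of the extension list and the DCR association list for each VM. The sketch below assumes simplified output shapes, with a `name` field per extension and a `dataCollectionRuleId` field per association, as in the sample outputs earlier in this guide. The `coverage_gap` helper is hypothetical.

```python
import json

# Extension names used by the Azure Monitor Agent on Linux and Windows.
AMA_EXTENSION_NAMES = {"AzureMonitorLinuxAgent", "AzureMonitorWindowsAgent"}


def coverage_gap(extensions_json, associations_json):
    """Return a description of the monitoring gap for one VM, or None if covered."""
    extensions = json.loads(extensions_json)
    associations = json.loads(associations_json)
    has_ama = any(ext.get("name") in AMA_EXTENSION_NAMES for ext in extensions)
    has_dcr = any(assoc.get("dataCollectionRuleId") for assoc in associations)
    if not has_ama:
        return "missing AMA extension"
    if not has_dcr:
        return "AMA installed but no DCR associated"
    return None


# Illustrative inputs: the agent is present, but no rule is attached.
extensions_output = '[{"name": "AzureMonitorLinuxAgent"}]'
associations_output = "[]"
print(coverage_gap(extensions_output, associations_output))
```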
Mistake 2: Collecting too many event logs or counters by default¶
What happens: Ingestion cost grows quickly, and operators struggle to find the high-signal data.
Correction: Start with a small baseline of heartbeat, essential counters, and targeted Windows or Linux logs, then expand only for justified scenarios.
Mistake 3: Ignoring subscription-level platform events during outages¶
What happens: Teams investigate guest OS logs for hours even when Azure resource health already explains the interruption.
Correction: Route ResourceHealth and ServiceHealth to the same workspace and check them early in the incident flow.
Cost Notes¶
- Heartbeat and core performance counters are low-cost and high-value; collect them everywhere.
- Security or verbose application logs can dominate ingestion cost if sent without filters.
- Use DCRs to narrow Windows Event IDs and Linux Syslog facilities instead of collecting everything by default.
Cost Considerations¶
VM observability cost usually scales with guest log breadth rather than with the core VM metric set.
- Low-cost baseline
- Heartbeat
- Small set of performance counters at 60-second sampling
- Targeted System/Application events or Syslog severity filtering
- Higher-cost patterns
- Broad Windows Security log collection
- Verbose application logs forwarded from the guest
- High-frequency counters with little operational value
- Practical estimate
- A baseline of heartbeat plus a few counters is often measured in MB/day per VM, while verbose security or application logging can increase that by an order of magnitude.
- Optimization tips
- Use one DCR baseline per environment and attach workload-specific add-on DCRs only where needed.
- Filter Linux Syslog by severity and facility.
- Restrict Windows Event collection to required channels and IDs.
- Review per-table ingestion monthly to confirm Event, Syslog, or custom logs are not crowding out the baseline signals.
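The order-of-magnitude comparison can be worked through with simple arithmetic. The record rates and the average record size below are made-up placeholders, not measurements. Substitute figures from your own per-table ingestion data.

```python
def daily_ingestion_mb(records_per_minute, avg_record_bytes):
    """Estimate daily ingestion in MB for one VM from a record rate and record size."""
    bytes_per_day = records_per_minute * 60 * 24 * avg_record_bytes
    return bytes_per_day / (1024 * 1024)


# Placeholder inputs: a small baseline versus ten times the record rate.
baseline = daily_ingestion_mb(records_per_minute=5, avg_record_bytes=500)
verbose = daily_ingestion_mb(records_per_minute=50, avg_record_bytes=500)

print(f"baseline: {baseline:.2f} MB/day, verbose: {verbose:.2f} MB/day")
print(f"ratio: {verbose / baseline:.0f}x")
```

Multiplying the per-VM figure by fleet size gives a first approximation of how much a logging change would add to workspace ingestion.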