VM Observability¶

Monitoring Azure Virtual Machines involves collecting data from the host, the guest operating system (OS), and the workloads running inside the guest. The Azure Monitor Agent (AMA) and VM Insights are the primary tools for this.

Data Flow Diagram¶

graph TD
    subgraph VM[Virtual Machine]
        App[Application] -->|Logs/Metrics| AMA[Azure Monitor Agent]
        OS[Guest OS] -->|Events/Performance| AMA
    end
    AMA -->|DCR| AM[Azure Monitor]
    AM -->|Logs| LAW[Log Analytics Workspace]
    AM -->|Metrics| AMMetrics[Azure Monitor Metrics]
    LAW -->|Visualization| VMInsights[VM Insights]

Core Components¶

Azure Monitor Agent (AMA): The primary agent for collecting guest OS telemetry. It replaces legacy agents like the Log Analytics agent and Diagnostics extension.
Data Collection Rules (DCR): Define what data to collect from the agent and where to send it. DCRs provide granular control over data ingestion.
VM Insights: A feature that provides a simplified onboarding experience and pre-defined visualizations for performance, health, and dependencies (Map).

For production operations, think of the stack as three layers:

Platform layer
- Azure Monitor metrics from the VM resource
- Activity Log and resource health events
Guest OS layer
- Heartbeat
- Windows Event logs or Linux Syslog
- Guest performance counters through AMA and DCRs
Experience layer
- VM Insights workbooks
- Fleet dashboards
- Alert rules and log queries

If only the platform layer is enabled, CPU or availability issues are visible but root-cause evidence inside the guest remains missing. If only the guest layer is enabled, Azure-side maintenance or resource health signals can be overlooked.

Configuration Examples¶

Installing Azure Monitor Agent via CLI¶

To install the AMA extension on a Linux VM:

az vm extension set \
    --name "AzureMonitorLinuxAgent" \
    --publisher "Microsoft.Azure.Monitor" \
    --resource-group "my-resource-group" \
    --vm-name "my-linux-vm" \
    --enable-auto-upgrade true

Associating a DCR via CLI¶

After creating a Data Collection Rule, associate it with a VM:

az monitor data-collection rule association create \
    --name "my-vm-dcr-association" \
    --resource "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachines/{vmName}" \
    --rule-id "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Insights/dataCollectionRules/{dcrName}"
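Both long arguments in the command above are ARM resource IDs with a fixed shape. The following is a minimal Python sketch of composing them from their parts before handing them to the CLI. The helper names (`vm_resource_id`, `dcr_resource_id`, `association_command`) are hypothetical and not part of any Azure SDK; the sketch only builds strings and runs nothing against Azure.

```python
from typing import List


def vm_resource_id(subscription_id: str, resource_group: str, vm_name: str) -> str:
    """Compose the ARM resource ID of a virtual machine."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
    )


def dcr_resource_id(subscription_id: str, resource_group: str, dcr_name: str) -> str:
    """Compose the ARM resource ID of a Data Collection Rule."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Insights/dataCollectionRules/{dcr_name}"
    )


def association_command(name: str, vm_id: str, dcr_id: str) -> List[str]:
    """Return the argument list for the association command shown above."""
    return [
        "az", "monitor", "data-collection", "rule", "association", "create",
        "--name", name,
        "--resource", vm_id,
        "--rule-id", dcr_id,
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    vm_id = vm_resource_id("<subscription-id>", "my-resource-group", "my-linux-vm")
    dcr_id = dcr_resource_id("<subscription-id>", "my-resource-group", "dcr-vm-perf")
    print(" ".join(association_command("my-vm-dcr-association", vm_id, dcr_id)))
```

Building the IDs in one place helps when the same association has to be repeated across many VMs, because a typo in a provider segment otherwise only surfaces as a failed CLI call.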

AMA vs Legacy Agent Comparison¶

Microsoft recommends Azure Monitor Agent for new deployments because it separates collection policy from agent installation and aligns with Data Collection Rules.

Capability	Azure Monitor Agent (AMA)	Legacy Log Analytics / Diagnostics agents
Collection control	Uses DCRs for centralized policy	Configuration is tied more directly to each VM or extension
Destinations	Supports modern Azure Monitor routing patterns	Older, less flexible collection model
New feature investment	Current strategic agent	Legacy path; not where new monitoring features land
VM Insights alignment	Native onboarding path	Transitional or legacy approach
Fleet governance	Better for standardized policy at scale	Harder to keep consistent across large estates

Operationally, this means you should standardize on AMA for new VM onboarding and use DCRs as the source of truth for guest telemetry collection.

KQL Query Examples¶

Monitor VM Heartbeat¶

Verify that your virtual machines are actively reporting to the workspace.

Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| order by LastHeartbeat desc

Analyze CPU Performance Counters¶

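Retrieve CPU utilization trends for all monitored VMs. The query reads the InsightsMetrics table, so it assumes VM Insights is enabled and populating that table.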

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize AvgCPU = avg(Val) by Computer, bin(TimeGenerated, 15m)
| render timechart

Search System Event Logs (Windows)¶

Find error-level events in the Windows System event log and rank them by source and event ID.

Event
| where EventLog == "System" and EventLevelName == "Error"
| summarize count() by Source, EventID
| order by count_ desc

Detect Missing Heartbeats¶

Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer, OSType
| extend MinutesSinceHeartbeat = datetime_diff('minute', now(), LastHeartbeat)
| where MinutesSinceHeartbeat > 10
| order by MinutesSinceHeartbeat desc
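The staleness check in this query is a plain "now minus last seen" comparison. A minimal Python sketch of the same logic, applied to rows that might have been exported from the Heartbeat table, makes the sign convention explicit: the elapsed time must come out positive for the threshold to match. The function name `stale_computers` and the sample timestamps are hypothetical, and the whole-minute rounding only approximates how KQL's `datetime_diff` counts minutes.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Tuple


def stale_computers(
    last_heartbeats: Dict[str, datetime],
    threshold_minutes: int = 10,
    now: Optional[datetime] = None,
) -> List[Tuple[str, int]]:
    """Return (computer, minutes_since_heartbeat) for machines past the threshold."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for computer, last_seen in last_heartbeats.items():
        # now - last_seen is positive when the heartbeat is in the past.
        minutes_since = int((now - last_seen).total_seconds() // 60)
        if minutes_since > threshold_minutes:
            stale.append((computer, minutes_since))
    # Longest silence first, matching the "order by ... desc" in the query.
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data for illustration.
    reference = datetime(2026, 4, 6, 1, 15, tzinfo=timezone.utc)
    sample = {
        "vm-linux-01": reference - timedelta(minutes=17),
        "vm-win-01": reference - timedelta(minutes=3),
    }
    print(stale_computers(sample, now=reference))
```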

Review Linux Syslog Errors¶

Syslog
| where TimeGenerated > ago(4h)
| where SeverityLevel in ("err", "crit", "alert", "emerg")
| project TimeGenerated, Computer, ProcessName, SyslogMessage
| order by TimeGenerated desc

Sample output:

TimeGenerated              Computer       ProcessName   SyslogMessage
-------------------------  -------------  ------------  ---------------------------------------------
2026-04-06T00:52:00Z       vm-linux-01    systemd       Failed to start contoso-agent.service
2026-04-06T00:51:00Z       vm-linux-01    kernel        Out of memory: Killed process 4217 (python)

Monitoring Baseline¶

For Azure Virtual Machines, build your baseline around these four evidence streams:

Reachability and heartbeat
- Heartbeat freshness
- Agent health
Performance
- CPU, memory, disk, and network saturation
- Process-level anomalies if collected
Operating system logs
- Windows Event logs
- Linux Syslog
Change visibility
- Extension changes
- DCR association changes
- Planned maintenance or reboots

CLI Workflow¶

Verify Azure Monitor Agent extension¶

az vm extension list \
    --resource-group "my-resource-group" \
    --vm-name "my-linux-vm" \
    --output table

Sample output:

Name                     Publisher                 ProvisioningState
-----------------------  ------------------------  -----------------
AzureMonitorLinuxAgent   Microsoft.Azure.Monitor   Succeeded

Review DCR associations¶

az monitor data-collection rule association list \
    --resource "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/my-linux-vm"

Sample output:

[
  {
    "name": "my-vm-dcr-association",
    "dataCollectionRuleId": "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/dataCollectionRules/dcr-vm-perf"
  }
]

Query recent heartbeats¶

# --workspace expects the workspace GUID (customer ID), not the workspace name.
az monitor log-analytics query \
    --workspace "<workspace-guid>" \
    --analytics-query "Heartbeat | where TimeGenerated > ago(30m) | summarize LastHeartbeat=max(TimeGenerated) by Computer" \
    --output table

Sample output:

Computer       LastHeartbeat
-------------  -------------------------
vm-linux-01    2026-04-06T01:02:10.000Z
vm-win-01      2026-04-06T01:02:03.000Z

Diagnostic Settings and Collection Strategy¶

VM monitoring uses two different configuration paths that are often confused:

Diagnostic settings export platform-level signals such as VM metrics and subscription or resource-level Azure events.
AMA + DCR collect guest operating system logs and performance counters from inside the VM.

Use both. Diagnostic settings alone do not replace guest telemetry, and DCRs alone do not capture Azure-side platform events.

VM resource diagnostic settings baseline¶

For the VM resource, enable metrics export so platform metrics are available centrally.

az monitor diagnostic-settings create \
    --name "diag-vm-platform-metrics" \
    --resource "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/my-linux-vm" \
    --workspace "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
    --metrics '[
        {
            "category": "AllMetrics",
            "enabled": true
        }
    ]'

Subscription Activity Log categories to correlate with VM incidents¶

For platform-side change and outage visibility, route these Activity Log categories to the same workspace used for VM investigations:

Administrative
ResourceHealth
ServiceHealth
Alert

az monitor diagnostic-settings create \
    --name "diag-subscription-platform-events" \
    --resource "/subscriptions/<subscription-id>" \
    --workspace "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
    --logs '[
        {
            "category": "Administrative",
            "enabled": true
        },
        {
            "category": "ResourceHealth",
            "enabled": true
        },
        {
            "category": "ServiceHealth",
            "enabled": true
        },
        {
            "category": "Alert",
            "enabled": true
        }
    ]'

Why this matters¶

When a VM restarts or becomes unreachable, guest logs may stop abruptly. Subscription-level ResourceHealth and ServiceHealth events help you determine whether the interruption was caused by Azure platform maintenance, a host issue, or a guest OS problem.

Performance Counter Collection Configuration¶

Performance counters are where cost and diagnostic usefulness must be balanced carefully.

Recommended guest counter baseline¶

Windows
- \Processor(_Total)\% Processor Time
- \Memory\Available MBytes
- \LogicalDisk(_Total)\% Free Space
- \LogicalDisk(_Total)\Disk Transfers/sec
Linux
- Processor utilization
- Available memory
- Filesystem usage
- Network throughput

Create a DCR with performance counters¶

az monitor data-collection rule create \
    --resource-group "my-resource-group" \
    --name "dcr-vm-perf" \
    --location "koreacentral" \
    --data-flows '[
        {
            "streams": ["Microsoft-InsightsMetrics"],
            "destinations": ["la-workspace"]
        }
    ]' \
    --destinations '{
        "logAnalytics": [
            {
                "workspaceResourceId": "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod",
                "name": "la-workspace"
            }
        ]
    }' \
    --data-sources '{
        "performanceCounters": [
            {
                "name": "vmPerfCounters",
                "streams": ["Microsoft-InsightsMetrics"],
                "samplingFrequencyInSeconds": 60,
                "counterSpecifiers": [
                    "\\Processor(_Total)\\% Processor Time",
                    "\\Memory\\Available MBytes",
                    "\\LogicalDisk(_Total)\\% Free Space"
                ]
            }
        ]
    }'

The exact counter set can differ by operating system, but the pattern stays the same: keep a small high-value baseline at 60-second frequency and add specialized counters only when the workload justifies them.
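Backslash escaping is the usual stumbling block when writing counter specifiers by hand: each Windows counter path begins with a single backslash, which JSON encodes as `\\`. The sketch below generates the `--data-sources` payload from Python so the escaping is handled by `json.dumps`. It is an illustration under assumptions: the helper name `build_perf_data_sources` is hypothetical, and the stream and counter names are simply copied from the example above.

```python
import json
from typing import Iterable

# Raw strings keep each counter path exactly as Windows spells it.
COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\Memory\Available MBytes",
    r"\LogicalDisk(_Total)\% Free Space",
]


def build_perf_data_sources(counters: Iterable[str], sampling_seconds: int = 60) -> dict:
    """Build the data-sources section of a DCR for guest performance counters."""
    return {
        "performanceCounters": [
            {
                "name": "vmPerfCounters",
                "streams": ["Microsoft-InsightsMetrics"],
                "samplingFrequencyInSeconds": sampling_seconds,
                "counterSpecifiers": list(counters),
            }
        ]
    }


# json.dumps doubles each backslash, which is what the JSON argument needs.
payload = json.dumps(build_perf_data_sources(COUNTERS), indent=2)

if __name__ == "__main__":
    print(payload)
```

Generating the payload also keeps the baseline counter list in one reviewable place, which fits the advice to keep the default set small and add specialized counters deliberately.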

Additional KQL for Guest OS and VM Insights Analysis¶

Find memory pressure before heartbeat loss¶

InsightsMetrics
| where TimeGenerated > ago(6h)
| where Origin == "vm.azm.ms"
| where Namespace == "Memory" and Name in ("AvailableMB", "AvailableMBs")
| summarize MinAvailableMemory=min(Val), AvgAvailableMemory=avg(Val) by Computer, bin(TimeGenerated, 15m)
| order by TimeGenerated desc

Sample output:

Computer	TimeGenerated	MinAvailableMemory	AvgAvailableMemory	Interpretation
vm-linux-01	2026-04-06T00:45:00Z	182	240	Memory pressure likely contributed to instability; correlate with Syslog or kernel OOM events.
vm-win-01	2026-04-06T00:45:00Z	3240	3395	Memory is healthy; investigate CPU, disk, or application-specific causes instead.

Correlate heartbeat gaps with platform health events¶

let MissingHeartbeat =
    Heartbeat
    | summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId
    | extend MinutesSinceHeartbeat = datetime_diff('minute', now(), LastHeartbeat)
    | where MinutesSinceHeartbeat > 10;
AzureActivity
| where TimeGenerated > ago(24h)
| where CategoryValue in ("Administrative", "ResourceHealth", "ServiceHealth")
| project TimeGenerated, ResourceId, OperationNameValue, ActivityStatusValue
| join kind=leftouter MissingHeartbeat on $left.ResourceId == $right._ResourceId

Sample output:

TimeGenerated	ResourceId	OperationNameValue	ActivityStatusValue	Computer	MinutesSinceHeartbeat	Interpretation
2026-04-06T00:58:00Z	/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/my-linux-vm	Microsoft.ResourceHealth/healthevent/Activated/action	Active	vm-linux-01	17	Missing heartbeat may align with Azure-side platform health activity.
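The correlation is a keyed match between two result sets on the VM resource ID. A small Python sketch of that matching step, run over already-exported rows, is shown below. It assumes resource IDs may differ in letter case between the two sources and therefore compares them case-insensitively. It also keeps only matched events, which is narrower than the `leftouter` join in the query. The function name `correlate` and the sample rows are hypothetical.

```python
from typing import Dict, List


def correlate(events: List[dict], stale_vms: Dict[str, dict]) -> List[dict]:
    """Attach heartbeat-gap details to activity events for the same resource.

    events:     rows with at least a "ResourceId" key
    stale_vms:  resource ID -> heartbeat details for VMs with a gap
    """
    # Normalize keys once so lookups ignore letter case.
    by_id = {rid.lower(): info for rid, info in stale_vms.items()}
    matched = []
    for event in events:
        info = by_id.get(event["ResourceId"].lower())
        if info is not None:
            matched.append({**event, **info})
    return matched


if __name__ == "__main__":
    # Hypothetical sample rows for illustration.
    vm = "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/my-linux-vm"
    events = [
        {"ResourceId": vm.upper(), "OperationNameValue": "Microsoft.ResourceHealth/healthevent/Activated/action"},
        {"ResourceId": "/subscriptions/<subscription-id>/resourceGroups/other/providers/Microsoft.Compute/virtualMachines/unrelated", "OperationNameValue": "Microsoft.Compute/virtualMachines/write"},
    ]
    stale = {vm: {"Computer": "vm-linux-01", "MinutesSinceHeartbeat": 17}}
    for row in correlate(events, stale):
        print(row["Computer"], row["MinutesSinceHeartbeat"], row["OperationNameValue"])
```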

Surface noisy event log sources¶

Event
| where TimeGenerated > ago(12h)
| summarize EventCount=count() by Computer, Source, EventLevelName
| order by EventCount desc

Sample output:

Computer	Source	EventLevelName	EventCount	Interpretation
vm-win-01	Service Control Manager	Error	46	Repeated service restarts are a likely primary symptom.
vm-win-01	Disk	Warning	18	Storage subsystem issues may be contributing to degraded performance.

VM Insights Setup and Capabilities¶

VM Insights provides a quicker operator experience than raw queries alone because it layers fleet visualizations and dependency views on top of AMA-collected telemetry.

What VM Insights is best for¶

Fleet-wide performance comparison
Fast identification of unhealthy machines
Dependency map review for connected processes and endpoints
Out-of-box charts for CPU, memory, disk, and network trends

What it does not replace¶

DCR design
KQL-based incident-specific investigations
Workload-specific application telemetry

Use VM Insights as the first stop for posture and trend review, then move to Logs when you need detailed OS evidence or cross-resource correlation.

Guest OS Metrics Collection Guidance¶

Guest OS metrics are more precise than Azure resource metrics for many operating system investigations because they reflect the VM interior rather than only the hypervisor view.

Use guest metrics for¶

Available memory and swap pressure
Filesystem free space inside the guest
Per-disk queue or transfer rates
Process and service troubleshooting when combined with Event or Syslog data

Use platform metrics for¶

High-level CPU trend monitoring
Fast alerting with minimal ingestion cost
Azure resource-centric dashboards shared across teams

The strongest operating model is to alert first on platform metrics, then diagnose with guest metrics and logs.

Practical Alert Examples¶

Alert on missing heartbeat¶

az monitor scheduled-query create \
    --name "vm-missing-heartbeat" \
    --resource-group "my-resource-group" \
    --scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.OperationalInsights/workspaces/law-monitoring-prod" \
    --condition "count 'Heartbeat | summarize LastHeartbeat=max(TimeGenerated) by Computer | extend MinutesSinceHeartbeat = datetime_diff(\"minute\", now(), LastHeartbeat) | where MinutesSinceHeartbeat > 10' > 0" \
    --description "One or more virtual machines stopped sending heartbeats" \
    --evaluation-frequency "5m" \
    --window-size "1h" \
    --severity 1 \
    --action-groups "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"

Alert on sustained CPU saturation¶

az monitor metrics alert create \
    --name "vm-cpu-high" \
    --resource-group "my-resource-group" \
    --scopes "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Compute/virtualMachines/my-linux-vm" \
    --condition "avg Percentage CPU > 85" \
    --window-size "15m" \
    --evaluation-frequency "5m" \
    --severity 2 \
    --description "Virtual machine CPU usage is above 85 percent" \
    --action "/subscriptions/<subscription-id>/resourceGroups/my-resource-group/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"

Investigation Workflow¶

Heartbeat
- Did the VM stop sending data entirely?
Performance counters / InsightsMetrics
- Was CPU, memory, or disk already degrading before the symptom?
OS logs
- Are there service failures, kernel issues, or driver errors?
Recent changes
- Was the AMA extension updated?
- Did the DCR change?
Workload context
- Is the issue application-specific or host-wide?

Windows and Linux Coverage Guidance¶

Windows VMs
- Collect System and Application event logs for baseline operations
- Add Security logs only when required and sized appropriately
Linux VMs
- Collect Syslog with severity filters to control ingestion
- Include performance counters for CPU, memory, filesystem, and network activity
Both
- Use the same workspace naming and DCR pattern across environments for easier fleet queries

Workbook Suggestions¶

Fleet heartbeat status
Top VMs by CPU and memory utilization
Windows error events by source and event ID
Linux Syslog error trend by host
DCR coverage view to find VMs missing associations

Dashboard and Workbook Recommendations¶

Built-in views to rely on first¶

VM Insights performance workbook
- CPU, memory, disk, and network trend views
- Fast drill-down from fleet to individual VM
Metrics explorer
- Best for low-latency metric alert tuning
Log Analytics workbooks
- Best for combining Heartbeat, InsightsMetrics, Event, Syslog, and AzureActivity data

Recommended workbook tabs¶

Fleet health
- Last heartbeat by VM
- VMs missing AMA extension
- VMs without DCR association
Performance
- CPU, available memory, disk free space
- Top resource-saturated VMs
Operating system evidence
- Windows Event errors by source
- Linux Syslog critical entries by process
Platform correlation
- Resource health events
- Administrative changes
- Alert firing history by VM

Dashboard pins¶

Pin Last heartbeat by VM.
Pin Top VMs by CPU.
Pin Top VMs by low memory.
Pin a log tile for Windows Event errors and another for Linux Syslog critical messages.

Common Pitfalls¶

Mistake 1: Assuming AMA installation alone means monitoring is complete¶

What happens: The extension exists, but no DCR is attached, so expected guest logs and counters never arrive.

Correction: Validate both extension health and DCR association for every monitored VM.
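One way to make this validation repeatable is to check the JSON output of the two listing commands from the CLI Workflow section together. The sketch below assumes the extension list exposes `name` and `provisioningState` fields and that each association exposes `dataCollectionRuleId`, as in the sample output on this page. Verify those field names against your CLI version. The function name `coverage_gaps` is hypothetical.

```python
import json
from typing import List

# Extension names for the Linux and Windows variants of AMA.
AMA_EXTENSION_NAMES = {"AzureMonitorLinuxAgent", "AzureMonitorWindowsAgent"}


def coverage_gaps(extensions_json: str, associations_json: str) -> List[str]:
    """Return the monitoring gaps for one VM, or an empty list if none are found."""
    extensions = json.loads(extensions_json)
    associations = json.loads(associations_json)
    gaps = []

    ama = [ext for ext in extensions if ext.get("name") in AMA_EXTENSION_NAMES]
    if not ama:
        gaps.append("AMA extension not installed")
    elif not any(ext.get("provisioningState") == "Succeeded" for ext in ama):
        gaps.append("AMA extension not in Succeeded state")

    # An installed agent without any rule collects no guest logs or counters.
    if not any(assoc.get("dataCollectionRuleId") for assoc in associations):
        gaps.append("no DCR association")
    return gaps


if __name__ == "__main__":
    # Hypothetical captured output: agent installed, but no rule attached.
    extensions = '[{"name": "AzureMonitorLinuxAgent", "provisioningState": "Succeeded"}]'
    associations = "[]"
    print(coverage_gaps(extensions, associations))
```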

Mistake 2: Collecting too many event logs or counters by default¶

What happens: Ingestion cost grows quickly, and operators struggle to find the high-signal data.

Correction: Start with a small baseline of heartbeat, essential counters, and targeted Windows or Linux logs, then expand only for justified scenarios.

Mistake 3: Ignoring subscription-level platform events during outages¶

What happens: Teams investigate guest OS logs for hours even when Azure resource health already explains the interruption.

Correction: Route ResourceHealth and ServiceHealth to the same workspace and check them early in the incident flow.

Cost Notes¶

Heartbeat and core performance counters are low-cost and high-value; collect them everywhere.
Security or verbose application logs can dominate ingestion cost if sent without filters.
Use DCRs to narrow Windows Event IDs and Linux Syslog facilities instead of collecting everything by default.

Cost Considerations¶

VM observability cost usually scales with guest log breadth rather than with the core VM metric set.

Low-cost baseline
- Heartbeat
- Small set of performance counters at 60-second sampling
- Targeted System/Application events or Syslog severity filtering
Higher-cost patterns
- Broad Windows Security log collection
- Verbose application logs forwarded from the guest
- High-frequency counters with little operational value
Practical estimate
- A baseline of heartbeat plus a few counters is often measured in MB/day per VM, while verbose security or application logging can increase that by an order of magnitude.
Optimization tips
- Use one DCR baseline per environment and attach workload-specific add-on DCRs only where needed.
- Filter Linux Syslog by severity and facility.
- Restrict Windows Event collection to required channels and IDs.
- Review per-table ingestion monthly to confirm Event, Syslog, or custom logs are not crowding out the baseline signals.
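The relationship between sampling frequency and volume is simple arithmetic, sketched below in Python. Record count follows directly from counter count and sampling interval, assuming one record per counter per sample. The bytes-per-record figure is a placeholder assumption, not a measured or published value, so replace it with a number taken from your own workspace usage data. Function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def daily_records(counter_count: int, sampling_seconds: int) -> int:
    """Records per VM per day, assuming one record per counter per sample."""
    return counter_count * (SECONDS_PER_DAY // sampling_seconds)


def daily_megabytes(counter_count: int, sampling_seconds: int, assumed_bytes_per_record: int) -> float:
    """Rough daily volume per VM; the record size is an assumption to replace."""
    return daily_records(counter_count, sampling_seconds) * assumed_bytes_per_record / 1_000_000


if __name__ == "__main__":
    ASSUMED_BYTES = 500  # placeholder only; measure the real value in your workspace
    for interval in (60, 10):
        records = daily_records(3, interval)
        volume = daily_megabytes(3, interval, ASSUMED_BYTES)
        print(f"3 counters every {interval}s: {records} records/day, ~{volume:.2f} MB/day")
```

Whatever the real record size is, moving from 60-second to 10-second sampling multiplies the record count by six, which is why high-frequency counters belong only where the workload justifies them.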
