Agent Not Reporting

1. Summary

An Azure Monitor Agent (AMA) extension appears installed on a VM or Arc-enabled server, but the machine stops sending heartbeats, performance counters, event logs, or syslog records to the Log Analytics workspace. This playbook applies when the problem is tied to AMA-collected data and you need to prove whether the failure is in extension health, DCR association, managed identity, guest runtime state, or outbound connectivity.

The key Microsoft Learn guidance is that a successful extension deployment does not prove the agent is healthy at runtime. AMA needs a valid data collection rule, access to identity and configuration endpoints, local service health, and network reachability to Azure Monitor. Use this playbook when Heartbeat is stale, when a newly onboarded VM never reports, or when only AMA-driven tables are missing while platform logs still arrive.

Typical incident window: 10-20 minutes from first missed heartbeat to human detection when stale-heartbeat alerting is in place. Time to resolution: 30 minutes to 2 hours depending on whether the break is DCR association, identity, guest runtime, or egress.

Use it when:

Heartbeat is stale for one VM, a scale set, or an Arc server fleet.
The extension says Succeeded, but no new data reaches Perf, InsightsMetrics, or event tables.
Data stopped after network changes, identity changes, or DCR rollout changes.
Only AMA-collected tables are missing; other workspace tables still update.

flowchart TD
    A[AMA not reporting] --> B{Heartbeat stale?}
    B -->|No| J[Check DCR streams for the missing table]
    B -->|Yes| C[Check extension and service health]
    C --> D{DCR association exists?}
    D -->|No| E[Create association]
    D -->|Yes| F{Identity and endpoint access healthy?}
    F -->|No| G[Fix identity or network]
    F -->|Yes| H[Inspect guest AMA logs]
    H --> I[Repair local runtime or config]
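The flowchart's decision points can be encoded as a small shell function. This is useful when several responders triage machines in parallel. It is a minimal sketch: `ama_next_step` is a hypothetical helper that only maps yes/no answers to the next action and does not query Azure. The fresh-heartbeat branch maps to Hypothesis 5 (stream coverage).

```shell
#!/usr/bin/env bash
# Hypothetical triage helper mirroring the flowchart. It does not call Azure;
# it only turns the answers you have collected into the next step.
#   $1  heartbeat stale?                  yes|no
#   $2  DCR association exists?           yes|no
#   $3  identity and endpoints healthy?   yes|no
ama_next_step() {
  local stale="$1" assoc="$2" access="$3"
  if [[ "$stale" != "yes" ]]; then
    # Heartbeat is current, so the gap is more likely a stream-coverage issue.
    echo "Heartbeat is fresh: compare DCR streams with the missing table"
  elif [[ "$assoc" != "yes" ]]; then
    echo "Create association"
  elif [[ "$access" != "yes" ]]; then
    echo "Fix identity or network"
  else
    echo "Inspect guest AMA logs, then repair local runtime or config"
  fi
}
```

For example, `ama_next_step yes yes no` prints `Fix identity or network`.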

2. Common Misreadings

Observation	Often Misread As	Actually Means
VM extension state is `Succeeded`	AMA is healthy	Extension deployment succeeded, but the service may still fail after startup.
`Heartbeat` is empty for one machine	Workspace ingestion is broken	The issue is usually machine-specific: DCR, identity, service state, or egress.
DCR exists in Azure	The VM must be using it	The resource also needs a valid DCR association.
Platform logs for the VM still arrive	AMA is fine	Platform logs use a different pipeline than AMA-collected guest telemetry.
Data stopped after firewall changes	AMA bug	AMA depends on documented Azure Monitor and IMDS endpoints that may now be blocked.
One VM in a subnet stopped reporting while others are healthy	Random host issue	Compare DCR association, identity, and local guest logs before assuming corruption.

3. Competing Hypotheses

Hypothesis	Likelihood	Key Discriminator
No DCR association exists for the resource	High	`az monitor data-collection rule association list` returns no association for the VM or Arc resource.
AMA extension or service runtime is unhealthy	High	Extension is missing, failed, or guest logs show local service errors.
Managed identity or access token flow is broken	Medium	Guest logs show authentication or authorization errors and the VM identity is missing or changed.
Endpoint access to IMDS or Azure Monitor is blocked	Medium	DCR exists, extension exists, but the source cannot reach required endpoints.
DCR is present but configured for different streams than expected	Medium	Heartbeat may exist, but the missing table is not included in the DCR data flows.
Workspace-side issue is being blamed incorrectly	Low	Other AMA machines report normally to the same workspace.

4. What to Check First

Confirm the VM identity state used by AMA-related calls

az vm show \
    --resource-group $RG \
    --name $VM_NAME \
    --query "identity"

Confirm AMA extension deployment and publisher details

az vm extension list \
    --resource-group $RG \
    --vm-name $VM_NAME \
    --query "[].{name:name,provisioningState:provisioningState,publisher:publisher}"

Query the workspace for current heartbeat state before changing the VM

az monitor log-analytics query \
    --workspace $WORKSPACE_ID \
    --analytics-query "Heartbeat | where TimeGenerated > ago(1d) | summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId | order by LastHeartbeat asc" \
    --timespan "P1D"

List DCR associations on the affected resource

az monitor data-collection rule association list \
    --resource $RESOURCE_ID \
    --output json

Inspect the DCR data flows and destination workspace

az monitor data-collection rule show \
    --resource-group $RG \
    --name $DCR_NAME \
    --output json

If Arc is involved, inspect the connected machine resource state

az connectedmachine show \
    --resource-group $RG \
    --name $MACHINE_NAME \
    --query "{status:status,location:location,id:id}"
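Later mitigations change the VM, so it helps to snapshot these first checks before you touch anything. This is a minimal sketch. It assumes `RG`, `VM_NAME`, `WORKSPACE_ID` (the workspace customer ID) and `RESOURCE_ID` are already set as in the commands above. `collect_ama_evidence` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: save the "check first" outputs into one folder so the
# pre-change state is preserved. Reuses the exact commands shown above.
collect_ama_evidence() {
  local out_dir="${1:-ama-evidence-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$out_dir" || return 1

  # Identity state (Hypothesis 3).
  az vm show --resource-group "$RG" --name "$VM_NAME" \
    --query "identity" --output json > "$out_dir/identity.json"

  # Extension name, publisher, and provisioning state (Hypothesis 2).
  az vm extension list --resource-group "$RG" --vm-name "$VM_NAME" \
    --query "[].{name:name,provisioningState:provisioningState,publisher:publisher}" \
    --output json > "$out_dir/extensions.json"

  # DCR associations (Hypothesis 1).
  az monitor data-collection rule association list \
    --resource "$RESOURCE_ID" --output json > "$out_dir/associations.json"

  # Current heartbeat state in the workspace.
  az monitor log-analytics query --workspace "$WORKSPACE_ID" \
    --analytics-query "Heartbeat | where TimeGenerated > ago(1d) | summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId | order by LastHeartbeat asc" \
    --timespan "P1D" > "$out_dir/heartbeat.json"

  # Print the folder so callers can capture it.
  echo "$out_dir"
}
```

Run `collect_ama_evidence` with no argument to get a timestamped folder, or pass a path. A failed `az` call still leaves an empty file, so the snapshot shows which check could not run.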

5. Evidence to Collect

5.1 KQL Queries

// Last heartbeat per computer
Heartbeat
| where TimeGenerated > ago(3d)
| summarize LastHeartbeat=max(TimeGenerated) by Computer, OSType, _ResourceId
| order by LastHeartbeat asc
| take 30

Column	Example data	Interpretation
`Computer`	`vm-prod-02`	Target machine for guest-side inspection.
`OSType`	`Windows`	Determines which local service name and log path to use.
`_ResourceId`	`/subscriptions/<subscription-id>/resourceGroups/rg-prod/providers/Microsoft.Compute/virtualMachines/vm-prod-02`	Use this value in DCR association checks.
`LastHeartbeat`	`2026-04-05T05:42:00Z`	Stale values prove the liveness path is broken.

How to Read This

Sort oldest first. If multiple stale machines share a subnet, policy assignment, or DCR, investigate the shared dependency before treating each VM as an isolated incident.
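To rank machines by how long they have been silent, you can turn the `LastHeartbeat` values into minute gaps locally. This is a sketch that assumes GNU `date` (on macOS, use `gdate` from coreutils). The 5-minute threshold mirrors the `> 5 min gap` row in the Normal vs Abnormal table. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. Requires GNU date for ISO 8601 parsing with -d.
# minutes_since LAST [NOW]  -> whole minutes between LAST and NOW (default: now, UTC)
minutes_since() {
  local last="$1" now="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  local last_s now_s
  last_s=$(date -u -d "$last" +%s) || return 1
  now_s=$(date -u -d "$now" +%s) || return 1
  echo $(( (now_s - last_s) / 60 ))
}

# is_heartbeat_stale LAST [NOW] -> exit 0 when the gap exceeds 5 minutes,
# 1 when fresh, 2 when a timestamp cannot be parsed.
is_heartbeat_stale() {
  local gap
  gap=$(minutes_since "$@") || return 2
  (( gap > 5 ))
}
```

For example, `minutes_since 2026-04-05T05:42:00Z 2026-04-05T07:05:00Z` prints `83`.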

// Compare AMA freshness against other workspace activity
union isfuzzy=true
    (Heartbeat | summarize LastSeen=max(TimeGenerated), Rows=count() by TableName="Heartbeat"),
    (AzureActivity | summarize LastSeen=max(TimeGenerated), Rows=count() by TableName="AzureActivity"),
    (Operation | summarize LastSeen=max(TimeGenerated), Rows=count() by TableName="Operation")
| extend MinutesSinceLastSeen = datetime_diff('minute', now(), LastSeen)
| order by MinutesSinceLastSeen desc

Column	Example data	Interpretation
`TableName`	`Heartbeat`	AMA-driven signal.
`LastSeen`	`2026-04-05T06:00:00Z`	If stale while other tables are fresh, the workspace is not the main issue.
`Rows`	`15243`	History exists even if current ingestion is failing.
`MinutesSinceLastSeen`	`83`	Large gap on `Heartbeat` only points back to source or AMA path.

How to Read This

This query disproves the statement "the workspace is down" when AzureActivity and Operation remain current. That saves time by keeping the investigation on the AMA path.

// Stale heartbeat concentration by resource group
Heartbeat
| where TimeGenerated > ago(3d)
| summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId
| where LastHeartbeat < ago(15m)
| summarize LastHeartbeat=max(LastHeartbeat), Computers=dcount(Computer) by ResourceGroup=extract(@"(?i)resourcegroups/([^/]+)/", 1, _ResourceId)
| order by Computers desc

Column	Example data	Interpretation
`ResourceGroup`	`rg-prod-eastus`	Shared administrative boundary.
`LastHeartbeat`	`2026-04-05T04:58:00Z`	Most recent heartbeat among the stale machines in the group. If many machines went silent around the same time, look for a shared policy or rollout change.
`Computers`	`18`	Number of stale machines in the group. Higher counts support a shared root cause.

How to Read This

A resource-group cluster of failures usually means a shared DCR, policy, or network dependency changed. A single-host failure usually means local runtime or identity drift.

// Estimate ingestion delay for recent heartbeats where data still flows
Heartbeat
| where TimeGenerated > ago(6h)
| extend DelayMinutes = datetime_diff('minute', ingestion_time(), TimeGenerated)
| summarize AvgDelay=avg(DelayMinutes), P95Delay=percentile(DelayMinutes, 95), MaxDelay=max(DelayMinutes) by OSType

Column	Example data	Interpretation
`OSType`	`Linux`	Compare platform-specific delay patterns.
`AvgDelay`	`1.1`	Healthy heartbeat flow should remain low.
`P95Delay`	`14`	High delay can make the agent look down when it is only degraded.
`MaxDelay`	`31`	Large spikes justify checking endpoint latency and backlog.

How to Read This

Use this to avoid false positives. If stale data later appears, you may have a degraded transport path instead of a dead agent.

5.2 CLI Investigation

# Check VM identity and provisioning basics
az vm show \
    --resource-group $RG \
    --name $VM_NAME \
    --output json

Sample output:

{
  "id": "/subscriptions/<subscription-id>/resourceGroups/rg-prod/providers/Microsoft.Compute/virtualMachines/vm-prod-02",
  "identity": {
    "type": "SystemAssigned",
    "principalId": "<object-id>",
    "tenantId": "<tenant-id>"
  },
  "location": "eastus",
  "name": "vm-prod-02"
}

Interpretation:

AMA on Azure VMs authenticates with a managed identity, so a missing identity blocks token acquisition. If a user-assigned identity is used, the AMA extension settings must reference it.
Identity drift after rebuilds or automation changes is a common hidden cause.
Capture this before spending time on guest-side remediation.
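The identity check can be reduced to a pass/fail test for Hypothesis 3. This is a sketch with hypothetical function names. It assumes the `identity.type` values Azure returns (`SystemAssigned`, `UserAssigned`, or both) and treats an empty or `None` result as a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for Hypothesis 3.
# ama_identity_type RG VM -> prints identity.type (empty when no identity)
ama_identity_type() {
  az vm show --resource-group "$1" --name "$2" \
    --query "identity.type" --output tsv
}

# ama_identity_ok RG VM -> exit 0 when a system- or user-assigned identity exists
ama_identity_ok() {
  local id_type
  id_type=$(ama_identity_type "$1" "$2") || return 2
  case "$id_type" in
    *SystemAssigned*|*UserAssigned*) return 0 ;;
    *) return 1 ;;  # empty, "None", or anything unexpected
  esac
}
```

A failing `ama_identity_ok "$RG" "$VM_NAME"` points to the identity row in the root cause patterns before any guest-side work.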

# Confirm AMA extension deployment on the VM
az vm extension list \
    --resource-group $RG \
    --vm-name $VM_NAME \
    --output table

Sample output:

Name                      Publisher                ProvisioningState
------------------------  -----------------------  -----------------
AzureMonitorWindowsAgent  Microsoft.Azure.Monitor  Succeeded

Interpretation:

No AMA row means there is no agent to troubleshoot.
Succeeded narrows the problem to runtime state, DCR, identity, or network.
Extension name differs by OS, so compare against the expected Windows or Linux package.
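The extension name differs by OS, so a helper that picks the expected name avoids comparing against the wrong package. This is a sketch that assumes the default extension names used in this playbook. `expected_ama_extension` and `ama_extension_state` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assume default extension names (Bash 4+ for ${var,,}).
expected_ama_extension() {
  case "${1,,}" in
    windows) echo "AzureMonitorWindowsAgent" ;;
    linux)   echo "AzureMonitorLinuxAgent" ;;
    *)       return 1 ;;
  esac
}

# ama_extension_state RG VM OS -> provisioning state, or "Missing" if absent
ama_extension_state() {
  local rg="$1" vm="$2" os="$3" ext state
  ext=$(expected_ama_extension "$os") || return 2
  state=$(az vm extension list --resource-group "$rg" --vm-name "$vm" \
    --query "[?name=='$ext'].provisioningState | [0]" --output tsv)
  echo "${state:-Missing}"
}
```

`Missing` sends you to the reinstall mitigation. `Succeeded` keeps the investigation on runtime, DCR, identity, or network.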

# Check DCR association for the affected resource
az monitor data-collection rule association list \
    --resource $RESOURCE_ID \
    --output json

Sample output:

[
  {
    "dataCollectionRuleId": "/subscriptions/<subscription-id>/resourceGroups/rg-monitor/providers/Microsoft.Insights/dataCollectionRules/dcr-vm-baseline",
    "name": "ama-baseline",
    "provisioningState": "Succeeded"
  }
]

Interpretation:

Empty result means the VM cannot receive AMA collection instructions.
If the wrong DCR is associated, data may flow but not for the expected streams.
Use the returned DCR ID in the next command to inspect data flows.
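Hypothesis 1 reduces to whether the association list contains any DCR. This is a sketch with hypothetical function names. It filters on `dataCollectionRuleId`, so associations that only reference a data collection endpoint are not counted as DCR associations.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for Hypothesis 1.
# ama_dcr_ids RESOURCE_ID -> one associated DCR ID per line (empty when none)
ama_dcr_ids() {
  az monitor data-collection rule association list \
    --resource "$1" \
    --query "[?dataCollectionRuleId].dataCollectionRuleId" \
    --output tsv
}

# ama_has_association RESOURCE_ID -> exit 0 when at least one DCR is associated
ama_has_association() {
  local ids
  ids=$(ama_dcr_ids "$1") || return 2
  [[ -n "$ids" ]]
}
```

Feed each ID printed by `ama_dcr_ids` into the DCR inspection step.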

# Inspect DCR streams and destination workspace
az monitor data-collection rule show \
    --resource-group $RG \
    --name $DCR_NAME \
    --output json

Sample output:

{
  "dataFlows": [
    {
      "streams": [
        "Microsoft-Perf",
        "Microsoft-Event",
        "Microsoft-Syslog"
      ],
      "destinations": [
        "la-workspace"
      ]
    }
  ],
  "destinations": {
    "logAnalytics": [
      {
        "workspaceResourceId": "/subscriptions/<subscription-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-prod"
      }
    ]
  }
}

Interpretation:

If the missing stream is not listed, the agent may be healthy but collecting the wrong data set.
Wrong workspace destination explains why a machine appears silent in the expected workspace.
DCR content also helps prove whether this is one-machine drift or a fleet-wide configuration issue.
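Stream coverage for Hypothesis 5 can be checked directly from the DCR. This is a sketch that assumes the flattened `dataFlows` output shape shown above. `dcr_has_stream` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper for Hypothesis 5.
# dcr_has_stream RG DCR_NAME STREAM -> exit 0 when any dataFlow lists STREAM
dcr_has_stream() {
  local rg="$1" dcr="$2" stream="$3"
  # Flatten every dataFlows[].streams entry to one line, then match exactly.
  az monitor data-collection rule show \
    --resource-group "$rg" --name "$dcr" \
    --query "dataFlows[].streams[]" --output tsv \
    | grep -qxF "$stream"
}
```

For example, `dcr_has_stream "$RG" "$DCR_NAME" Microsoft-Syslog` fails when Syslog is missing from every data flow.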

6. Validation and Disproof by Hypothesis

Hypothesis 1: No DCR association exists for the resource

Proves if: Section 5.2 CLI command 3 returns no association or a failed association.

Disproves if: A valid association exists and points to the intended DCR.

Test with: Section 5.2 CLI command 3.

Hypothesis 2: AMA extension or service runtime is unhealthy

Proves if: Section 5.1 Query 1 shows a stale heartbeat and guest logs show service startup, plugin, or configuration errors.

Disproves if: Heartbeat is fresh or guest logs show a healthy runtime.

Test with: Section 5.1 Query 1 plus Section 5.2 CLI command 2, then inspect guest logs in the Microsoft Learn-documented AMA log locations (for example, `/var/opt/microsoft/azuremonitoragent/log/` on Linux and the extension logs under `C:\WindowsAzure\Logs\Plugins\` on Windows).

Hypothesis 3: Managed identity or token flow is broken

Proves if: Section 5.2 CLI command 1 shows missing or changed identity and AMA guest logs show authentication failures.

Disproves if: Identity is present and guest logs show successful configuration retrieval.

Test with: Section 5.2 CLI command 1 and guest-side AMA logs.

Hypothesis 4: Endpoint access to IMDS or Azure Monitor is blocked

Proves if: DCR and extension are present but the guest cannot reach IMDS or required Azure Monitor endpoints.

Disproves if: Endpoint checks succeed and another hypothesis fits better.

Test with: Section 5.2 CLI commands 3 and 4 for expected config, then validate network reachability from the guest host.
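Network reachability has to be tested from inside the guest, not from your workstation. This is a guest-side sketch with hypothetical function names, and several points are assumptions to confirm:

- The host list is based on the documented AMA firewall requirements (global and regional `handler.control.monitor.azure.com` plus `<workspace-id>.ods.opinsights.azure.com`). Confirm it against current Microsoft Learn guidance, especially with private links.
- `curl` without `--fail` succeeds on any HTTP response, so this tests DNS, TCP, and TLS reachability rather than authorization.
- Arc-enabled servers use a local Hybrid Instance Metadata Service instead of IMDS, so skip `check_imds` there.

```shell
#!/usr/bin/env bash
# Hypothetical guest-side helpers for Hypothesis 4. Run on the affected VM.
# ama_endpoints REGION WORKSPACE_ID -> assumed AMA host names, one per line
ama_endpoints() {
  local region="$1" workspace_id="$2"
  echo "global.handler.control.monitor.azure.com"
  echo "${region}.handler.control.monitor.azure.com"
  echo "${workspace_id}.ods.opinsights.azure.com"
}

# IMDS answers only inside Azure VMs and requires the Metadata header.
check_imds() {
  curl -sS -o /dev/null --max-time 5 -H "Metadata: true" \
    "http://169.254.169.254/metadata/instance?api-version=2021-02-01"
}

# Succeeds when DNS, TCP, and TLS work; any HTTP status counts as reachable.
check_https() {
  curl -sS -o /dev/null --max-time 10 "https://$1"
}

check_ama_egress() {
  local host
  if check_imds; then echo "IMDS reachable"; else echo "IMDS UNREACHABLE"; fi
  for host in $(ama_endpoints "$1" "$2"); do
    if check_https "$host"; then echo "$host reachable"; else echo "$host UNREACHABLE"; fi
  done
}
```

Run `check_ama_egress eastus "$WORKSPACE_ID"` on the VM, using the VM's region and the workspace customer ID.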

Hypothesis 5: DCR exists but does not contain the expected streams

Proves if: Heartbeat is current, but the stream for the missing data type is absent from `dataFlows[].streams`.

Disproves if: The stream is present and properly routed.

Test with: Section 5.2 CLI command 4.

Hypothesis 6: Workspace-side issue is being blamed incorrectly

Proves if: Other machines still report to the same workspace and Section 5.1 Query 2 shows other workspace tables are current.

Disproves if: Many unrelated senders are stale too.

Test with: Section 5.1 Queries 2 and 3.

7. Likely Root Cause Patterns

Pattern	Evidence	Resolution
VM onboarded with AMA but never associated to a DCR	Extension exists, heartbeat absent, association list is empty	Create the association and confirm the correct workspace destination.
Shared DCR changed and removed a required stream	Heartbeat exists but one table stopped after DCR modification	Restore the needed stream or attach the correct DCR.
Firewall or proxy change blocked required endpoints	Multiple machines in one subnet stopped together and guest logs show reachability failures	Restore access to IMDS and Azure Monitor endpoints.
Identity drift after rebuild or policy change	VM identity missing or new principal is not functioning for AMA	Re-enable managed identity and verify token retrieval.
Guest runtime corruption or service failure	Extension is present but guest logs show repeated startup errors	Repair or redeploy the AMA extension and restart the service.

Normal vs Abnormal Comparison

Metric/Log	Normal State	Abnormal State	Threshold
`Heartbeat` cadence	New row roughly every minute per healthy machine	No fresh row for the monitored machine	> 5 min gap
AMA extension state	Extension present with `Succeeded` provisioning state	Extension missing, failed, or wrong publisher/name	Any non-healthy state
DCR association	At least one expected association exists for the resource	Association list is empty	Zero expected associations
DCR stream coverage	Required streams such as `Microsoft-Perf`, `Microsoft-Event`, or `Microsoft-Syslog` are present	Missing stream explains missing downstream table	Missing required stream
Workspace control query	Other machines still emit fresh heartbeat to the same workspace	Many unrelated machines are also stale	Shared-staleness across fleet

8. Immediate Mitigations

Recreate the missing DCR association.

az monitor data-collection rule association create \
    --name ama-baseline \
    --resource $RESOURCE_ID \
    --rule-id $DCR_ID

Reinstall or update AMA if the extension is missing or damaged. The example targets Linux; for Windows, use `--name AzureMonitorWindowsAgent`.

az vm extension set \
    --resource-group $RG \
    --vm-name $VM_NAME \
    --publisher Microsoft.Azure.Monitor \
    --name AzureMonitorLinuxAgent \
    --enable-auto-upgrade true

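Restart AMA after configuration repair. The example restarts the Linux service; on Windows, use `--command-id RunPowerShellScript` with `Restart-Service AzureMonitorAgent`.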

az vm run-command invoke \
    --resource-group $RG \
    --name $VM_NAME \
    --command-id RunShellScript \
    --scripts "sudo systemctl restart azuremonitoragent"
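The command above covers only Linux, so an OS-aware wrapper keeps the restart step consistent across a mixed fleet. This is a sketch that assumes the default service names (`azuremonitoragent` on Linux, `AzureMonitorAgent` on Windows). `restart_ama` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: choose run-command type and script by guest OS.
# restart_ama RG VM OS   (OS: linux|windows; Bash 4+ for ${var,,})
restart_ama() {
  local rg="$1" vm="$2" os="$3" cmd script
  case "${os,,}" in
    linux)   cmd="RunShellScript";      script="systemctl restart azuremonitoragent" ;;
    windows) cmd="RunPowerShellScript"; script="Restart-Service -Name AzureMonitorAgent" ;;
    *)       echo "unknown OS: $os" >&2; return 1 ;;
  esac
  # Run-command executes with elevated rights in the guest.
  az vm run-command invoke \
    --resource-group "$rg" --name "$vm" \
    --command-id "$cmd" --scripts "$script"
}
```

The `OSType` column from the heartbeat query supplies the third argument.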

Restore system-assigned identity when it was removed.

az vm identity assign \
    --resource-group $RG \
    --name $VM_NAME

If the issue is a stream mismatch, review the current data flows first, then attach the correct DCR or restore the missing streams.

az monitor data-collection rule show \
    --resource-group $RG \
    --name $DCR_NAME \
    --query "dataFlows"

9. Prevention

Prevent AMA reporting failures by treating onboarding as a four-part contract: extension, identity, DCR association, and endpoint access. Missing any one of them causes silent collection gaps.

Audit DCR associations regularly for monitored machines.

az monitor data-collection rule association list \
    --resource $RESOURCE_ID \
    --output table
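The single-resource audit above scales to a resource group with a loop. This is a sketch with a hypothetical function name. It prints the resource ID of every VM with no DCR association. For large fleets, Azure Resource Graph may be a better fit than one CLI call per VM.

```shell
#!/usr/bin/env bash
# Hypothetical audit: list VMs in a resource group with zero DCR associations.
vms_without_dcr() {
  local rg="$1" id count
  for id in $(az vm list --resource-group "$rg" --query "[].id" --output tsv); do
    # Count only associations that reference a DCR (not endpoint-only ones).
    count=$(az monitor data-collection rule association list \
      --resource "$id" --query "length([?dataCollectionRuleId])" --output tsv)
    if [[ "${count:-0}" -eq 0 ]]; then
      echo "$id"
    fi
  done
}
```

Schedule `vms_without_dcr "$RG"` after each rollout and treat any output as an onboarding gap.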

Keep identity configuration explicit in VM build templates and post-deployment validation.

az vm show \
    --resource-group $RG \
    --name $VM_NAME \
    --query "identity.type"

When firewalls or private routing change, list the DCR destinations so you know which workspace endpoints must stay reachable, then validate network access to Azure Monitor and IMDS from the guest.

az monitor data-collection rule show \
    --resource-group $RG \
    --name $DCR_NAME \
    --query "destinations"

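Create a stale-heartbeat alert so a failed agent path is detected quickly. `--scopes` expects the workspace resource ID, not the workspace (customer) ID used by `az monitor log-analytics query`. The query takes each machine's latest heartbeat within the one-hour window and returns machines that have been silent for more than fifteen minutes. A machine silent for longer than the window drops out of the results.

az monitor scheduled-query create \
    --name "ama-heartbeat-stale" \
    --resource-group "$RG" \
    --scopes "$WORKSPACE_RESOURCE_ID" \
    --condition "count 'StaleHeartbeat' > 0" \
    --condition-query "StaleHeartbeat=Heartbeat | summarize LastHeartbeat=max(TimeGenerated) by Computer | where LastHeartbeat < ago(15m)" \
    --evaluation-frequency "5m" \
    --window-size "1h" \
    --severity 2 \
    --skip-query-validation true \
    --description "Trigger when a machine that sent AMA heartbeats in the last hour has been silent for more than fifteen minutes." \
    --output json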

Finally, review guest log paths during runbook design. Microsoft Learn explicitly directs troubleshooters to local AMA logs because workspace queries alone cannot explain why a machine never started sending.
