# Alert Not Firing

## 1. Summary
An Azure Monitor alert rule exists, the underlying symptom appears real, but no alert is created or no notification reaches the responders. This playbook applies to metric alerts, scheduled query alerts, and the handoff from rule evaluation to action group delivery. Use it when teams say "the threshold definitely happened" yet the incident history shows no matching alert or only a partial alert lifecycle.
Microsoft Learn troubleshooting guidance separates this scenario into three layers: the monitored signal, the alert rule evaluation logic, and the notification path. Most "did not fire" incidents are not platform outages. They are timing mismatches, aggregation misunderstandings, disabled rules, alert processing rules, or action group delivery failures.
Typical incident window: 5-15 minutes for metric alerts and 10-20 minutes for scheduled query rules, depending on evaluation frequency and window size. Time to resolution: 30 minutes to 2 hours depending on whether the miss is signal logic, rule state, suppression, or receiver delivery.
Use it when:
- A metric clearly spiked, but the metric alert did not activate.
- A log query returns rows in Logs, but the scheduled query rule did not fire.
- An alert activated in history, but email, webhook, ITSM, or automation never triggered.
- Maintenance suppression, auto-mitigation, or dimension configuration may have changed the outcome.
```mermaid
flowchart TD
    A[Alert did not fire] --> B{Did the signal truly meet rule logic?}
    B -->|No| C[Fix threshold, aggregation, window, or query]
    B -->|Yes| D{Was the rule enabled and evaluated?}
    D -->|No| E[Fix rule state or scope]
    D -->|Yes| F{Was notification suppressed or failed?}
    F -->|Suppressed| G[Review alert processing rules]
    F -->|Failed| H[Review action group delivery]
    F -->|No evidence| I[Check evaluation timing and ingestion delay]
```

## 2. Common Misreadings
| Observation | Often Misread As | Actually Means |
|---|---|---|
| Portal chart shows a spike | Alert should have fired automatically | Metric alerts evaluate aggregation, frequency, dimensions, and threshold, not screenshots of the chart. |
| Log query returns rows right now | The scheduled query rule should have fired earlier | The rule ran on an earlier window and may have seen no rows yet because of ingestion delay or different time grain. |
| Alert history is empty | Notification failed | The rule may never have activated, so the investigation must start before action groups. |
| Alert is in history but no email arrived | Rule logic is wrong | The rule may have worked and the failure is in action group delivery or suppression. |
| Dynamic threshold or dimensions were enabled | Azure Monitor is unpredictable | Those features change how the signal is evaluated and can suppress what looks obvious in a raw chart. |
| Auto-mitigated alert resolved quickly | The alert never happened | The alert may have activated and resolved before responders saw the notification. |
## 3. Competing Hypotheses
| Hypothesis | Likelihood | Key Discriminator |
|---|---|---|
| The signal never met the exact rule logic | High | Manual replay of the same aggregation, dimensions, and window disproves the assumption that the threshold was crossed. |
| The alert rule was disabled, mis-scoped, or unhealthy | High | CLI shows disabled state, wrong scopes, or unexpected rule configuration. |
| Evaluation timing or ingestion delay prevented activation | Medium | Log data appears later than the rule window, or the metric spike was shorter than the evaluation criteria. |
| An alert processing rule suppressed notifications | Medium | Alert history shows activation or expected conditions, but processing rules target the scope and time. |
| Action group delivery failed | Medium | Alert history exists but email, webhook, SMS, or automation execution failed. |
| Dimensions or dynamic thresholds changed the expected result | Low | The rule uses per-dimension evaluation or adaptive thresholds that differ from the operator's mental model. |
## 4. What to Check First

- Show metric alert rule state, scope, window, and evaluation frequency
- If it is a scheduled query rule, inspect evaluation settings and scopes
- Replay a control query against the workspace when the rule is log-based
- List the action groups wired to the rule
- Review alert processing rules that can suppress notifications
- Confirm the target resource and dimensions actually match the rule scope
## 5. Evidence to Collect

### 5.1 KQL Queries
```kusto
// Alert-related Azure Activity entries for recent rule operations
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue has_any ("Microsoft.Insights/metricAlerts", "Microsoft.Insights/scheduledQueryRules", "Microsoft.Insights/actionGroups")
| project TimeGenerated, OperationNameValue, ActivityStatusValue, ResourceGroup, ResourceId, Caller
| order by TimeGenerated desc
```
| Column | Example data | Interpretation |
|---|---|---|
| OperationNameValue | Microsoft.Insights/scheduledQueryRules/write | Confirms when the rule was created or updated. |
| ActivityStatusValue | Succeeded | A recent config change may align with the missed alert. |
| ResourceId | /subscriptions/&lt;subscription-id&gt;/resourceGroups/rg-monitor/providers/Microsoft.Insights/scheduledQueryRules/cpu-breach | Use this to confirm the exact rule under investigation. |
| Caller | &lt;object-id&gt; | Identify who or what changed the rule. |
**How to Read This**

Start with recent writes or deletes. Many missed alerts follow an innocent change to frequency, scopes, dimensions, or action groups rather than a platform malfunction.
```kusto
// Check whether action-group delivery failures were recorded
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue == "Microsoft.Insights/actionGroups/notification/action"
| summarize Attempts=count(), Failures=countif(ActivityStatusValue == "Failed") by ResourceGroup, ActivityStatusValue, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
```
| Column | Example data | Interpretation |
|---|---|---|
| Attempts | 12 | Notification pipeline was exercised. |
| Failures | 3 | Nonzero values point to delivery issues, not evaluation logic. |
| ActivityStatusValue | Failed | Failures justify checking action group endpoints and receivers. |
| TimeGenerated | hourly bucket | Correlate to the expected incident window. |
**How to Read This**

If this query is empty and alert history is also empty, stop blaming the action group. The rule probably never activated.
```kusto
// Example replay pattern for a scheduled query alert signal
Perf
| where TimeGenerated between (ago(2h) .. now())
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), Computer
| order by TimeGenerated asc
```
| Column | Example data | Interpretation |
|---|---|---|
| Computer | vm-prod-01 | Evaluate whether the rule should have fired per machine or across the fleet. |
| TimeGenerated | 5-minute bucket | Match this to the alert evaluation frequency and window. |
| AvgCPU | 91.7 | Compare to the actual rule threshold and operator. |
| Series shape | brief spike | A short spike can look severe in a chart but still fail the alert's evaluation logic. |
**How to Read This**

Replay the signal exactly as the rule evaluates it. If the rule uses 5-minute averages and two evaluation periods, a one-minute spike is not enough evidence.
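The same replay discipline can be sketched in code. This is an illustrative model of how "average over a window, require consecutive breaches" evaluation behaves, not the platform's actual algorithm: the 90% threshold, 5-minute window, and two-period breach count below are assumptions chosen to mirror the examples in this playbook.

```python
# Illustrative model of metric-alert style evaluation (assumed parameters,
# not the real Azure Monitor implementation).
from statistics import mean

def rule_fires(samples, window=5, threshold=90.0, required_breaches=2):
    """samples: one value per minute. Average each `window`-minute bucket,
    fire only when `required_breaches` consecutive buckets exceed `threshold`."""
    buckets = [mean(samples[i:i + window])
               for i in range(0, len(samples) - window + 1, window)]
    streak = 0
    for avg in buckets:
        streak = streak + 1 if avg > threshold else 0
        if streak >= required_breaches:
            return True
    return False

# A one-minute spike to 100% inside an otherwise quiet 10-minute stretch:
spike = [20, 20, 100, 20, 20, 20, 20, 20, 20, 20]
# Sustained load across two full evaluation windows:
sustained = [95] * 10

print(rule_fires(spike))      # the chart shows a spike, but the rule stays quiet
print(rule_fires(sustained))  # two consecutive breached windows -> fires
```

The point of the sketch: the spike series averages to roughly 36% over its window, so nothing fires even though the raw chart looks alarming.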
```kusto
// Estimate ingestion delay for a log-search alert source table
Perf
| where TimeGenerated > ago(6h)
| extend DelayMinutes = datetime_diff('minute', ingestion_time(), TimeGenerated)
| summarize AvgDelay=avg(DelayMinutes), P95Delay=percentile(DelayMinutes, 95), MaxDelay=max(DelayMinutes)
```
| Column | Example data | Interpretation |
|---|---|---|
| AvgDelay | 1.6 | Typical ingestion lag in minutes. |
| P95Delay | 9 | If this approaches the rule window, misses become plausible. |
| MaxDelay | 17 | Large delay spikes justify widening alert windows or fixing the source pipeline. |
**How to Read This**

Scheduled query rules are only as good as the data they can see in their evaluation window. If delay is close to or larger than that window, late data can make the rule look broken.
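A quick way to reason about this is to compare delay percentiles against the window directly. The sketch below is a back-of-envelope check under stated assumptions (nearest-rank percentile, a 50% safety margin), not an official formula:

```python
# Back-of-envelope check: is the evaluation window safe given observed
# ingestion delays? The 50% margin is an assumed rule of thumb.
import math

def percentile(values, p):
    """Nearest-rank percentile, enough for this sanity check."""
    ranked = sorted(values)
    idx = min(len(ranked) - 1, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

def window_is_safe(delay_minutes, window_minutes, margin=0.5):
    """Flag the window as unsafe when P95 ingestion delay consumes more
    than `margin` of the evaluation window."""
    return percentile(delay_minutes, 95) < window_minutes * margin

# Made-up delay samples shaped like the query output above (minutes):
delays = [1, 2, 1, 3, 2, 9, 2, 1, 17, 2]
print(window_is_safe(delays, window_minutes=10))
```

With a P95 delay of 17 minutes against a 10-minute window, the check fails, which matches the guidance above: widen the window or fix the source pipeline.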
### 5.2 CLI Investigation
```bash
# Inspect a metric alert rule
az monitor metrics alert show \
  --resource-group $RG \
  --name $ALERT_RULE_NAME \
  --output json
```
Sample output:

```json
{
  "enabled": true,
  "evaluationFrequency": "PT1M",
  "scopes": [
    "/subscriptions/<subscription-id>/resourceGroups/rg-prod/providers/Microsoft.Compute/virtualMachines/vm-prod-01"
  ],
  "severity": 2,
  "windowSize": "PT5M"
}
```
Interpretation:

- `enabled` must be true.
- `evaluationFrequency` and `windowSize` define what a real threshold crossing means.
- Incorrect `scopes` are a common explanation for "this alert never fires."
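Those two checks are easy to automate against the CLI output. The sketch below parses the JSON shown above; the field names come from that sample output, while the incident resource ID is a hypothetical value used for illustration.

```python
# Sanity-check `az monitor metrics alert show --output json` before blaming
# the platform. Field names follow the sample output; the incident resource
# ID is an illustrative assumption.
import json

def diagnose(rule_json, incident_resource_id):
    rule = json.loads(rule_json)
    problems = []
    if not rule.get("enabled"):
        problems.append("rule is disabled")
    scopes = [s.lower() for s in rule.get("scopes", [])]
    if incident_resource_id.lower() not in scopes:
        problems.append("incident resource is not in the rule's scopes")
    return problems

sample = '''{"enabled": true,
 "evaluationFrequency": "PT1M",
 "scopes": ["/subscriptions/sub-1/resourceGroups/rg-prod/providers/Microsoft.Compute/virtualMachines/vm-prod-01"],
 "windowSize": "PT5M"}'''

# Scope drift: the rule watches vm-prod-01, but the incident was on vm-prod-02.
print(diagnose(sample,
    "/subscriptions/sub-1/resourceGroups/rg-prod/providers/Microsoft.Compute/virtualMachines/vm-prod-02"))
```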
```bash
# Inspect a scheduled query rule
az monitor scheduled-query show \
  --resource-group $RG \
  --name $ALERT_RULE_NAME \
  --output json
```
Sample output:

```json
{
  "enabled": true,
  "evaluationFrequency": "PT5M",
  "scopes": [
    "/subscriptions/<subscription-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-prod"
  ],
  "windowSize": "PT10M"
}
```
Interpretation:

- Replay the rule against the same `windowSize` and `evaluationFrequency`.
- Disabled or mis-scoped rules are configuration problems, not alerting bugs.
- Keep this output beside the manual KQL replay to avoid comparing different windows.
```bash
# List alert processing rules that may suppress notifications
az monitor alert-processing-rule list \
  --resource-group $RG \
  --output json
```
Sample output:
[
{
"enabled": true,
"name": "suppress-maintenance-window",
"scopes": [
"/subscriptions/<subscription-id>/resourceGroups/rg-prod"
]
}
]
Interpretation:
- If scope and schedule overlap the incident, notifications may be intentionally suppressed.
- This does not necessarily stop the alert from evaluating; it changes downstream delivery.
- Always compare suppression scope with the rule target resource.
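The scope-and-schedule comparison is mechanical enough to script. The sketch below uses assumed rule fields (`scopes`, `start`, `end`) and illustrative timestamps; real processing rules express schedules in more elaborate ways, but the interval-overlap logic is the same.

```python
# Does a suppression rule cover this incident? Scope matching here is a
# simple resource-ID prefix test; the schedule fields are assumptions
# standing in for the real processing-rule schedule schema.
from datetime import datetime

def suppression_applies(rule, resource_id, incident_start, incident_end):
    """A rule applies when the resource sits under one of its scopes AND
    its schedule overlaps the incident interval."""
    in_scope = any(resource_id.lower().startswith(s.lower())
                   for s in rule["scopes"])
    overlaps = rule["start"] < incident_end and incident_start < rule["end"]
    return rule["enabled"] and in_scope and overlaps

rule = {
    "enabled": True,
    "scopes": ["/subscriptions/sub-1/resourceGroups/rg-prod"],
    "start": datetime(2024, 5, 1, 1, 0),
    "end": datetime(2024, 5, 1, 3, 0),
}
vm = ("/subscriptions/sub-1/resourceGroups/rg-prod"
      "/providers/Microsoft.Compute/virtualMachines/vm-prod-01")

# Incident at 02:00-02:30 on the suppressed day -> notifications suppressed.
print(suppression_applies(rule, vm,
                          datetime(2024, 5, 1, 2, 0),
                          datetime(2024, 5, 1, 2, 30)))
```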
```bash
# Inspect the action group tied to the rule
az monitor action-group show \
  --resource-group $RG \
  --name $ACTION_GROUP_NAME \
  --output json
```
Sample output:

```json
{
  "enabled": true,
  "emailReceivers": [
    {
      "name": "oncall-email",
      "status": "Enabled"
    }
  ],
  "webhookReceivers": [
    {
      "name": "incident-webhook",
      "serviceUri": "https://example.invalid/alerts"
    }
  ]
}
```
Interpretation:
- Disabled or misconfigured receivers explain missing notifications after activation.
- Check webhook endpoints and automation permissions if delivery is inconsistent.
- Keep the action group investigation separate from rule evaluation logic.
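A receiver triage pass over that JSON can be scripted too. This sketch checks only the two receiver types in the sample output above; receiver schemas vary by type, so treat the `status` and `serviceUri` checks as assumptions to adapt.

```python
# Flag suspicious receivers in `az monitor action-group show` output.
# Field names follow the sample above; the checks are illustrative, not a
# complete validation of every receiver type.
import json

def suspect_receivers(action_group_json):
    ag = json.loads(action_group_json)
    suspects = []
    if not ag.get("enabled"):
        suspects.append("action group is disabled")
    for r in ag.get("emailReceivers", []):
        if r.get("status") != "Enabled":
            suspects.append(f"email receiver '{r['name']}' is not enabled")
    for r in ag.get("webhookReceivers", []):
        if not r.get("serviceUri"):
            suspects.append(f"webhook receiver '{r['name']}' has no serviceUri")
    return suspects

sample = '''{"enabled": true,
 "emailReceivers": [{"name": "oncall-email", "status": "Disabled"}],
 "webhookReceivers": [{"name": "incident-webhook", "serviceUri": ""}]}'''

print(suspect_receivers(sample))
```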
## 6. Validation and Disproof by Hypothesis
### Hypothesis 1: The signal never met the exact rule logic

- **Proves if:** Section 5.1 Query 3 fails to cross the actual threshold across the actual window and dimensions.
- **Disproves if:** Manual replay exactly matches the rule logic and should have activated.
- **Test with:** Section 5.1 Query 3 plus Section 5.2 CLI command 1 or 2.
### Hypothesis 2: The alert rule was disabled, mis-scoped, or unhealthy

- **Proves if:** CLI shows `enabled: false`, wrong scope, or configuration drift.
- **Disproves if:** Rule state and scope are correct.
- **Test with:** Section 5.2 CLI command 1 or 2 and Section 5.1 Query 1.
### Hypothesis 3: Evaluation timing or ingestion delay prevented activation

- **Proves if:** Data arrives after the rule window or only brief spikes occur inside a longer evaluation window.
- **Disproves if:** Delay is low and the signal sustained the threshold across the required period.
- **Test with:** Section 5.1 Queries 3 and 4.
### Hypothesis 4: An alert processing rule suppressed notifications

- **Proves if:** Processing rules target the same scope and time as the missed incident.
- **Disproves if:** No relevant processing rule exists.
- **Test with:** Section 5.2 CLI command 3.
### Hypothesis 5: Action group delivery failed

- **Proves if:** Alert history or Azure Activity shows activation but notification actions failed.
- **Disproves if:** There is no evidence the alert ever activated.
- **Test with:** Section 5.1 Query 2 plus Section 5.2 CLI command 4.
### Hypothesis 6: Dimensions or dynamic thresholds changed the expected result

- **Proves if:** The rule uses dimensions or adaptive logic that differ from the operator's broad chart view.
- **Disproves if:** The rule uses simple static thresholds with no dimension splits.
- **Test with:** Section 5.2 CLI command 1 or 2 and manual signal replay per dimension.
## 7. Likely Root Cause Patterns
| Pattern | Evidence | Resolution |
|---|---|---|
| Threshold replay was done with the wrong window | Operator uses raw chart, but rule uses averaged evaluation periods | Replay with exact rule window, frequency, and dimension handling. |
| Rule was disabled or scope drifted after maintenance | Azure Activity shows a recent write and CLI reveals changed settings | Restore the intended enabled state and target scope. |
| Scheduled query alert window was too short for ingestion delay | KQL results appear after the expected evaluation window | Increase evaluation window or fix source-side delay. |
| Alert processing rule suppressed expected notification | Rule history exists or conditions were met, but suppression scope matches the resource | Narrow or disable the suppression during the relevant period. |
| Action group receiver failed after activation | Notification attempts show failures while rule configuration is otherwise correct | Repair receiver configuration, endpoint health, or permissions. |
### Normal vs Abnormal Comparison
| Metric/Log | Normal State | Abnormal State | Threshold |
|---|---|---|---|
| Metric alert evaluation | Signal exceeds threshold across the configured windowSize and evaluationFrequency | Signal spikes briefly but does not satisfy the actual rule window | Rule-specific |
| Scheduled query alert data freshness | Source table data arrives inside the evaluation window | Ingestion delay pushes rows outside the evaluation window | Delay near windowSize |
| Alert rule state | enabled is true and scope matches intended target | Disabled rule or wrong scope/resource | Any mismatch |
| Alert processing rules | No overlapping suppression for the incident scope and time | Scope and schedule suppress expected notifications | Any overlapping suppress rule |
| Action group delivery | Receivers enabled and reachable | Alert activates but notifications fail or never leave the action group | Any failed delivery |
## 8. Immediate Mitigations

- Re-enable the alert rule if it was disabled.
- Widen the scheduled query alert window when late data is the issue.
- Disable or narrow an overlapping alert processing rule.
- Repair action group receivers and revalidate delivery.
- If dimensions caused the miss, temporarily simplify to the exact monitored target.
## 9. Prevention
Prevent missed alerts by validating the whole alert path, not just the threshold. Replay the rule logic against real historical data before relying on it in production.
Keep alert rule configuration review in change management.
```bash
az monitor scheduled-query show \
  --resource-group $RG \
  --name $ALERT_RULE_NAME \
  --query "{enabled:enabled,scopes:scopes,evaluationFrequency:evaluationFrequency,windowSize:windowSize}"
```
Avoid windows shorter than normal ingestion latency for log alerts.
```bash
az monitor scheduled-query update \
  --resource-group $RG \
  --name $ALERT_RULE_NAME \
  --window-size 15m
```
Keep suppression rules narrow in scope and duration.
Test action groups as part of onboarding and after receiver changes.
Finally, document whether each alert is per-resource, per-dimension, or fleet-wide. Most operator confusion comes from expecting one model while the rule actually uses another.
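The difference between those models is worth making concrete. In the sketch below, a fleet-wide average never fires while a per-dimension evaluation catches the one hot machine; the threshold and samples are illustrative assumptions.

```python
# Same samples, two evaluation models: fleet-wide average vs per-dimension.
# Threshold and data are illustrative assumptions.
from statistics import mean

samples = {
    "vm-prod-01": [97, 96, 98],   # one hot machine
    "vm-prod-02": [10, 12, 11],
    "vm-prod-03": [15, 14, 13],
}
THRESHOLD = 90.0

# Fleet-wide: average every sample together, then compare once.
fleet_avg = mean(v for vals in samples.values() for v in vals)
fleet_fires = fleet_avg > THRESHOLD

# Per-dimension: evaluate each machine's own average independently.
per_dimension_fires = {vm: mean(vals) > THRESHOLD
                       for vm, vals in samples.items()}

print(fleet_fires)            # the fleet average hides the hot machine
print(per_dimension_fires)    # only the per-dimension model catches vm-prod-01
```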
## Sources
- Microsoft Learn: Troubleshoot Azure Monitor alerts
- Microsoft Learn: Troubleshoot metric alerts in Azure Monitor
- Microsoft Learn: Alerts processing rules in Azure Monitor
- Microsoft Learn: Action groups in Azure Monitor
- Microsoft Learn: Log search alerts in Azure Monitor
- Microsoft Learn: Perf table reference