Slow Query Performance¶

1. Summary¶

This playbook applies when Log Analytics, workbook, or scheduled query rule KQL is materially slower than expected, intermittently hangs, or times out before operators get usable evidence. The default misdiagnosis is to blame Azure Monitor itself, but Microsoft Learn guidance shows slow queries usually trace back to a smaller set of causes: too much data scanned, weak predicates, expensive operators placed too early, high-cardinality joins or summarizes, or workbook and cross-workspace designs that silently expand scope.

Use this playbook when a query is functionally correct but operationally unhealthy. That includes slow workbook visuals, heavy alert-rule evaluation, ad hoc Logs queries that work only on very small windows, and repeated use of search *, broad contains, or wide join / union patterns on hot tables such as AzureDiagnostics, ContainerLogV2, AppTraces, and AppRequests.

Typical incident window: 10-30 minutes from first timeout or slow workbook report to confirmation that the issue is query shape, data volume, or true workspace degradation. Time to resolution: 30 minutes to 2 hours depending on whether remediation is a query rewrite, reduced scope, or ingestion/data-shape change.

flowchart TD
    A[Symptom: slow or timing-out KQL query] --> B{Is the time range broad?}
    B -->|Yes| H1[H1: Query scans too much data]
    B -->|No| C{Are parse or join operators expensive?}
    C -->|Yes| H2[H2: Query shape is inefficient]
    C -->|No| D{Is only one table or workload affected?}
    D -->|Yes| H3[H3: Table-specific volume or schema issue]
    D -->|No| E{Are narrow control queries also slow?}
    E -->|Yes| H4[H4: Workspace or service-side issue]
    E -->|No| F{Does workbook or cross-workspace scope expand the query?}
    F -->|Yes| H5[H5: Workbook or federation overhead]
    F -->|No| H6[H6: Weak predicates or high-cardinality grouping]

2. Common Misreadings¶

Observation	Often Misread As	Actually Means
Query times out at 10 minutes	Log Analytics is down	The query may be scanning too much data or performing late filtering before timeout is reached.
Workbook chart is slow but a simple `take 10` query is fast	Portal issue only	Workbook logic, parameters, or joins are expensive even if the portal shell is healthy.
`search *` finds the answer reliably	It is a safe default for troubleshooting	It forces broad scans and is usually the most expensive starting point.
`contains` returns the right rows	It is as efficient as `has` or `==`	Broad string operators are often much less selective and can inflate scan cost.
One saved query is slow while another is fine	Random variance	Saved queries usually differ in time range, predicate selectivity, join cardinality, or summarize strategy.
A query returning only a few rows cannot be expensive	Small results mean small work	The engine may still scan gigabytes before producing a tiny output set.

3. Competing Hypotheses¶

Hypothesis	Likelihood	Key Discriminator
H1: The query scans too much data for the selected time range	High	Usage and row-count checks spike for the same interval or table the query targets.
H2: The KQL shape is inefficient	High	Moving `where TimeGenerated`, resource filters, and `project` earlier reduces duration and scanned data materially.
H3: One table has unexpectedly high volume or expensive dynamic fields	Medium	Slowdown is concentrated on one table such as `AzureDiagnostics`, `ContainerLogV2`, `AppTraces`, or `AppRequests`.
H4: Workspace or service-side health issue exists	Low-Medium	Even simple narrow control queries are slow across unrelated tables and users.
H5: Workbook, alert-rule, or cross-workspace design adds overhead	Medium	The same logic is faster when tested directly in Logs or against one workspace only.
H6: Nonselective text search or high-cardinality grouping is the real bottleneck	Medium	Replacing `search` / `contains` or aggregating on coarser dimensions changes runtime sharply.

4. What to Check First¶

Capture the exact workspace identity before replaying the query

az monitor log-analytics workspace show \
    --resource-group "$RG" \
    --workspace-name "$WORKSPACE_NAME" \
    --query "{id:id,customerId:customerId,sku:sku.name,retentionInDays:retentionInDays,publicNetworkAccessForQuery:publicNetworkAccessForQuery}"

Run a narrow control query against the workspace

az monitor log-analytics query \
    --workspace "$WORKSPACE_ID" \
    --analytics-query "Heartbeat | where TimeGenerated > ago(15m) | summarize LastHeartbeat=max(TimeGenerated) by Computer | take 5" \
    --timespan "PT15M"

Replay recent usage for the heaviest tables

az monitor log-analytics query \
    --workspace "$WORKSPACE_ID" \
    --analytics-query "Usage | where TimeGenerated > ago(24h) | summarize TotalGB=round(sum(Quantity)/1024.0,2) by DataType | order by TotalGB desc | take 10" \
    --timespan "P1D"

If the query powers an alert, inspect the scheduled query rule cadence

az monitor scheduled-query show \
    --resource-group "$RG" \
    --name "$ALERT_RULE_NAME" \
    --output json

Inspect table settings when one hot table dominates the query path

az monitor log-analytics workspace table show \
    --resource-group "$RG" \
    --workspace-name "$WORKSPACE_NAME" \
    --name "$TABLE_NAME" \
    --output json

Check whether Azure Monitor currently has a service-health event

az servicehealth events list \
    --query "[?contains(title, 'Azure Monitor') || contains(title, 'Log Analytics')].{title:title,status:status}"

5. Evidence to Collect¶

Collect evidence from both KQL and CLI so you can distinguish query-design issues from workspace-level issues. Prefer one shared time window. If the symptom is in a workbook, extract the underlying KQL and test it directly in Logs before changing workbook JSON or blaming the portal.

5.1 KQL Queries¶

Query 1: Table volume baseline¶

Usage
| where TimeGenerated > ago(24h)
| summarize TotalGB = round(sum(Quantity) / 1024.0, 2) by DataType
| order by TotalGB desc
| take 10

Sample Output

DataType	TotalGB	Interpretation
ContainerLogV2	18.74	Largest scanned table is likely to dominate broad troubleshooting queries.
AppTraces	9.41	String-heavy trace searches can be expensive without early filters.
AzureDiagnostics	6.12	Multi-service schema usually needs explicit `ResourceProvider` or `Category` filters.

How to Read This

Start with the top-volume tables. If your slow query targets one of them, assume data volume is part of the problem until disproved.

Query 2: Expensive predicate baseline¶

AppServiceHTTPLogs
| where TimeGenerated > ago(24h)
| where CsUriStem contains "/api/"
| summarize Requests = count() by CsUriStem
| order by Requests desc

Sample Output

CsUriStem	Requests	Interpretation
/api/orders	125000	High-row paths need narrower windows or more selective filters.
/api/catalog	73110	Broad text search scales poorly on busy routes.

How to Read This

Keep this as the expensive baseline. If a more selective operator changes runtime materially, you found a query-shape problem rather than a platform problem.

Query 3: Narrow projection and selective operator pattern¶

AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| where CsUriStem has "/api/orders"
| project TimeGenerated, CsUriStem, ScStatus, TimeTaken
| summarize Requests = count(), P95 = percentile(TimeTaken, 95) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc

Sample Output

TimeGenerated	Requests	P95	Interpretation
2026-04-05 09:00:00	1840	1420	Tighter time range preserves the business answer with far less work.
2026-04-05 09:05:00	1774	1358	Early `project` prevents unused columns from flowing into `summarize`.

How to Read This

Optimization is not only about speed. The best rewrite keeps the same analytical answer while reducing scanned rows and carried columns.

5.2 CLI Investigation¶

Before changing the original query, capture control results so you can compare later. Microsoft Learn optimization guidance is most useful when you can show that the same workspace answers narrow control queries quickly while the slow query remains expensive.

Command 1: Check workspace configuration and query access state¶

az monitor log-analytics workspace show \
    --resource-group "$RG" \
    --workspace-name "$WORKSPACE_NAME" \
    --query "{id:id,name:name,sku:sku.name,retentionInDays:retentionInDays,dailyQuotaGb:workspaceCapping.dailyQuotaGb,publicNetworkAccessForQuery:publicNetworkAccessForQuery}"

Sample Output (sanitized)

{
  "dailyQuotaGb": 200,
  "id": "/subscriptions/<subscription-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-production",
  "name": "law-production",
  "publicNetworkAccessForQuery": "Enabled",
  "retentionInDays": 30,
  "sku": "PerGB2018"
}

Interpretation: Workspace settings are normal and do not immediately explain one slow query. This weakens H4 unless many workspaces show the same behavior.

Command 2: Run a narrow control query from CLI¶

az monitor log-analytics query \
    --workspace "$WORKSPACE_ID" \
    --analytics-query "Heartbeat | where TimeGenerated > ago(15m) | summarize LastHeartbeat=max(TimeGenerated) by Computer | take 5" \
    --timespan "PT15M"

Sample Output (sanitized)

Computer            LastHeartbeat
------------------  -------------------------
aks-nodepool1-0001  2026-04-05T10:42:12Z
aks-nodepool1-0002  2026-04-05T10:42:09Z
vm-prod-01          2026-04-05T10:42:11Z

Interpretation: If this finishes quickly while the user query is slow, the workspace is responsive and the original KQL design remains the primary suspect.

Command 3: Check recent table usage trend¶

az monitor log-analytics query \
    --workspace "$WORKSPACE_ID" \
    --analytics-query "Usage | where TimeGenerated > ago(6h) | summarize TotalGB=round(sum(Quantity)/1024.0,2) by DataType | order by TotalGB desc | take 5" \
    --timespan "PT6H"

Sample Output (sanitized)

DataType         TotalGB
---------------  -------
ContainerLogV2   5.26
AppTraces        2.01
InsightsMetrics  1.32

Interpretation: High recent ingestion in the same table the slow query reads strongly supports H1 or H3 rather than H4.

Command 4: Inspect the scheduled query rule if the slow KQL powers an alert¶

az monitor scheduled-query show \
    --resource-group "$RG" \
    --name "$ALERT_RULE_NAME" \
    --output json

Sample Output (sanitized)

{
  "enabled": true,
  "evaluationFrequency": "PT5M",
  "scopes": [
    "/subscriptions/<subscription-id>/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-production"
  ],
  "windowSize": "PT15M"
}

Interpretation: A heavy query in a short evaluation cycle is both a latency risk and an operational cost risk.

6. Validation and Disproof by Hypothesis¶

Hypothesis H1: The query scans too much data¶

Proves if: Narrowing the time range from 24 hours to 1 hour materially reduces duration and scanned data while preserving the same logical operators.

Disproves if: The query remains slow even for a 15-minute range and a single table with selective filters.

Tests

AzureDiagnostics
| where TimeGenerated > ago(1h)
| summarize Count = count() by Category
| order by Count desc

AzureDiagnostics
| where TimeGenerated > ago(24h)
| summarize Count = count() by Category
| order by Count desc

If the one-hour version is fast and the 24-hour version is slow, the query is volume-sensitive rather than generically broken.

Hypothesis H2: The KQL shape is inefficient¶

Proves if: Reordering the same query so that where TimeGenerated, resource filters, and project happen early dramatically improves execution.

Disproves if: The optimized shape performs almost identically to the original shape and scanned data stays low.

Tests

// Original pattern
ContainerLogV2
| extend Parsed = parse_json(LogMessage)
| where TimeGenerated > ago(1h)
| where tostring(Parsed.level) == "error"
| summarize Count = count() by PodName

// Improved pattern
ContainerLogV2
| where TimeGenerated > ago(1h)
| where LogMessage has '"level":"error"'
| project TimeGenerated, PodName, LogMessage
| extend Parsed = parse_json(LogMessage)
| summarize Count = count() by PodName

If the improved pattern is substantially faster, late parsing was the dominant issue.

Hypothesis H3: One table has unexpectedly high volume or expensive schema¶

Proves if: Slowness is concentrated on one table and lighter tables remain fast over the same time window.

Disproves if: Simple probes against multiple unrelated tables are also slow.

Tests

union isfuzzy=true
    (Heartbeat | where TimeGenerated > ago(1h) | summarize Count=count() | extend TableName="Heartbeat"),
    (InsightsMetrics | where TimeGenerated > ago(1h) | summarize Count=count() | extend TableName="InsightsMetrics"),
    (ContainerLogV2 | where TimeGenerated > ago(1h) | summarize Count=count() | extend TableName="ContainerLogV2")
| project TableName, Count

If only ContainerLogV2 queries are slow and high-volume, table-specific volume or schema complexity is likely.

Hypothesis H4: Workspace or service-side health issue exists¶

Proves if: Multiple narrow queries across unrelated tables are slow for several users, and service-health or Azure Monitor status indicates an incident.

Disproves if: Control queries complete quickly and only one complex query fails.

Tests

az monitor log-analytics query \
    --workspace "$WORKSPACE_ID" \
    --analytics-query "Usage | where TimeGenerated > ago(30m) | summarize Rows=count()" \
    --timespan "PT30M"

az servicehealth events list \
    --query "[?contains(summary, 'Azure Monitor') || contains(summary, 'Log Analytics')].{summary:summary,status:status}"

Consistent slowness for trivial control queries plus service-health evidence is the strongest H4 signal.

Hypothesis H5: Workbook, alert-rule, or cross-workspace design adds overhead¶

Proves if: The same logic is notably faster when tested directly in one workspace or outside workbook parameters.

Disproves if: A simplified single-workspace replay is equally slow.

Tests

union workspace("<workspace-a>")
    (AppRequests | where TimeGenerated > ago(30m) | summarize Requests=sum(ItemCount) by AppRoleName)

AppRequests
| where TimeGenerated > ago(30m)
| summarize Requests = sum(ItemCount) by AppRoleName

If the direct single-workspace query is healthy, focus on workbook scope, parameters, and federation design rather than base-table performance.

Hypothesis H6: Weak predicates or high-cardinality grouping is the bottleneck¶

Proves if: Replacing search / contains with selective predicates or grouping by a coarser field materially improves runtime.

Disproves if: Predicate and grouping changes have little effect.

Tests

AppRequests
| where TimeGenerated > ago(24h)
| summarize Requests = sum(ItemCount) by Url
| take 20

AppRequests
| where TimeGenerated > ago(24h)
| extend Route = tostring(split(Url, "?")[0])
| summarize Requests = sum(ItemCount) by Route
| take 20

If the coarse version is much faster and more useful, high-cardinality design was the problem.

7. Likely Root Cause Patterns¶

Pattern	Evidence	Resolution
Broad time range on high-volume tables	`Usage` shows high volume; 1-hour query is fast but 24-hour query is slow	Reduce default workbook and alert windows; pre-aggregate where possible.
Late filtering after `join`, `parse`, or `mv-expand`	Reordered query is much faster	Move selective `where` clauses and `project` before expensive operators.
`AzureDiagnostics` or dynamic-column overuse	Wide schema with few category filters and lots of string parsing	Filter by `ResourceProvider`, `Category`, and explicit columns before parsing.
`search *` or weak `contains` predicates	Correct answers but excessive scan cost and timeout risk	Replace with `has`, equality, or table-specific filters whenever valid.
High-cardinality summarize or join	Grouping by `OperationId`, full URLs, IPs, or request IDs produces huge intermediate sets	Aggregate at a coarser business dimension and drill deeper only on demand.
Workbook or cross-workspace overhead	Raw KQL is fast in Logs but slow in workbook or federated execution	Simplify parameters, reduce workspaces, and separate exploratory from operational queries.
Mistaken workspace-health diagnosis	Narrow control queries are healthy and service health is clean	Focus on KQL design rather than quota, retention, or incident response.

Normal vs Abnormal Comparison¶

Metric/Log	Normal State	Abnormal State	Threshold
Data scanned	Proportional to time range and filters	Very high even for small result sets	Multi-GB scan for tiny output
Query order	Time filter and narrow predicates appear early	`join`, `parse`, or `extend` happen before filtering	Missing early filters
Table usage	Stable ingestion profile	Sudden ingestion burst in target table	Clear usage spike
Predicate selectivity	`has`, equality, and targeted fields	`search *`, broad `contains`, wide `or` chains	Weak text search
Grouping cardinality	`summarize` uses service or route-level dimensions	`summarize` uses near-unique keys	Near-unique grouping
Simple control queries	Finish quickly	Also slow or timing out	Several seconds to minutes

8. Immediate Mitigations¶

Reduce the default query time range in dashboards, workbooks, and ad hoc triage to the smallest useful window.
Move where TimeGenerated and resource-specific filters to the top of the query.
Add project early so large dynamic or unused columns are not carried through the pipeline.
Replace contains with more selective operators such as has or exact comparisons where the data shape allows it.
Break large joins into staged summaries instead of correlating raw events across both tables.
Split exploratory workbook KQL from production alert KQL so repeated execution uses lighter purpose-built queries.
If one large table dominates, address ingestion shape as well as query shape.

9. Prevention¶

Review saved KQL used by workbooks and alerts quarterly, especially after onboarding services that increase ingestion volume.
Establish authoring rules: time filter first, project early, summarize before join, and avoid broad regex or search * on hot tables.
Monitor Usage so teams know when formerly reasonable default windows become too expensive.
Keep table-specific query snippets for heavy sources such as AKS logs, Application Insights traces, AppRequests, and AzureDiagnostics instead of relying on broad generic searches.
Teach operators to compare a slow production query with a small-window control query before escalating to workspace or service-health investigations.

Prevention checklist¶

Validate every new workbook query against both a 15-minute and a 24-hour window.
Require reviewers to inspect for early where TimeGenerated, early project, and constrained joins.
Keep a short list of approved query patterns for high-volume tables used in production dashboards.
Revisit retention, table plans, and ingestion filtering when teams repeatedly need wide historical searches on verbose tables.

Slow Query Performance¶

1. Summary¶

2. Common Misreadings¶

3. Competing Hypotheses¶

4. What to Check First¶

5. Evidence to Collect¶

5.1 KQL Queries¶

Query 1: Table volume baseline¶

Query 2: Expensive predicate baseline¶

Query 3: Narrow projection and selective operator pattern¶

5.2 CLI Investigation¶

Command 1: Check workspace configuration and query access state¶

Command 2: Run a narrow control query from CLI¶

Command 3: Check recent table usage trend¶

Command 4: Inspect the scheduled query rule if the slow KQL powers an alert¶

6. Validation and Disproof by Hypothesis¶

Hypothesis H1: The query scans too much data¶

Hypothesis H2: The KQL shape is inefficient¶

Hypothesis H3: One table has unexpectedly high volume or expensive schema¶

Hypothesis H4: Workspace or service-side health issue exists¶

Hypothesis H5: Workbook, alert-rule, or cross-workspace design adds overhead¶

Hypothesis H6: Weak predicates or high-cardinality grouping is the bottleneck¶

7. Likely Root Cause Patterns¶

Normal vs Abnormal Comparison¶

8. Immediate Mitigations¶

9. Prevention¶

Prevention checklist¶

See Also¶

Sources¶