Event Job Storm

Use this playbook when an event-driven job creates too many executions, drains a queue unexpectedly fast, or amplifies downstream failures.

Symptom

A queue spike results in far more job executions than expected.
Downstream dependencies throttle because too many job executions start concurrently.
Retries or poison messages keep re-triggering work.
Operators see a sudden burst of executions in the output of `az containerapp job execution list` even though the backlog was small.

flowchart TD
    A[Execution count surges] --> B[Review trigger type and scale metadata]
    B --> C[Inspect execution history]
    C --> D{Backlog legitimately large}
    D -->|Yes| E[Tune max executions and downstream capacity]
    D -->|No| F[Check duplicate or poison messages]
    F --> G{Messages are reappearing}
    G -->|Yes| H[Fix consumer completion or dead-letter policy]
    G -->|No| I[Reduce trigger aggressiveness]
    E --> J[Re-run with bounded concurrency]
    H --> J
    I --> J

Possible Causes

The event trigger metadata is too aggressive for the queue or downstream dependency.
Maximum parallel executions are higher than the workload can safely absorb.
Messages are retried repeatedly because the consumer fails before acknowledging completion.
A Dapr or queue component points to the wrong source, causing duplicate delivery patterns.
Operators are measuring execution count instead of unique business work items.

Diagnosis Steps

Confirm the job really is event-driven.
Compare execution history with the actual backlog and dead-letter counts.
Check whether the same messages are being retried or re-enqueued.
Test the concurrency hypothesis by reviewing the configured maximum executions and parallelism, then checking whether the observed burst stayed within those limits.

az containerapp job show \
    --name "$JOB_NAME" \
    --resource-group "$RG" \
    --output json

az containerapp job execution list \
    --name "$JOB_NAME" \
    --resource-group "$RG" \
    --output table

az containerapp job list \
    --resource-group "$RG" \
    --environment "$CONTAINER_ENV" \
    --output table

Command	Why it is used
`az containerapp job show --name "$JOB_NAME" --resource-group "$RG" --output json`	Confirms the job trigger model and lets you inspect queue-related metadata in the applied definition.
`az containerapp job execution list --name "$JOB_NAME" --resource-group "$RG" --output table`	Shows how many executions were launched and whether they cluster in short time windows.
`az containerapp job list --resource-group "$RG" --environment "$CONTAINER_ENV" --output table`	Verifies you are looking at the correct job in the intended Container Apps environment.
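
To narrow the JSON from `az containerapp job show` down to the trigger-related fields, a JMESPath `--query` can help. The sketch below assumes the applied definition exposes `properties.configuration.triggerType` and `properties.configuration.eventTriggerConfig` (with `parallelism` and a `scale` block). Those property paths are an assumption and may differ across API versions, so compare them with your raw output first. The function name `show_trigger_config` is hypothetical.

```shell
# Hypothetical helper: print only the trigger-related parts of a job definition.
# Assumes the property paths in --query match the JSON returned for your job.
# $1 = job name, $2 = resource group
show_trigger_config() {
    az containerapp job show \
        --name "$1" \
        --resource-group "$2" \
        --query "properties.configuration.{triggerType: triggerType, parallelism: eventTriggerConfig.parallelism, scale: eventTriggerConfig.scale}" \
        --output json
}

# Usage: show_trigger_config "$JOB_NAME" "$RG"
```

If `triggerType` is not an event trigger, the storm is unlikely to come from queue-driven scaling and the rest of this playbook may not apply.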

KQL to correlate burst timing:

let JobName = "job-myapp";
ContainerAppSystemLogs_CL
| where TimeGenerated > ago(6h)
| where JobName_s == JobName or ContainerAppName_s == JobName
| where Log_s has_any ("Execution", "Started", "Completed", "Failed")
| summarize Executions=count() by bin(TimeGenerated, 5m), Reason_s
| order by TimeGenerated asc
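
When Log Analytics is not available, a rough version of the same burst view can be built from CLI output. The sketch below assumes each execution exposes an ISO 8601 start time at `properties.startTime` (an assumption about the output shape) and uses a hypothetical `bucket_by_minute` helper that counts timestamps per minute.

```shell
# Hypothetical helper: read ISO 8601 timestamps on stdin (one per line)
# and print "<minute> <count>" sorted by time.
bucket_by_minute() {
    awk 'NF { count[substr($1, 1, 16)]++ }
         END { for (minute in count) print minute, count[minute] }' | sort
}

# Usage (assumes properties.startTime exists in the execution list output):
# az containerapp job execution list \
#     --name "$JOB_NAME" \
#     --resource-group "$RG" \
#     --query "[].properties.startTime" \
#     --output tsv | bucket_by_minute
```

Minutes with far more starts than the backlog would justify are the windows worth correlating with queue and dead-letter metrics.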

Resolution

Reduce event-trigger aggressiveness and cap concurrency at a level the downstream dependency can safely absorb.
Fix duplicate-delivery patterns before increasing throughput.
If poison messages are present, route them away from the hot path and retry them separately.
Validate the tuned configuration against a controlled backlog instead of production surge traffic.

az containerapp job show \
    --name "$JOB_NAME" \
    --resource-group "$RG" \
    --output yaml

Command	Why it is used
`az containerapp job show --name "$JOB_NAME" --resource-group "$RG" --output yaml`	Provides the current job definition so you can reapply a safer event trigger and concurrency envelope through YAML or IaC.
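
As one way to apply a tighter concurrency envelope without editing YAML, the sketch below wraps `az containerapp job update` with its `--max-executions` and `--polling-interval` options. The wrapper name `cap_job_concurrency` is hypothetical and the values in the usage line are illustrative rather than recommendations. Confirm the options against `az containerapp job update --help` for your CLI version before relying on them.

```shell
# Hypothetical wrapper: lower the execution cap and slow the polling cadence.
# $1 = job name, $2 = resource group,
# $3 = max executions, $4 = polling interval in seconds
cap_job_concurrency() {
    az containerapp job update \
        --name "$1" \
        --resource-group "$2" \
        --max-executions "$3" \
        --polling-interval "$4"
}

# Usage (illustrative values): cap_job_concurrency "$JOB_NAME" "$RG" 5 60
```

Start with a conservative cap, watch downstream throttling, and raise the limit only after duplicate-delivery issues are ruled out.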

Prevention

Define explicit upper bounds for job concurrency.
Dead-letter poison messages instead of retrying them indefinitely.
Load-test event jobs with realistic backlog shapes.
Document the mapping between queue depth, execution fan-out, and downstream capacity.
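
The mapping between queue depth and execution fan-out can be written down as simple arithmetic. The sketch below assumes a KEDA-style queue scaler that requests roughly one execution per target number of messages in each polling interval, bounded by the configured maximum. The helper name `expected_executions` and the sample numbers are hypothetical, and the per-execution target must be a positive integer.

```shell
# Hypothetical helper: estimate executions requested in one polling interval.
# $1 = queue depth, $2 = target messages per execution (> 0), $3 = max executions cap
expected_executions() {
    depth="$1"; per_execution="$2"; cap="$3"
    # Ceiling division: how many executions the backlog alone would request.
    wanted=$(( (depth + per_execution - 1) / per_execution ))
    # The configured cap bounds the fan-out regardless of backlog size.
    if [ "$wanted" -gt "$cap" ]; then
        wanted="$cap"
    fi
    echo "$wanted"
}

expected_executions 250 10 100   # backlog-bound: prints 25
expected_executions 250 10 20    # cap-bound: prints 20
```

Comparing this estimate with the observed execution count helps separate a legitimately large backlog from duplicate or reappearing messages.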
