Jobs Operations
This guide covers day-2 operations for Container Apps Jobs: listing executions, drilling into failures, replaying work, and tracking job health over time.
Prerequisites
- Azure Container Apps environment and Job already deployed
- Azure CLI and the Container Apps extension installed on the operator workstation
- Log Analytics workspace connected to the environment for longer-term history
export RG="rg-aca-prod"
export JOB_NAME="job-orders-reconcile"
export EXECUTION_NAME="job-orders-reconcile-abc123"
export WORKSPACE_ID="<log-analytics-workspace-id>"
When to Use
Use this runbook when you need to:
- inspect the most recent Job executions
- stop a bad execution
- replay a failed run
- answer whether success rate, duration, or retry behavior is drifting
Procedure
1. List recent executions
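A minimal sketch of the listing step, assuming `az containerapp job execution list` is available in your installed extension and that `RG` and `JOB_NAME` are exported as in Prerequisites. The helper name `list_recent_executions` is hypothetical, and the field paths in `--query` (`name`, `properties.status`, `properties.startTime`) are assumptions to check against the raw `--output json` result first.

```shell
# Hypothetical helper: print executions for a job as a compact table.
# Defaults to the JOB_NAME and RG variables from Prerequisites.
list_recent_executions() {
  local job="${1:-$JOB_NAME}"
  local rg="${2:-$RG}"
  # The --query field paths are assumptions; confirm them against
  # the unfiltered JSON output before relying on this table.
  az containerapp job execution list \
    --name "$job" \
    --resource-group "$rg" \
    --query '[].{Name: name, Status: properties.status, StartTime: properties.startTime}' \
    --output table
}
```

Call it as `list_recent_executions`, or pass a different job and resource group as the first and second arguments.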
2. Inspect a specific execution
az containerapp job execution show \
--name "$JOB_NAME" \
--resource-group "$RG" \
--job-execution-name "$EXECUTION_NAME" \
--output json
3. Stop an in-flight execution when needed
az containerapp job stop \
--name "$JOB_NAME" \
--resource-group "$RG" \
--job-execution-name "$EXECUTION_NAME"
Confirm CLI command availability against your installed extension
The execution list, show, and stop commands above reflect the command shapes expected for current Container Apps Jobs operations. Verify them against the Container Apps extension version you run in production before codifying them in automation.
4. Replay a failed execution manually
Replay starts a new execution from the same job definition.
Before replaying, confirm whether you also need to:
- requeue or unlock an input item
- clean up partial output from the failed run
- reduce parallelism or retries for the replay window
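Once those checks are done, the replay itself can be sketched as follows, assuming `az containerapp job start` is available in your installed extension and that its response includes a `name` field for the new execution. The helper name `replay_job` is hypothetical.

```shell
# Hypothetical helper: start a new execution from the current job
# definition and print the new execution's name.
replay_job() {
  local job="${1:-$JOB_NAME}"
  local rg="${2:-$RG}"
  # Assumption: the start response carries the execution name in `name`.
  az containerapp job start \
    --name "$job" \
    --resource-group "$rg" \
    --query name \
    --output tsv
}
```

Capturing the result, for example `EXECUTION_NAME="$(replay_job)"`, lets the inspect and stop steps target the replayed run rather than the failed one.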
5. Query logs for a job or execution
When your workspace schema is known, filter directly on the job and execution fields. If the schema differs across workspaces, use a defensive query that tolerates different column names.
let TargetJob = "job-orders-reconcile";
let TargetExecution = "job-orders-reconcile-abc123";
ContainerAppSystemLogs_CL
| extend JobName = tostring(column_ifexists("JobName_s", column_ifexists("ContainerAppName_s", "")))
| extend ExecutionName = tostring(column_ifexists("ExecutionName_s", column_ifexists("ExecutionId_g", "")))
| where JobName == TargetJob
| where isempty(TargetExecution) or ExecutionName == TargetExecution
| project TimeGenerated, JobName, ExecutionName, Reason=tostring(column_ifexists("Reason_s", "")), Log=tostring(column_ifexists("Log_s", ""))
| order by TimeGenerated desc
Exact Log Analytics column names vary by workspace schema
Existing repository KQL examples use JobName_s and ExecutionName_s in ContainerAppSystemLogs_CL. Re-check the actual columns in your workspace before you build dashboards or alerts around a fixed schema.
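To run a log query from the operator workstation instead of the portal, one option is `az monitor log-analytics query`. The sketch below assumes the `log-analytics` CLI extension is installed, that `WORKSPACE_ID` holds the workspace GUID, and that your workspace exposes the fixed `JobName_s`, `ExecutionName_s`, `Reason_s`, and `Log_s` columns. The helper name `query_job_logs` is hypothetical.

```shell
# Hypothetical helper: fetch the latest system log rows for a job.
# Uses fixed column names; adjust them to your workspace schema.
query_job_logs() {
  local job="${1:-$JOB_NAME}"
  local timespan="${2:-PT24H}"   # ISO 8601 duration, here the last 24 hours
  local kql
  kql="ContainerAppSystemLogs_CL
| where JobName_s == '${job}'
| project TimeGenerated, JobName_s, ExecutionName_s, Reason_s, Log_s
| order by TimeGenerated desc
| take 50"
  az monitor log-analytics query \
    --workspace "$WORKSPACE_ID" \
    --analytics-query "$kql" \
    --timespan "$timespan" \
    --output table
}
```

For example, `query_job_logs "$JOB_NAME" PT6H` narrows the window to six hours. The query fails if any projected column is missing, which is a quick way to detect a schema mismatch.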
6. Track success rate, duration, and retry activity
Success and failure trend:
ContainerAppSystemLogs_CL
| extend JobName = tostring(column_ifexists("JobName_s", column_ifexists("ContainerAppName_s", "")))
| extend Reason = tostring(column_ifexists("Reason_s", ""))
| where JobName == "job-orders-reconcile"
| where Reason in ("Completed", "Failed")
| summarize Executions=count() by Reason, bin(TimeGenerated, 1h)
| order by TimeGenerated asc
Retry activity trend:
ContainerAppSystemLogs_CL
| extend JobName = tostring(column_ifexists("JobName_s", column_ifexists("ContainerAppName_s", "")))
| extend Reason = tostring(column_ifexists("Reason_s", ""))
| where JobName == "job-orders-reconcile"
| where Reason has "Retry"
| summarize RetryEvents=count() by bin(TimeGenerated, 1h)
| order by TimeGenerated asc
Duration example from structured application logs:
ContainerAppConsoleLogs_CL
| extend Payload = parse_json(Log_s)
| extend ExecutionName = tostring(Payload.execution_name)
| extend DurationMs = todouble(Payload.duration_ms)
| where tostring(Payload.message) == "Job execution completed"
| summarize P50Ms=percentile(DurationMs, 50), P95Ms=percentile(DurationMs, 95), MaxMs=max(DurationMs) by bin(TimeGenerated, 1h)
| order by TimeGenerated asc
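The duration query only returns rows if the job itself writes a JSON log line carrying `message`, `execution_name`, and `duration_ms`. A minimal sketch of emitting that line from a shell entrypoint follows. It assumes the platform sets `CONTAINER_APP_JOB_EXECUTION_NAME` inside the job container (verify this in your environment), and `log_job_completed` is a hypothetical helper name.

```shell
# Hypothetical helper: emit one structured completion line to stdout.
# Arguments: start and end timestamps in milliseconds.
log_job_completed() {
  local start_ms="$1"
  local end_ms="$2"
  # Assumption: this variable is injected by the platform; falls back to "unknown".
  local execution="${CONTAINER_APP_JOB_EXECUTION_NAME:-unknown}"
  # No JSON escaping is done; this relies on the execution name
  # containing only characters that are safe inside a JSON string.
  printf '{"message":"Job execution completed","execution_name":"%s","duration_ms":%d}\n' \
    "$execution" "$((end_ms - start_ms))"
}

# Usage inside an entrypoint (second-level resolution, scaled to ms):
#   start_ms=$(( $(date +%s) * 1000 ))
#   ...do the work...
#   log_job_completed "$start_ms" "$(( $(date +%s) * 1000 ))"
```

Keeping the field names identical to those parsed in the query avoids a silent mismatch where the chart is simply empty.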
Verification
Use the control loop below after any replay or stop action.
flowchart TD
A[List executions] --> B[Inspect failed or long-running execution]
B --> C{Stop or replay needed?}
C -->|Stop| D[Stop execution]
C -->|Replay| E[Start new execution]
D --> F[Query logs and metrics]
E --> F
F --> G[Confirm success rate and duration return to baseline]
Basic verification commands:
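No verification commands survive in this section, so the following is a sketch rather than the original content. It polls an execution until it leaves an in-progress state, assuming `az containerapp job execution show` returns the state under `properties.status` and that `Running` and `Processing` are the in-progress values. The helper names `execution_status` and `wait_for_execution` are hypothetical.

```shell
# Hypothetical helper: print the status of one execution.
execution_status() {
  local execution="${1:-$EXECUTION_NAME}"
  # Assumption: the status lives at properties.status in the show output.
  az containerapp job execution show \
    --name "$JOB_NAME" \
    --resource-group "$RG" \
    --job-execution-name "$execution" \
    --query properties.status \
    --output tsv
}

# Hypothetical helper: poll until the execution is no longer in progress.
# Prints the last observed status; returns 0 only if it is "Succeeded".
wait_for_execution() {
  local execution="${1:-$EXECUTION_NAME}"
  local interval="${2:-15}"   # seconds between polls
  local attempts="${3:-40}"   # maximum number of polls
  local status=""
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    status="$(execution_status "$execution")"
    case "$status" in
      Running|Processing) sleep "$interval" ;;
      *) break ;;
    esac
    i=$((i + 1))
  done
  echo "$status"
  [ "$status" = "Succeeded" ]
}
```

After a replay, `wait_for_execution "$EXECUTION_NAME"` gives a pass or fail exit code for the new run. After a stop, a non-zero exit is expected, so check the printed status instead.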
Rollback / Troubleshooting
- If a replay starts reprocessing bad input, stop it and quarantine the input item.
- If failures are data-dependent, reduce retries and use the dead-letter path instead of repeated replay.
- If logs are insufficient, update the job image to emit explicit execution correlation fields before the next incident.
Use Jobs Troubleshooting for symptom-based triage.