Container App Job Execution Failure¶
1. Summary¶
Symptom¶
- Job execution state is
FailedorTimedOut. - Scheduled jobs skip expected run windows.
- Console logs are empty or show startup/auth errors.
Why this scenario is confusing¶
Container Apps Jobs can fail before the workload does useful work, fail near the timeout boundary, or retry without ever fixing the real dependency problem. It is easy to blame the cron trigger or the job platform when the actual issue is trigger metadata, timeout policy, or missing configuration.
Troubleshooting decision flow¶
flowchart TD
A[Job does not complete successfully] --> B{Execution created?}
B -->|No| C[Fix trigger schedule or event configuration]
B -->|Yes| D{Timed out before workload end?}
D -->|Yes| E[Increase timeout and tune retry policy]
D -->|No| F[Fix runtime errors, secrets, or identity access]
C --> G[Re-run job and verify success]
E --> G
F --> G 2. Common Misreadings¶
- "Cron trigger is broken." Invalid schedule timezone or overlapping run constraints are common.
- "Retries mean eventual success." Repeated retries can amplify downstream failures.
3. Competing Hypotheses¶
| Hypothesis | Typical Evidence For | Typical Evidence Against |
|---|---|---|
| H1: Trigger configuration incorrect | No executions for expected schedule/event | Manual execution succeeds with same image |
| H2: Timeout too low | Execution stops near timeout boundary | Runtime completes within configured timeout |
| H3: Missing secret or environment dependency | Job logs show auth/config failures | All env and secret checks pass |
4. What to Check First¶
Metrics¶
- Job execution success ratio and retry count over time.
Logs¶
let AppName = "job-myapp";
ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("job", "execution", "timeout", "retry", "failed")
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
Platform Signals¶
az containerapp job execution list --name "$APP_NAME" --resource-group "$RG" --output table
az containerapp job show --name "$APP_NAME" --resource-group "$RG" --output json
5. Evidence to Collect¶
Required Evidence¶
| Evidence | Command/Query | Purpose |
|---|---|---|
| Execution list | az containerapp job execution list --name "$APP_NAME" --resource-group "$RG" --output table | Confirm whether executions are being created and how they end |
| Job definition | az containerapp job show --name "$APP_NAME" --resource-group "$RG" --output json | Inspect trigger type, timeout, retry, and configuration |
| Execution details | az containerapp job execution show --name "$APP_NAME" --resource-group "$RG" --job-execution-name "<execution-name>" --output json | Capture detailed state for a failed or timed out run |
| System logs | az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type system | Check platform-side execution, timeout, and retry signals |
| Console logs | az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type console | Check workload-side startup, auth, or configuration failures |
| Job lifecycle KQL | KQL on ContainerAppSystemLogs_CL | Correlate execution events and failures over time |
Useful Context¶
- Trigger type (manual, schedule, event)
- Expected runtime versus configured timeout
- Retry policy and downstream side effects
- Recent changes to secret references, identity permissions, or trigger metadata
Observed successful job lifecycle sequence (baseline):
SuccessfulCreate → Successfully created pod for Job Execution
AssigningReplica → Replica scheduled to run on a node
PullingImage → Pulling image '<acr-name>.azurecr.io/myapp-job:v1.0.0'
PulledImage → Successfully pulled image in 2.42s (58720256 bytes)
ContainerCreated → Created container 'job-container'
ContainerStarted → Started container 'job-container'
ContainerTerminated → Container terminated with exit code '0'
Completed → Execution has successfully completed
PodDeletion → Pod exited with status Succeeded
6. Validation and Disproof by Hypothesis¶
H1: Trigger configuration incorrect¶
Signals that support:
- No executions for the expected schedule or event.
- Manual execution succeeds with the same image.
- Trigger schedule or event metadata does not align with the expected run window.
Signals that weaken:
- Executions are created on schedule.
- The same trigger metadata has been stable and successful.
- Failures occur after the execution starts rather than before creation.
What to verify:
az containerapp job execution list --name "$APP_NAME" --resource-group "$RG" --output table
az containerapp job show --name "$APP_NAME" --resource-group "$RG" --output json
Disproof logic: If executions are being created at the expected times and the same workload still fails after startup, the trigger configuration is not the primary fault domain.
H2: Timeout too low¶
Signals that support:
- Execution stops near the timeout boundary.
- Workload consistently needs longer than the configured timeout.
- Retries restart the same long-running work without completion.
Signals that weaken:
- Runtime completes well within the configured timeout.
- Failures happen immediately at startup.
- Logs show configuration or authentication errors before useful work begins.
What to verify:
az containerapp job execution show --name "$APP_NAME" --resource-group "$RG" --job-execution-name "<execution-name>" --output json
az containerapp job show --name "$APP_NAME" --resource-group "$RG" --output json
let AppName = "job-myapp";
ContainerAppSystemLogs_CL
| where ContainerAppName_s == AppName
| where Log_s has_any ("job", "execution", "timeout", "retry", "failed")
| project TimeGenerated, RevisionName_s, Log_s
| order by TimeGenerated desc
Disproof logic: If failed executions terminate far earlier than the timeout or fail for a clear runtime reason, increasing timeout will not resolve the incident.
H3: Missing secret or environment dependency¶
Signals that support:
- Job logs show auth or configuration failures.
- Console logs are empty or fail immediately after container startup.
- Secret references, env values, or identity access changed recently.
Signals that weaken:
- All env and secret checks pass.
- Manual and scheduled executions succeed with the same configuration.
- No auth or config failures appear in logs.
What to verify:
az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type system
az containerapp logs show --name "$APP_NAME" --resource-group "$RG" --type console
az containerapp job show --name "$APP_NAME" --resource-group "$RG" --output json
Disproof logic: If secrets, environment values, and identity permissions are correct and the workload still fails only near the timeout boundary, focus on timeout or trigger behavior instead.
7. Likely Root Cause Patterns¶
| Pattern | Frequency | First Signal | Typical Resolution |
|---|---|---|---|
| Trigger metadata drift | Common | No execution created when expected | Fix schedule, event configuration, or run constraints |
| Timeout set below real runtime | Common | Execution ends near timeout boundary | Increase timeout and tune retry policy |
| Missing secret or env reference | Common | Console logs show auth/config failure | Correct secret references and configuration |
| Identity permission issue | Occasional | Startup/auth failure in job logs | Fix managed identity access |
| Retry storm against broken dependency | Occasional | Repeated retries without useful work | Repair dependency and make retries safer |
8. Immediate Mitigations¶
- Validate trigger type (manual, schedule, event) and metadata.
- Increase timeout and set retry policy for expected runtime.
- Correct missing configuration, secret references, and identity permissions.
- Re-run execution and verify completion with expected output.
9. Prevention¶
- Add dry-run validation for job inputs and trigger config.
- Track job SLOs and alert on retry bursts.
- Make job handlers idempotent for safe retries.