Deployment Failures¶
1. Summary¶
Deployment failures in Azure Functions are often deployment-success-but-runtime-failure incidents. Transport success (artifact uploaded) does not guarantee host startup, trigger indexing, or invocation health. This playbook prioritizes fast triage of runtime settings, deployment model fit, and package integrity.
flowchart TD
A[Deployment failed or degraded after release] --> B{App starts?}
B -->|No| C[Check startup traces + activity log]
B -->|Yes but errors| D["Check runtime/app settings drift"]
C --> E{Recent deploy or config event?}
E -->|Yes| F[Validate deploy method + artifact]
E -->|No| G["Check platform/runtime changes"]
D --> H{FUNCTIONS_WORKER_RUNTIME correct?}
H -->|No| I[Fix runtime setting + restart]
H -->|Yes| J{linuxFxVersion aligned?}
J -->|No| K[Align stack and runtime]
J -->|Yes| L{Required settings present?}
L -->|No| M[Restore settings and slot config]
L -->|Yes| N{Method fits hosting model?}
N -->|No| O["Switch method: zip deploy / run-from-package"]
N -->|Yes| P[Check package integrity + redeploy]
F --> P
G --> J
P --> Q[Validate host stability and failure rate] Incident framing¶
- Common symptom: deployment command succeeds, but app returns
500,404, or host loops on startup. - Deployment concepts in scope: zip deploy, run-from-package, deployment slots, extension bundles,
FUNCTIONS_WORKER_RUNTIME,linuxFxVersion. - Use objective evidence first: traces, app settings snapshot, and activity log in the deployment window.
Variables¶
RG="rg-functions-prod"
APP_NAME="func-orders-prod"
SUBSCRIPTION_ID="<subscription-id>"
WORKSPACE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
2. Common Misreadings¶
- "Deployment succeeded, so app is healthy"
- Upload success does not prove host startup success.
- Always verify startup traces and first invocation outcomes.
- "404 after release is only routing"
- It can also indicate indexing failure or missing function metadata.
- "Staging is healthy, swap is safe"
- Slot-specific settings drift can break production immediately after swap.
- "Any artifact can run with current stack"
FUNCTIONS_WORKER_RUNTIMEandlinuxFxVersionmust align with artifact language/runtime.
- "Extension bundle issues are rare and harmless"
- Bundle retrieval/version issues can prevent trigger listeners from loading.
3. Competing Hypotheses¶
H1: Runtime mismatch¶
- Artifact language/runtime does not match app runtime configuration.
- Typical examples:
- Python package +
FUNCTIONS_WORKER_RUNTIME=node - Node package + unsupported Linux stack image
- .NET isolated package with incompatible host/runtime assumptions
- Python package +
H2: Missing required app settings¶
- Required settings are missing, empty, stale, or not sticky in slot swaps.
- Commonly implicated settings:
AzureWebJobsStorageFUNCTIONS_WORKER_RUNTIMEWEBSITE_RUN_FROM_PACKAGE- trigger connection setting keys referenced in bindings
H3: Deployment method not suited to hosting model¶
- Deployment strategy conflicts with plan/runtime behavior.
- Frequent patterns:
- mutable copy style deployment where package mount is expected
- slot swap without settings stickiness policy
- zip deploy sequence without verification of final mount/index state
H4: Artifact corruption or incomplete upload¶
- Uploaded package is incomplete/corrupted or CI artifact differs from deployed payload.
- Typical indicators:
- missing
host.jsonat package root - missing/transient dependency folders
- package hash mismatch between CI and deployment metadata
- missing
4. What to Check First¶
First 5-minute checklist¶
- Identify the exact release path: zip deploy, run-from-package URL, or slot swap.
- Snapshot app metadata (
state,kind,linuxFxVersion) and key settings. - Inspect activity log around deployment timestamp for failed config writes.
- Check startup/shutdown traces for crash loop behavior.
- If slots are used, diff staging vs production settings.
Metadata snapshot command¶
az functionapp show \
--resource-group $RG \
--name $APP_NAME \
--subscription $SUBSCRIPTION_ID \
--query "{state:state, kind:kind, reserved:reserved, linuxFxVersion:siteConfig.linuxFxVersion}" \
--output json
state should settle at Running. - linuxFxVersion should match intended runtime. - Linux app should report reserved: true. Slot-focused quick checks¶
- confirm sticky settings policy for production-critical values
- confirm
WEBSITE_RUN_FROM_PACKAGEvalue validity per slot - confirm connection settings required by bindings exist in target slot
5. Evidence to Collect¶
Sample Log Patterns¶
2026-04-04T07:15:02.214Z info Starting Host (HostId=func-orders-prod, InstanceId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
2026-04-04T07:15:02.921Z info Loading functions metadata
2026-04-04T07:15:03.104Z warn WorkerConfig for runtime:python not found; configured runtime is node
2026-04-04T07:15:03.370Z error A host error has occurred during startup operation 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
2026-04-04T07:15:03.371Z error Microsoft.Azure.WebJobs.Script: Did not find functions with language [python]
2026-04-04T07:15:08.122Z info Stopping JobHost
2026-04-04T07:15:08.601Z info Host started (4210ms)
2026-04-04T07:15:10.019Z warn Failed to load extension bundle from [https://functionscdn.azureedge.net/public/ExtensionBundles/...]
2026-04-04T07:15:12.443Z error Executed 'Functions.HttpTrigger' (Failed, Id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, Duration=12ms)
KQL Queries with Example Output¶
Query 1: Function execution summary (KQL library query 1)¶
let appName = "func-orders-prod";
requests
| where timestamp > ago(1h)
| where cloud_RoleName =~ appName
| where operation_Name startswith "Functions."
| summarize
Invocations = count(),
Failures = countif(success == false),
FailureRatePercent = round(100.0 * countif(success == false) / count(), 2),
P95Ms = percentile(duration, 95)
by FunctionName = operation_Name
| order by Failures desc, P95Ms desc
Query 2: Exceptions concentrated after deployment¶
let appName = "func-orders-prod";
exceptions
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| summarize Count=count() by type, outerMessage
| order by Count desc
Query 3: Host startup/shutdown timeline (adapted from KQL library query 8)¶
let appName = "func-orders-prod";
traces
| where timestamp > ago(12h)
| where cloud_RoleName =~ appName
| where message has_any ("Host started", "Job host started", "Host shutdown", "Host is shutting down", "Stopping JobHost")
| project timestamp, severityLevel, message
| order by timestamp desc
CLI Investigation Commands¶
Runtime and host metadata¶
az functionapp show \
--resource-group $RG \
--name $APP_NAME \
--subscription $SUBSCRIPTION_ID \
--output json
{
"name": "func-orders-prod",
"state": "Running",
"kind": "functionapp,linux",
"siteConfig": {
"linuxFxVersion": "PYTHON|3.11"
}
}
App settings inspection¶
az functionapp config appsettings list \
--resource-group $RG \
--name $APP_NAME \
--subscription $SUBSCRIPTION_ID \
--output table
Name Value
----------------------------- -----------------------------------------
FUNCTIONS_WORKER_RUNTIME python
AzureWebJobsStorage DefaultEndpointsProtocol=https;...
WEBSITE_RUN_FROM_PACKAGE 1
APPINSIGHTS_INSTRUMENTATIONKEY xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Control plane event history¶
az monitor activity-log list \
--subscription $SUBSCRIPTION_ID \
--resource-group $RG \
--offset 2h \
--status Failed \
--output table
EventTimestamp OperationNameValue Status ResourceGroup
-------------------------- ------------------------------------- ------ -----------------
2026-04-04T07:14:51.000000Z Microsoft.Web/sites/config/write Failed rg-functions-prod
2026-04-04T07:14:30.000000Z Microsoft.Web/sites/publishxml/action Failed rg-functions-prod
Normal vs Abnormal Comparison¶
How to Read This
Language mismatch between linuxFxVersion and FUNCTIONS_WORKER_RUNTIME is a high-confidence root cause. It typically causes broad failures across functions, not an isolated endpoint issue.
H2: Missing required app settings¶
Signals that support¶
- startup/invocation errors mention missing or null configuration values
- storage/auth initialization fails after deployment
- swap happened and slot settings differ
Signals that weaken¶
- required settings exist with non-empty values in affected slot
- restoring known-good settings does not change behavior
- incident predates any settings change
What to verify¶
KQL:
let appName = "func-orders-prod";
exceptions
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where outerMessage has_any ("Missing", "not configured", "AzureWebJobsStorage", "Value cannot be null")
| project timestamp, type, outerMessage
| order by timestamp desc
az functionapp config appsettings list \
--resource-group $RG \
--name $APP_NAME \
--subscription $SUBSCRIPTION_ID \
--query "[?name=='AzureWebJobsStorage' || name=='FUNCTIONS_WORKER_RUNTIME' || name=='WEBSITE_RUN_FROM_PACKAGE']" \
--output table
Name Value
------------------------ -----
FUNCTIONS_WORKER_RUNTIME python
WEBSITE_RUN_FROM_PACKAGE 1
AzureWebJobsStorage
How to Read This
Empty AzureWebJobsStorage is usually enough to block startup or trigger initialization. When this appears after swap, check sticky settings and release variable scope first.
H3: Deployment method not suited to hosting model¶
Signals that support¶
- host behavior becomes unstable immediately after method change
- swap succeeds but production degrades while staging looked healthy
- package mount semantics do not match deployment approach
Signals that weaken¶
- method is stable historically on same plan and runtime
- same artifact + same method redeploy works repeatedly
- no settings drift between staging and production
What to verify¶
KQL:
let appName = "func-orders-prod";
traces
| where timestamp > ago(3h)
| where cloud_RoleName =~ appName
| where message has_any ("Host started", "Host shutdown", "Stopping JobHost", "Run-From-Package", "ZipDeploy")
| project timestamp, severityLevel, message
| order by timestamp desc
az functionapp config appsettings list \
--resource-group $RG \
--name $APP_NAME \
--subscription $SUBSCRIPTION_ID \
--query "[?name=='WEBSITE_RUN_FROM_PACKAGE']" \
--output json
az monitor activity-log list \
--subscription $SUBSCRIPTION_ID \
--resource-group $RG \
--offset 3h \
--output table
Name Value
----------------------- ---------------------------------------------
WEBSITE_RUN_FROM_PACKAGE https://<storage-account>.blob.core.windows.net/deployments/app.zip?<sas>
EventTimestamp OperationNameValue Status
-------------------------- ----------------------------------------- -------
2026-04-04T07:14:30.000000Z Microsoft.Web/sites/slots/slotsswap/action Succeeded
2026-04-04T07:13:58.000000Z Microsoft.Web/sites/config/write Succeeded
How to Read This
If swap succeeded but startup loops continue, compare slot-specific settings and package URL accessibility. If deploy method changed recently, rollback method first before deeper code rollback.
H4: Artifact corruption or incomplete upload¶
Signals that support¶
- function count drops or expected functions disappear after release
- indexing/load errors appear despite correct runtime configuration
- previous artifact rollback restores service quickly
Signals that weaken¶
- artifact hash matches CI and package inspection is clean
- same artifact works in another environment
- failure pattern points to downstream dependency outage
What to verify¶
KQL:
let appName = "func-orders-prod";
traces
| where timestamp > ago(2h)
| where cloud_RoleName =~ appName
| where message has_any ("Loading functions metadata", "Error indexing method", "No job functions found", "Failed to load")
| project timestamp, severityLevel, message
| order by timestamp desc
az monitor activity-log list \
--subscription $SUBSCRIPTION_ID \
--resource-group $RG \
--offset 2h \
--status Failed \
--output table
az functionapp show \
--resource-group $RG \
--name $APP_NAME \
--subscription $SUBSCRIPTION_ID \
--query "{state:state, defaultHostName:defaultHostName}" \
--output json
EventTimestamp OperationNameValue Status SubStatus
-------------------------- ------------------------------------- ------ ---------
2026-04-04T07:14:29.000000Z Microsoft.Web/sites/publishxml/action Failed Conflict
{
"state": "Running",
"defaultHostName": "func-orders-prod.azurewebsites.net"
}
How to Read This
state: Running is not proof of package integrity. If indexing errors disappear after redeploying prior artifact, classify as package corruption/incomplete upload.
7. Likely Root Cause Patterns¶
- Runtime/config drift at release time
- stack changed (
linuxFxVersion) without matching worker setting - worker setting changed without matching build target
- stack changed (
- Slot configuration drift
- non-sticky settings differ between slots
- run-from-package URL/token valid only in staging
- Extension bundle compatibility or retrieval failure
- host starts but specific triggers do not initialize
- bundle version range incompatible with app/runtime expectation
- Artifact assembly/publish defect
- missing root files (
host.json) or incomplete dependencies - CI artifact and deployed package hash differ
- missing root files (
8. Immediate Mitigations¶
- Roll back to last known-good artifact and release method.
- Restore known-good app settings snapshot, then restart host.
- If slots are used, swap back immediately.
- Align runtime settings and stack:
- set
FUNCTIONS_WORKER_RUNTIMEcorrectly - align
linuxFxVersionwith target runtime
- set
- Redeploy with validated package hash and immutable artifact source.
- If extension bundle errors exist, pin known-good bundle range and redeploy. Operational command examples:
9. Prevention¶
- Add CI/CD guardrails:
- validate
FUNCTIONS_WORKER_RUNTIME,linuxFxVersion, and artifact language compatibility - fail pipeline if required settings are missing
- validate
- Use immutable deployment controls:
- run-from-package with checksum verification
- record artifact checksum and release metadata
- Improve slot governance:
- explicit sticky-setting policy
- automated pre-swap settings diff check
- Add post-deploy automated health gates:
- startup trace validation in first 10 minutes
- automatic rollback threshold on failure rate and startup loops
- Manage extension bundle/runtime lifecycle intentionally:
- pin supported extension bundle ranges
- keep version compatibility matrix by app and plan
See Also¶
- First 10 Minutes
- Methodology
- KQL Query Library
- Deployment and release patterns
- Related Labs: Cold Start Lab