Lab Guide: Slot Swap Config Drift (Sticky vs Non-Sticky Settings)¶
This Level 3 lab guide reproduces a slot swap on Azure App Service Linux and proves, using real artifacts, how configuration behaves across swap boundaries. The lab shows why teams often perceive "drift" after a healthy swap.
Lab Metadata¶
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Estimated Duration | 45-60 minutes |
| Tier | Standard |
| Failure Mode | Apparent post-swap config drift caused by misunderstanding sticky versus non-sticky app settings |
| Skills Practiced | Deployment slot troubleshooting, sticky-setting validation, pre/post swap comparison, lifecycle log interpretation |
Lab outcome summary
The artifacts show an exact textbook result:
FEATURE_FLAG(non-sticky) swapped with the app context.DB_CONNECTION_STRING(sticky) stayed with each slot.
This is expected App Service slot behavior, not accidental misconfiguration.
1) Background¶
Slot swap incidents are often not platform failures. They are frequently interpretation failures.
To troubleshoot accurately, you must separate:
- Slot routing behavior.
- Sticky/non-sticky config behavior.
- Process restart and warm-up lifecycle behavior.
1.1 Deployment slots are separate runtime surfaces¶
Each slot is a separate runtime endpoint with its own hostname and runtime lifecycle.
flowchart TD
subgraph AppService[Single App Service App]
P[Production Slot\nHostname: app.azurewebsites.net]
S[Staging Slot\nHostname: app-staging.azurewebsites.net]
end
U[Users] --> P
O[Operators/Testers] --> S Even when slots share plan resources, they maintain independent runtime state and per-slot app setting context.
1.2 Swap behavior model¶
Swap can be thought of as role exchange with warm-up safeguards, not just a file copy.
sequenceDiagram
participant Ops as Operator
participant Platform as App Service Platform
participant Stage as Staging Slot
participant Prod as Production Slot
Ops->>Platform: Start swap (staging -> production)
Platform->>Stage: Validate and warm source runtime
Stage-->>Platform: Warm-up probe succeeded
Platform->>Prod: Shift production role
Platform->>Stage: Shift staging role
Platform-->>Ops: Swap completed 1.3 Sticky and non-sticky settings¶
The critical rule set:
| Setting class | During swap | Operational impact |
|---|---|---|
Non-sticky (slotSetting=false) | Moves with swapped app context | Looks like value "changed" on production |
Sticky (slotSetting=true) | Stays attached to physical slot | Looks like value "did not change" despite code swap |
This lab intentionally mixes one of each to produce a clear differential.
1.4 Why this creates perceived "drift"¶
After swap, operators may expect all values to follow code.
If that mental model is wrong:
- Non-sticky setting movement appears surprising.
- Sticky setting stability appears inconsistent.
In reality, both are correct outcomes.
1.5 Template-level implementation in this lab¶
From main.bicep:
- Production and staging are both configured with Python 3.11 runtime.
FEATURE_FLAGdiffers by slot (v1vsv2) and is non-sticky.DB_CONNECTION_STRINGdiffers by slot and is marked sticky viaslotConfigNames.
slotConfigNames excerpt:
resource slotConfigNames 'Microsoft.Web/sites/config@2023-12-01' = {
name: '${webApp.name}/slotConfigNames'
properties: {
appSettingNames: [
'DB_CONNECTION_STRING'
]
connectionStringNames: []
}
}
1.6 Expected before/after state map¶
flowchart TB
subgraph BeforeSwap[Before Swap]
P1[Production\nFEATURE_FLAG=v1\nDB_CONNECTION_STRING=prod-server]
S1[Staging\nFEATURE_FLAG=v2\nDB_CONNECTION_STRING=staging-server]
end
subgraph AfterSwap[After Swap]
P2[Production\nFEATURE_FLAG=v2\nDB_CONNECTION_STRING=prod-server]
S2[Staging\nFEATURE_FLAG=v1\nDB_CONNECTION_STRING=staging-server]
end
P1 --> S2
S1 --> P2 1.7 Warm-up lifecycle context¶
Swap operations are tightly coupled with startup/warm-up behavior:
- Container and app startup events occur around swap windows.
- Probe success transitions appear in
AppServicePlatformLogs. - Gunicorn startup appears in
AppServiceConsoleLogs.
This guide includes both config-state proof and lifecycle proof.
1.8 Why this lab matters¶
For production release safety, teams must be able to answer:
- Did the swap perform correctly?
- Did sticky settings stay where intended?
- Did non-sticky settings move as expected?
- Were restarts/warm-up transitions healthy?
This lab provides an evidence-driven method to answer all four.
2) Hypothesis¶
2.1 Formal hypothesis statement¶
When performing a slot swap, non-sticky app settings move with the swapped app context while slot settings (deployment slot settings) remain with the destination slot, creating apparent config drift only for operators who expect all settings to swap uniformly.
2.2 Causal chain¶
flowchart TD
A[Pre-swap config divergence by slot] --> B[Swap staging to production]
B --> C[Non-sticky values exchange]
B --> D[Sticky values remain slot-bound]
C --> E[Production FEATURE_FLAG changes]
D --> F[Production DB setting remains]
E --> G[Perceived drift if sticky model unknown]
F --> G 2.3 Proof criteria¶
The hypothesis is supported only if all conditions are met:
- Pre-swap production and staging values are different for both targeted settings.
- Post-swap
FEATURE_FLAGvalues are exchanged across slots. - Post-swap
DB_CONNECTION_STRINGvalues are unchanged per physical slot. - App settings metadata confirms
DB_CONNECTION_STRINGis sticky (slotSetting=true). - Lifecycle logs show restart/warm-up progression around swap time.
2.4 Disproof criteria¶
Any of these disproves the hypothesis:
FEATURE_FLAGdoes not exchange across slots after confirmed swap.DB_CONNECTION_STRINGexchanges despite sticky metadata.- Sticky metadata absent or inconsistent in pre-swap app settings.
- No lifecycle evidence near swap operation window.
2.5 Controlled and dependent variables¶
| Category | Values |
|---|---|
| Controlled inputs | Same app code package, same runtime, same region, same plan |
| Independent event | Single swap command (staging -> production) |
| Dependent signals | /config JSON, /diag/slots hash/state, process start timestamps, KQL logs |
2.6 Anticipated data pattern¶
| Signal | Expected observation |
|---|---|
FEATURE_FLAG | v1/v2 exchange |
DB_CONNECTION_STRING | prod/staging remains per slot |
| Process start | New startup timestamps near swap |
| Platform logs | Warm-up probe success + started/stopping lifecycle messages |
3) Runbook¶
This runbook is designed for deterministic replay. Commands use long flags only.
3.1 Prerequisites¶
| Requirement | Verification |
|---|---|
| Azure CLI installed | az version |
| Active login | az account show |
| Bash shell | bash --version |
| Python 3 | python3 --version |
3.2 Environment setup¶
Convention note:
- In command usage, this guide references
$RG,$APP_NAME, etc. - In declarations, variable names are written without
$(shell syntax).
3.3 Deploy lab infrastructure¶
az group create \
--name "$RG" \
--location "$LOCATION"
az deployment group create \
--resource-group "$RG" \
--template-file "labs/slot-swap-config-drift/main.bicep" \
--parameters "baseName=$BASE_NAME"
3.4 Resolve app and slot URLs¶
APP_NAME=$(az webapp list \
--resource-group "$RG" \
--query "[0].name" \
--output tsv)
PRODUCTION_HOST=$(az webapp show \
--resource-group "$RG" \
--name "$APP_NAME" \
--query "defaultHostName" \
--output tsv)
STAGING_HOST="$APP_NAME-staging.azurewebsites.net"
PRODUCTION_URL="https://$PRODUCTION_HOST"
STAGING_URL="https://$STAGING_HOST"
3.5 Validate slot settings metadata¶
Inspect production settings:
az webapp config appsettings list \
--resource-group "$RG" \
--name "$APP_NAME" \
--query "[?name=='SCM_DO_BUILD_DURING_DEPLOYMENT' || name=='DB_CONNECTION_STRING' || name=='FEATURE_FLAG'].{name:name,value:value,slotSetting:slotSetting}"
Inspect staging settings:
az webapp config appsettings list \
--resource-group "$RG" \
--name "$APP_NAME" \
--slot "staging" \
--query "[?name=='SCM_DO_BUILD_DURING_DEPLOYMENT' || name=='DB_CONNECTION_STRING' || name=='FEATURE_FLAG'].{name:name,value:value,slotSetting:slotSetting}"
Expected pre-swap model:
| name | production value | staging value | slotSetting |
|---|---|---|---|
| SCM_DO_BUILD_DURING_DEPLOYMENT | true | true | false |
| DB_CONNECTION_STRING | prod-server.database.windows.net | staging-server.database.windows.net | true |
| FEATURE_FLAG | v1 | v2 | false |
Portal view: Configuration blade (slot-aware settings entry point)¶

The Configuration blade is the Portal control plane for the same settings the CLI commands above expose. The Application settings sub-tab (not shown in this capture, which focuses on General settings) is where the per-row Deployment slot setting checkbox appears - that checkbox is the visual representation of the slotSetting:true flag in the expected pre-swap table above (DB_CONNECTION_STRING row). General settings (active here) governs platform behavior like Always on and Session affinity and follows different swap rules than App Settings; consult section 1.3 before changing anything here mid-incident. Use this blade to spot-check the sticky/non-sticky shape of your settings without parsing CLI JSON, then cross-reference the post-swap behavior with the KQL evidence in section 3.10.
3.6 Capture pre-swap runtime evidence¶
curl --silent --show-error "$PRODUCTION_URL/config"
curl --silent --show-error "$STAGING_URL/config"
curl --silent --show-error "$PRODUCTION_URL/diag/slots"
curl --silent --show-error "$STAGING_URL/diag/slots"
curl --silent --show-error "$PRODUCTION_URL/diag/stats"
curl --silent --show-error "$PRODUCTION_URL/diag/env"
Representative pre-swap output:
{"DB_CONNECTION_STRING":"prod-server.database.windows.net","FEATURE_FLAG":"v1","PROCESS_START_UTC":"2026-04-04T05:13:59.196136+00:00","WEBSITE_SLOT_NAME":"Production"}
{"DB_CONNECTION_STRING":"staging-server.database.windows.net","FEATURE_FLAG":"v2","PROCESS_START_UTC":"2026-04-04T05:34:02.102318+00:00","WEBSITE_SLOT_NAME":"Production"}
3.7 Execute swap¶
az webapp deployment slot swap \
--resource-group "$RG" \
--name "$APP_NAME" \
--slot "staging" \
--target-slot "production"
Wait for immediate post-swap stabilization:
3.8 Capture post-swap runtime evidence¶
curl --silent --show-error "$PRODUCTION_URL/config"
curl --silent --show-error "$STAGING_URL/config"
curl --silent --show-error "$PRODUCTION_URL/diag/slots"
curl --silent --show-error "$PRODUCTION_URL/diag/stats"
Representative post-swap output:
{"DB_CONNECTION_STRING":"prod-server.database.windows.net","FEATURE_FLAG":"v2","PROCESS_START_UTC":"2026-04-04T05:46:56.764695+00:00","WEBSITE_SLOT_NAME":"Production"}
{"DB_CONNECTION_STRING":"staging-server.database.windows.net","FEATURE_FLAG":"v1","PROCESS_START_UTC":"2026-04-04T05:49:05.541050+00:00","WEBSITE_SLOT_NAME":"Production"}
3.9 Run full trigger script (optional)¶
The script does:
- Package deployment to both slots.
- Pre-swap config capture.
- Swap execution.
- Post-swap capture.
- Behavior summary booleans:
- feature swapped
- sticky DB remained
- process restart observed
3.10 KQL runbook queries¶
Portal view: Activity log (swap audit trail)¶

The Activity log blade is the Portal anchor for swap-drift investigations - it gives you the audit trail of every ARM operation against the app, with timestamps, initiator, and status, while the KQL queries below give you the application-side evidence. This capture already shows the right scoping posture: the Timespan: Last 6 hours and Resource: app-test-20251107 chips are applied, narrowing the table to 11 items. of Succeeded operations - the same scope you want for a swap-window audit. The Event initiated by column (here user@example.com) tells you which principal triggered each operation, which is useful when correlating against the sticky-vs-non-sticky settings table in section 2. Use the KQL block below to verify the application-side restart and request behavior aligns with the audited operations.
3.10.1 HTTP request evidence¶
AppServiceHTTPLogs
| where TimeGenerated > ago(2h)
| where CsHost has "app-labswap"
| project TimeGenerated, CsUriStem, ScStatus, TimeTaken, CsHost
| order by TimeGenerated desc
3.10.2 Console startup evidence¶
AppServiceConsoleLogs
| where TimeGenerated > ago(2h)
| where ResultDescription has_any ("Starting gunicorn", "Booting worker", "App path is set", "Oryx", "Listening at")
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc
3.10.3 Platform lifecycle evidence¶
AppServicePlatformLogs
| where TimeGenerated > ago(2h)
| where Message has_any ("WarmUpProbeSucceeded", "Site startup probe succeeded", "Site started", "StoppingSite", "CreatingContainer", "PullingImage")
| project TimeGenerated, Level, Message
| order by TimeGenerated desc
3.11 Real output snippets from this lab¶
Platform warm-up progression:
2026-04-04T05:49:05.8804667Z State: Starting, Action: WarmUpProbeSucceeded ... Site startup probe succeeded after 55.2013831 seconds.
2026-04-04T05:49:06.6769441Z Site started.
Console startup progression:
[2026-04-04 05:49:04 +0000] [1895] [INFO] Starting gunicorn 25.3.0
[2026-04-04 05:49:04 +0000] [1896] [INFO] Booting worker with pid: 1896
[2026-04-04 05:49:04 +0000] [1897] [INFO] Booting worker with pid: 1897
HTTP verification calls:
2026-04-04T05:51:21.610129Z /config 200 47
2026-04-04T05:51:22.477189Z /config 200 14
2026-04-04T05:51:24.143418Z /diag/slots 200 11
3.12 Operator checklist¶
| Check | Pass condition |
|---|---|
| Sticky metadata validated | DB_CONNECTION_STRING has slotSetting=true |
| Pre-swap captured | /config captured for both slots |
| Swap command successful | No CLI error and status returned |
| Post-swap captured | /config captured for both slots |
| Behavior validated | FEATURE_FLAG exchanged, DB stayed |
| Lifecycle validated | warm-up/start events visible in KQL |
3.13 Decision tree for incident responders¶
flowchart TD
A[Post-swap config surprise] --> B{Is setting sticky?}
B -->|Yes| C[Expected to remain with slot]
B -->|No| D[Expected to move with swapped context]
C --> E{Observed unchanged?}
D --> F{Observed exchanged?}
E -->|Yes| G[Behavior correct]
E -->|No| H[Investigate setting metadata]
F -->|Yes| G
F -->|No| H
G --> I[Check warm-up and HTTP health]
H --> J[Review slotConfigNames and deployment pipeline] 4) Experiment Log¶
All data below is sourced from:
labs/slot-swap-config-drift/artifacts-sanitized/
4.1 Artifact inventory¶
| Category | Files |
|---|---|
| Baseline state | config-prod.json, config-staging.json |
| Baseline setting metadata | app-settings-prod.json, app-settings-staging.json |
| Baseline diagnostics | diag-slots-prod.json, diag-slots-staging.json, diag-stats-prod.json, diag-env-prod.json, app-config.json |
| Trigger state captures | config-prod-before-20260404T054527Z.json, config-staging-before-20260404T054527Z.json, config-prod-after-20260404T055128Z.json, config-staging-after-20260404T055128Z.json |
| Trigger diagnostics | diag-slots-postswap-20260404T055128Z.json, diag-stats-postswap-20260404T055128Z.json |
| KQL exports | kql-http-20260404T060610Z.json, kql-console-20260404T060610Z.json, kql-platform-20260404T060610Z.json |
4.2 Baseline evidence¶
4.2.1 Baseline config snapshots¶
Production baseline:
{"DB_CONNECTION_STRING":"prod-server.database.windows.net","FEATURE_FLAG":"v1","PROCESS_START_UTC":"2026-04-04T05:13:59.196136+00:00","WEBSITE_SLOT_NAME":"Production"}
Staging baseline:
{"DB_CONNECTION_STRING":"staging-server.database.windows.net","FEATURE_FLAG":"v2","PROCESS_START_UTC":"2026-04-04T05:34:01.911025+00:00","WEBSITE_SLOT_NAME":"Production"}
4.2.2 Baseline app setting metadata¶
Production (baseline/app-settings-prod.json):
| name | slotSetting | value |
|---|---|---|
| SCM_DO_BUILD_DURING_DEPLOYMENT | false | true |
| DB_CONNECTION_STRING | true | prod-server.database.windows.net |
| FEATURE_FLAG | false | v1 |
Staging (baseline/app-settings-staging.json):
| name | slotSetting | value |
|---|---|---|
| SCM_DO_BUILD_DURING_DEPLOYMENT | false | true |
| DB_CONNECTION_STRING | true | staging-server.database.windows.net |
| FEATURE_FLAG | false | v2 |
4.2.3 Baseline diagnostics¶
baseline/diag-stats-prod.json:
{"endpoint_counters":{"<unknown>":1,"diag_env":1,"diag_slots":1,"diag_stats":1,"index":1},"last_config_snapshot":null,"pid":1898,"process_start_time":"2026-04-04T05:13:59.135627+00:00","request_count":6,"uptime_seconds":1212.651}
baseline/diag-env-prod.json highlights:
| Key | Value |
|---|---|
| PORT | 8000 |
| DB_CONNECTION_STRING | prod-server.database.windows.net |
| FEATURE_FLAG | v1 |
| WEBSITE_HOSTNAME | redacted production host |
| WEBSITE_SLOT_NAME | <unset> |
4.3 Trigger pre-swap captures¶
From trigger artifacts:
| File | DB_CONNECTION_STRING | FEATURE_FLAG | PROCESS_START_UTC |
|---|---|---|---|
config-prod-before-20260404T054527Z.json | prod-server.database.windows.net | v1 | 2026-04-04T05:13:59.196136+00:00 |
config-staging-before-20260404T054527Z.json | staging-server.database.windows.net | v2 | 2026-04-04T05:34:02.102318+00:00 |
4.4 Trigger post-swap captures¶
From trigger artifacts:
| File | DB_CONNECTION_STRING | FEATURE_FLAG | PROCESS_START_UTC |
|---|---|---|---|
config-prod-after-20260404T055128Z.json | prod-server.database.windows.net | v2 | 2026-04-04T05:46:56.764695+00:00 |
config-staging-after-20260404T055128Z.json | staging-server.database.windows.net | v1 | 2026-04-04T05:49:05.541050+00:00 |
4.5 Assertion table: expected vs observed¶
| Assertion | Expected | Observed | Verdict |
|---|---|---|---|
| Production FEATURE_FLAG swaps from v1 to v2 | Yes | Yes | Pass |
| Staging FEATURE_FLAG swaps from v2 to v1 | Yes | Yes | Pass |
| Production DB remains prod-server | Yes | Yes | Pass |
| Staging DB remains staging-server | Yes | Yes | Pass |
4.6 Programmatic summary checks (artifact-derived)¶
| Check | Result |
|---|---|
feature_swapped | True |
db_sticky | True |
These checks produce a deterministic proof of sticky/non-sticky mechanics.
4.7 Process restart evidence¶
Using timestamp differences from before/after JSON:
| Perspective | Before | After | Delta seconds |
|---|---|---|---|
| Production endpoint | 05:13:59.196136 | 05:46:56.764695 | 1977.569 |
| Staging endpoint | 05:34:02.102318 | 05:49:05.541050 | 903.439 |
This indicates new process generations became active during swap lifecycle.
4.8 /diag/slots snapshots and config hash evolution¶
Pre-swap production (baseline/diag-slots-prod.json):
{"config_hash":"027d574d88e7b2722dc7da98bc5d1bb337a75f1331bb795d2a83aba6ce291fd7","current_config":{"DB_CONNECTION_STRING":"prod-server.database.windows.net","FEATURE_FLAG":"v1","PROCESS_START_UTC":"2026-04-04T05:13:59.196136+00:00","WEBSITE_SLOT_NAME":"Production"}}
Pre-swap staging (baseline/diag-slots-staging.json):
{"config_hash":"7f5ed03ca7b0fdd301f27ed036155631ca62ffce8a18b4fe5bce0235f02216bc","current_config":{"DB_CONNECTION_STRING":"staging-server.database.windows.net","FEATURE_FLAG":"v2","PROCESS_START_UTC":"2026-04-04T05:34:02.102318+00:00","WEBSITE_SLOT_NAME":"Production"}}
Post-swap sample (trigger/diag-slots-postswap-20260404T055128Z.json):
{"config_hash":"4069415d1d95ca3d38f723e476f8956726634a0b7fddf338499168a7d7198b3d","current_config":{"DB_CONNECTION_STRING":"prod-server.database.windows.net","FEATURE_FLAG":"v2","PROCESS_START_UTC":"2026-04-04T05:46:57.025552+00:00","WEBSITE_SLOT_NAME":"Production"}}
4.9 KQL export volume summary¶
| Export file | Rows |
|---|---|
kql-http-20260404T060610Z.json | 23 |
kql-console-20260404T060610Z.json | 94 |
kql-platform-20260404T060610Z.json | 200 |
4.10 HTTP evidence details¶
Representative rows from HTTP export:
| TimeGenerated (UTC) | Path | Status | TimeTaken (ms) | Host |
|---|---|---|---|---|
| 2026-04-04T05:45:28.655039Z | /config | 200 | 120 | production host |
| 2026-04-04T05:45:29.538204Z | /config | 200 | 30 | staging host |
| 2026-04-04T05:47:47.997103Z | / | 200 | 8 | staging host |
| 2026-04-04T05:47:48.045794Z | / | 200 | 29 | production host |
| 2026-04-04T05:50:24.579835Z | /config | 200 | 92 | production host |
| 2026-04-04T05:51:21.610129Z | /config | 200 | 47 | production host |
| 2026-04-04T05:51:22.477189Z | /config | 200 | 14 | staging host |
| 2026-04-04T05:51:24.143418Z | /diag/slots | 200 | 11 | production host |
Observation: both slots remained reachable and healthy during/after swap evidence window.
4.11 Console evidence details¶
Representative console lines:
| TimeGenerated (UTC) | Message |
|---|---|
| 2026-04-04T05:49:04.0022069Z | Starting gunicorn 25.3.0 |
| 2026-04-04T05:49:04.0105079Z | Listening at: http://0.0.0.0:8000 (1895) |
| 2026-04-04T05:49:04.0742069Z | Booting worker with pid: 1896 |
| 2026-04-04T05:49:04.163576Z | Booting worker with pid: 1897 |
| 2026-04-04T05:49:04.6153423Z | Control socket listening at /root/.gunicorn/gunicorn.ctl |
| 2026-04-04T05:48:51.0515103Z | Oryx Version: 0.2.20260130.1 ... |
Observation: process lifecycle aligns with swap-time startup sequence.
4.12 Platform lifecycle evidence details¶
Representative platform lines:
| TimeGenerated (UTC) | Level | Message excerpt |
|---|---|---|
| 2026-04-04T05:48:01.0779209Z | Informational | Action: PullingImage |
| 2026-04-04T05:48:06.2231082Z | Informational | Action: CreatingContainer ... |
| 2026-04-04T05:48:10.1875541Z | Informational | Action: WaitingForSiteWarmUpProbeSuccess |
| 2026-04-04T05:49:05.8804667Z | Informational | Action: WarmUpProbeSucceeded ... 55.2013831 seconds |
| 2026-04-04T05:49:06.6769441Z | Informational | Site started. |
| 2026-04-04T05:49:26.7486051Z | Informational | State: Stopping, Action: StoppingSite |
| 2026-04-04T05:49:32.464447Z | Informational | Site ... stopped. |
These lines validate that the swap involved observable start/stop/warm-up transitions.
4.13 Hypothesis evaluation matrix¶
| Criterion | Status | Evidence source |
|---|---|---|
| Pre-swap value divergence present | Pass | config-prod-before, config-staging-before |
| Non-sticky value exchanged | Pass | config-prod-after, config-staging-after |
| Sticky value remained slot-bound | Pass | same artifacts + metadata |
| Sticky metadata explicit | Pass | app-settings-prod, app-settings-staging |
| Lifecycle around swap observed | Pass | kql-console, kql-platform |
Final hypothesis verdict: Supported.
4.14 Key finding (required conclusion)¶
Perfect config drift demo
The experiment precisely matches documented swap behavior:
FEATURE_FLAG(non-sticky) swapped with code.DB_CONNECTION_STRING(sticky/slot-setting) stayed with the slot.
This proves the sticky setting mechanism works exactly as documented.
4.15 Practical lessons for release engineering¶
- Treat sticky/non-sticky classification as release-critical metadata.
- Include pre/post slot snapshots in every production swap runbook.
- Align incident response language with slot semantics ("expected movement" vs "drift").
- Use KQL lifecycle evidence to distinguish swap mechanics from app defects.
4.16 Recommended production controls¶
| Control | Why |
|---|---|
IaC enforcement for slotConfigNames | Prevent accidental non-sticky secrets |
| Swap pre-check script | Catch config surprises before cutover |
| Post-swap smoke tests on both hostnames | Verify both resulting roles |
| KQL dashboards for swap windows | Fast incident interpretation |
4.17 Pre-swap checklist template¶
| Item | Complete |
|---|---|
| Sticky settings inventory reviewed | ☐ |
| Non-sticky feature flags reviewed | ☐ |
| Pre-swap snapshots captured (both slots) | ☐ |
| Warm-up endpoint validated | ☐ |
| Rollback strategy documented | ☐ |
4.18 Post-swap checklist template¶
| Item | Complete |
|---|---|
Production /config verified | ☐ |
Staging /config verified | ☐ |
| Sticky values unchanged by slot | ☐ |
| Non-sticky values exchanged | ☐ |
| Startup/warm-up logs healthy | ☐ |
4.19 Reproducibility and data integrity notes¶
- Every numeric and string value in this section is sourced from checked-in sanitized artifacts.
- Hostnames and subscription identifiers are redacted where required.
- No placeholders were used for experimental result tables.
4.20 Limitations and scope¶
This lab validates swap/config semantics for this specific setup:
- Linux App Service, Python runtime.
- S1 plan, single app with one staging slot.
- Two controlled app settings.
It does not attempt to benchmark multi-slot, multi-instance, or multi-region swap complexity.
Expected Evidence¶
This section defines what you SHOULD observe at each phase of the lab. Use it to validate your investigation is on track.
Before Trigger (Baseline)¶
| Evidence Source | Expected State | What to Capture |
|---|---|---|
Slot configuration endpoint (/config) | Production and staging show different values before swap | Production: DB_CONNECTION_STRING=prod-server..., Staging: DB_CONNECTION_STRING=staging-server... |
| App settings metadata | Sticky classification is explicitly configured | DB_CONNECTION_STRING has slotSetting=true; FEATURE_FLAG has slotSetting=false |
| Baseline slot snapshots | Pre-swap state recorded for both slots | config-prod-before, config-staging-before, diag-slots artifacts |
During Incident¶
| Evidence Source | Expected State | Key Indicator |
|---|---|---|
/config after swap | App stays healthy but values shift by sticky rules | Production /config returns 200 in ~63 ms while showing swapped/non-swapped mix |
| AppServiceConsoleLogs | Runtime boot appears normal during swap window | Gunicorn startup lines and Listening at: http://0.0.0.0:8000 |
| AppServicePlatformLogs | Swap lifecycle includes startup churn without direct app crash | LastError: ContainerTimeout visible from deployment/startup lifecycle context |
After Recovery¶
| Evidence Source | Expected State | Key Indicator |
|---|---|---|
| Post-swap config comparison | Non-sticky moved, sticky remained slot-bound | FEATURE_FLAG exchanged, DB_CONNECTION_STRING stayed with each slot |
| Slot behavior interpretation | Swap operation itself is healthy | Endpoints remain 200 while data-plane config differs by slot policy |
| Operational conclusion | Root cause is config governance, not swap failure | Missing/incorrect sticky setting design explains perceived drift |
Evidence Timeline¶
graph TD
A[Baseline Capture] --> B[Trigger Fault]
B --> C[During: Collect Evidence]
C --> D[After: Compare to Baseline]
D --> E[Verdict: Confirmed/Falsified] Evidence Chain: Why This Proves the Hypothesis¶
Falsification Logic
If you observe healthy HTTP 200 behavior with post-swap configuration values that follow sticky/non-sticky rules (FEATURE_FLAG swapped, DB_CONNECTION_STRING remained slot-bound), the hypothesis is CONFIRMED because the platform executed swap correctly and the perceived drift comes from slot-setting semantics.
If you do NOT observe this pattern (for example sticky values swap unexpectedly or swap fails operationally), the hypothesis is FALSIFIED — consider alternatives such as missing slotConfigNames, manual config mutation, or incomplete swap execution.
Clean Up¶
Related Playbook¶
See Also¶
- Playbook: Slot Swap Restart / Config Drift / Warm-up Race
- Playbook: Slot Swap Failed During Warm-up
- Playbook: Warm-up vs Health Check
- KQL: Restart Timing Correlation
- Troubleshooting Method