Lab Guide: Hosting Plan Security Matrix — Private Endpoint + Managed Identity Fault Injection
This Level 4 lab guide reproduces the same four security and access faults across four Azure Functions hosting plans and documents how each plan fails differently. The goal is to build evidence-driven, plan-specific troubleshooting muscle so operators can stop using one-size-fits-all runbooks for fundamentally different runtime architectures.
| Field | Value |
| Lab focus | Cross-plan security fault injection and failure signature mapping |
| Plans tested | Flex Consumption (FC1), Premium (EP1), Consumption (Y1), Dedicated (S1) |
| Fault types | RBAC removal, identity removal, DNS break, network path break |
| Failure trigger method | Controlled CLI fault injection per plan with timed observations |
| Diagnostic approach | HTTP probing, DNS resolution checks, Azure Monitor signals, Activity Log correlation |
| Difficulty | Level 4 — Advanced cross-plan |
| Estimated duration | 4-6 hours for full matrix |
| Runtime profile | Azure Functions v4 (Python v2 reference app) |
| Storage model coverage | Identity-based storage access with private endpoint and public-only variants |
| Output artifact | Cross-plan error signature matrix with timeline evidence |
What makes this lab different from single-plan labs
This lab is intentionally asymmetric.
The same fault is injected into different hosting architectures. The objective is not "does it fail" but "how does it fail on each plan, how fast, and with what diagnostic shape".
A successful run ends with a plan-specific incident response matrix, not a generic checklist.
1) Background
A matrix approach matters because Azure Functions hosting plans are not runtime-equivalent for networking, identity token lifecycle, storage boot dependencies, and control-plane/data-plane interaction.
In real incidents, teams often transfer a successful Premium or Dedicated triage sequence into Flex or Consumption and miss the root cause window. This lab demonstrates why that happens.
1.1 Why generic troubleshooting fails
| Operational assumption | Why it fails in cross-plan reality |
| "A managed identity change should fail immediately" | False on always-on plans (EP1, S1) due to token caching; often true on recycle-prone plans (FC1, Y1) |
| "Health endpoint means app is healthy" | False for FC1 in DNS break case: /api/health can return 200 while storage operations time out |
| "Private endpoint deletion always causes obvious DNS failure" | False on FC1 and EP1 where stale private DNS records can remain, causing deceptive resolution |
| "503 means the same root cause everywhere" | False: EP1 can differentiate runtime-host fault vs container/app-layer page; S1 may collapse to one 503 signature |
| "No VNet issue means no networking issue" | Y1 has no VNet/PE path, so network fault classes are structurally out of scope |
1.2 Cross-plan architecture map
flowchart TB
subgraph FC1[FC1 Flex Consumption]
FCAPP[Function App FC1]
FCVNET[VNet Integration]
FCMI[User-assigned Managed Identity]
FCST[Storage Account allowSharedKeyAccess false]
FCBLOB[Private Endpoint blob]
FCQUEUE[Private Endpoint queue]
FCTABLE[Private Endpoint table]
FCFILE[Private Endpoint file]
FCDNS[Private DNS Zones]
FCAPP --> FCVNET
FCAPP --> FCMI
FCAPP --> FCST
FCVNET --> FCBLOB
FCVNET --> FCQUEUE
FCVNET --> FCTABLE
FCVNET --> FCFILE
FCBLOB --> FCDNS
FCQUEUE --> FCDNS
FCTABLE --> FCDNS
FCFILE --> FCDNS
end
subgraph EP1[EP1 Premium]
EPAPP[Function App EP1]
EPVNET[VNet Integration]
EPMI[System-assigned Managed Identity]
EPST[Storage Account]
EPBLOB[Private Endpoint blob]
EPQUEUE[Private Endpoint queue]
EPTABLE[Private Endpoint table]
EPFILE[Private Endpoint file for content share]
EPCONTENT[vnetContentShareEnabled true]
EPDNS[Private DNS Zones]
EPAPP --> EPVNET
EPAPP --> EPMI
EPAPP --> EPST
EPAPP --> EPCONTENT
EPVNET --> EPBLOB
EPVNET --> EPQUEUE
EPVNET --> EPTABLE
EPVNET --> EPFILE
EPBLOB --> EPDNS
EPQUEUE --> EPDNS
EPTABLE --> EPDNS
EPFILE --> EPDNS
end
subgraph Y1[Y1 Consumption]
YAPP[Function App Y1]
YMI[System-assigned Managed Identity]
YST[Storage Account Public Endpoints]
YCFG[AzureWebJobsStorage__accountName]
YAPP --> YMI
YAPP --> YST
YAPP --> YCFG
end
subgraph S1[Dedicated S1]
SAPP[Function App S1]
SVNET[VNet Integration]
SMI[System-assigned Managed Identity]
SST[Storage Account]
SBLOB[Private Endpoint blob]
SQUEUE[Private Endpoint queue]
STABLE[Private Endpoint table]
SSITE[Site Private Endpoint]
SRUN[Run From Package no file PE]
SDNS[Private DNS Zones]
SAPP --> SVNET
SAPP --> SMI
SAPP --> SST
SAPP --> SRUN
SVNET --> SBLOB
SVNET --> SQUEUE
SVNET --> STABLE
SVNET --> SSITE
SBLOB --> SDNS
SQUEUE --> SDNS
STABLE --> SDNS
end
1.3 Asymmetric Feature Matrix
| Capability | FC1 | EP1 | Y1 | S1 Dedicated |
| VNet integration | Yes | Yes | No | Yes |
| Storage private endpoints | 4 (blob,queue,table,file) | 4 (blob,queue,table,file) | No | 3 (blob,queue,table) |
| File endpoint requirement | Yes | Yes (content share path) | No | No (run-from-package pattern) |
| Site private endpoint | Optional | Optional | No | Yes (tested) |
| Managed identity mode in test | User-assigned | System-assigned | System-assigned | System-assigned |
| allowSharedKeyAccess: false tested | Yes | Not primary variable | Not primary variable | Not primary variable |
| vnetContentShareEnabled relevance | No | Yes | No | No |
| DNS private zone dependency | High | High | None | Medium-High |
| Always-on token cache behavior | Low (recycle-prone) | High | Low | High |
| DNS fault class applicable | Yes | Yes | No | Yes |
| Network path fault class applicable | Yes | Yes | No | Yes |
| Uniform error-page behavior across faults | No | No | Mostly generic | Mostly uniform |
1.4 Lab objective framing
This lab validates that fault semantics are plan-bound.
Instead of asking "what is the error", ask:
- Which hosting plan produced the error?
- Which layer emitted the error page or timeout pattern?
- Did the identity token cache mask the change?
- Did DNS resolution shape differ from data-path accessibility?
- Is the fault class even possible on this plan?
The rest of the guide is organized to produce those answers with repeatable evidence.
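The last question in the list can be answered mechanically before any probing starts. The sketch below encodes only the two applicability rows of the feature matrix in 1.3; the function name `fault_applicable` and the lowercase fault labels are illustrative choices, not part of the lab tooling.

```shell
# fault_applicable PLAN FAULT
# PLAN is one of FC1, EP1, Y1, S1; FAULT is one of rbac, identity, dns, network.
# Prints "yes" or "no" according to the applicability rows of the feature matrix.
fault_applicable() {
  case "$1" in
    FC1|EP1|Y1|S1) ;;
    *) echo "unknown plan: $1" >&2; return 2 ;;
  esac
  case "$2" in
    rbac|identity)
      # Identity and RBAC faults are possible on every plan in this lab.
      echo "yes" ;;
    dns|network)
      # Y1 has no VNet or private endpoint path, so these classes are out of scope.
      if [ "$1" = "Y1" ]; then echo "no"; else echo "yes"; fi ;;
    *)
      echo "unknown fault: $2" >&2; return 2 ;;
  esac
}
```

For example, `fault_applicable Y1 dns` prints `no`, which is a cue to skip the DNS branch of the runbook for that app.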
2) Hypothesis
If we apply identical fault injections (RBAC removal, identity removal, DNS break, network path break) across all 4 hosting plans, each plan will exhibit distinct failure signatures, propagation timelines, and diagnostic indicators that require plan-specific troubleshooting approaches.
2.2 Causal chain
flowchart LR
A[Apply same fault class to FC1 EP1 Y1 S1] --> B[Plan-specific infrastructure path activated]
B --> C[Different token lifecycle and dependency startup behavior]
C --> D[Different external symptom shape]
D --> E[Different monitoring signatures]
E --> F[Different triage branch required]
F --> G[Need plan-specific runbook]
2.3 Proof criteria
| ID | Requirement | Evidence expectation |
| P1 | Same fault does not produce identical user-facing error signatures across all plans | Matrix table contains distinct HTTP/result patterns per plan |
| P2 | Identity removal is time-asymmetric by plan | EP1 and S1 continue functioning after removal until restart/expiry; FC1 and Y1 degrade quickly |
| P3 | DNS break can produce non-obvious health state | FC1 health endpoint remains 200 while storage paths fail |
| P4 | EP1 fault class can be inferred from error page variant | RBAC shows runtime-host style failure; DNS/network shows application error page |
| P5 | Y1 excludes VNet-only fault classes by design | DNS/network fault rows marked not applicable |
| P6 | Recovery and disambiguation procedure differs by plan | Runbook branches require plan-specific checks and restart strategy |
2.4 Disproof criteria
| ID | Disproof condition | Interpretation |
| D1 | All plans produce same observable error for all faults | Hypothesis invalid; troubleshooting could be generic |
| D2 | Identity removal impacts all plans immediately without restart differences | Token caching asymmetry claim not supported |
| D3 | FC1 health endpoint fails at same time as storage calls in DNS break | Invisible failure claim not supported |
| D4 | EP1 shows same error page shape for RBAC and DNS/network faults | Error page differentiation claim not supported |
| D5 | Y1 reproduces VNet DNS/network faults | Plan-bound scope assumption invalid |
2.5 Competing explanations to guard against
| Competing explanation | How this lab controls for it |
| Random cold starts caused observed variance | Timed injections and repeated probes separate cold-start effects from fault effects |
| App code bug created all failures | Same app package deployed across plans before fault injection; baseline healthy |
| Storage service outage | Parallel multi-plan asymmetry plus targeted recovery reversibility indicates local configuration fault |
| Monitoring delay created fake timeline | HTTP probes and DNS checks run alongside telemetry to correlate ground truth |
3) Runbook
3.1 Prerequisites
| Requirement | Validation command |
| Azure CLI installed | az version |
| Logged in and correct subscription selected | az account show --output table |
| Permission to deploy networking, private endpoints, and role assignments | az role assignment list --assignee "<object-id>" --all |
| MkDocs project cloned locally | ls |
| Optional: jq for output parsing | jq --version |
Use canonical variables:
LOCATION="koreacentral"
SUBSCRIPTION_ID="<subscription-id>"
BASE_FC1="lab-matrix-fc1"
BASE_EP1="lab-matrix-ep1"
BASE_Y1="lab-matrix-y1"
BASE_S1="lab-matrix-s1"
RG_FC1="rg-${BASE_FC1}"
RG_EP1="rg-${BASE_EP1}"
RG_Y1="rg-${BASE_Y1}"
RG_S1="rg-${BASE_S1}"
APP_FC1="${BASE_FC1}-func"
APP_EP1="${BASE_EP1}-func"
APP_Y1="${BASE_Y1}-func"
APP_S1="${BASE_S1}-func"
STORAGE_FC1="${BASE_FC1//-/}storage"
STORAGE_EP1="${BASE_EP1//-/}storage"
STORAGE_Y1="${BASE_Y1//-/}storage"
STORAGE_S1="${BASE_S1//-/}storage"
IDENTITY_FC1="${BASE_FC1}-identity"
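Storage account names must be 3 to 24 characters of lowercase letters and digits, so it may be worth checking the derived names before deploying, especially if you change the base names. The helper below is a minimal sketch; `valid_storage_name` is a hypothetical name introduced here.

```shell
# valid_storage_name NAME
# Returns 0 when NAME is 3-24 characters long and contains only lowercase
# letters and digits; returns 1 otherwise.
valid_storage_name() {
  case "$1" in
    # Explicit character list avoids locale-dependent range matching.
    *[!abcdefghijklmnopqrstuvwxyz0123456789]*) return 1 ;;
  esac
  len=${#1}
  [ "$len" -ge 3 ] && [ "$len" -le 24 ]
}

# Possible usage with the variables defined above:
# for NAME in "$STORAGE_FC1" "$STORAGE_EP1" "$STORAGE_Y1" "$STORAGE_S1"; do
#   valid_storage_name "$NAME" || echo "invalid storage account name: $NAME"
# done
```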
3.2 Deployment sequence
Deploy one plan at a time so baseline and failure windows are easier to isolate.
3.2.1 FC1 deployment
az group create \
--name "$RG_FC1" \
--location "$LOCATION"
az deployment group create \
--resource-group "$RG_FC1" \
--template-file "infra/flex-consumption/main.bicep" \
--parameters \
baseName="$BASE_FC1"
3.2.2 EP1 deployment
az group create \
--name "$RG_EP1" \
--location "$LOCATION"
az deployment group create \
--resource-group "$RG_EP1" \
--template-file "infra/premium/main.bicep" \
--parameters \
baseName="$BASE_EP1"
3.2.3 Y1 deployment
az group create \
--name "$RG_Y1" \
--location "$LOCATION"
az deployment group create \
--resource-group "$RG_Y1" \
--template-file "infra/consumption/main.bicep" \
--parameters \
baseName="$BASE_Y1"
3.2.4 Dedicated S1 deployment
az group create \
--name "$RG_S1" \
--location "$LOCATION"
az deployment group create \
--resource-group "$RG_S1" \
--template-file "infra/dedicated/main.bicep" \
--parameters \
baseName="$BASE_S1"
3.3 Baseline collection procedure
3.3.1 Endpoint probe loop
for APP_HOST in "$APP_FC1.azurewebsites.net" "$APP_EP1.azurewebsites.net" "$APP_Y1.azurewebsites.net" "$APP_S1.azurewebsites.net"; do
echo "=== $APP_HOST ==="
curl --silent --show-error --max-time 15 "https://$APP_HOST/api/health" --output /dev/null --write-out "health:%{http_code} total:%{time_total}\n"
curl --silent --show-error --max-time 15 "https://$APP_HOST/api/blob/read" --output /dev/null --write-out "read:%{http_code} total:%{time_total}\n"
curl --silent --show-error --max-time 15 "https://$APP_HOST/api/blob/write" --output /dev/null --write-out "write:%{http_code} total:%{time_total}\n"
done
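The timelines in the experiment log are keyed by seconds since injection (T+00s, T+30s, and so on). One way to produce rows in that shape is to record an epoch timestamp at injection time and format each probe result against it. This is a sketch under that assumption; `probe_line` is a hypothetical helper, not part of the lab repository.

```shell
# probe_line T0_EPOCH NOW_EPOCH HOST ENDPOINT HTTP_CODE TOTAL_SECONDS
# Prints one evidence row keyed by seconds elapsed since the injection time T0.
probe_line() {
  elapsed=$(( $2 - $1 ))
  printf 'T+%02ds %s %s code=%s total=%ss\n' "$elapsed" "$3" "$4" "$5" "$6"
}

# Possible usage (T0 is captured right after the injection command returns):
# T0=$(date +%s)
# RESULT=$(curl --silent --max-time 15 --output /dev/null \
#   --write-out '%{http_code} %{time_total}' "https://$APP_FC1.azurewebsites.net/api/blob/read")
# probe_line "$T0" "$(date +%s)" "$APP_FC1" /api/blob/read ${RESULT}
```

Appending these rows to a per-plan file gives a timeline that can be compared directly with the tables in section 4.3.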
3.3.2 DNS baseline collection for PE plans
for STORAGE_NAME in "$STORAGE_FC1" "$STORAGE_EP1" "$STORAGE_S1"; do
echo "=== $STORAGE_NAME ==="
nslookup "$STORAGE_NAME.blob.core.windows.net"
nslookup "$STORAGE_NAME.queue.core.windows.net"
nslookup "$STORAGE_NAME.table.core.windows.net"
done
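When comparing DNS answers before and after a fault, the useful distinction is whether the storage name resolves to a private address or a public one. The sketch below classifies an IPv4 address by the RFC 1918 private ranges; `ip_class` is a hypothetical helper and it does not validate that its input is a well-formed address.

```shell
# ip_class IPV4
# Prints "private" for RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12,
# 192.168.0.0/16) and "public" for anything else. Input is not validated.
ip_class() {
  case "$1" in
    10.*|192.168.*) echo "private" ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo "private" ;;
    *) echo "public" ;;
  esac
}
```

For example, `ip_class 10.0.1.4` prints `private`, while `ip_class 20.150.1.2` prints `public`.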
3.3.3 RBAC baseline snapshot
# FC1 uses user-assigned identity — get the UAI principal, not the app principal
MI_PRINCIPAL_FC1=$(az identity show --resource-group "$RG_FC1" --name "$IDENTITY_FC1" --query "principalId" --output tsv)
APP_PRINCIPAL_EP1=$(az functionapp identity show --resource-group "$RG_EP1" --name "$APP_EP1" --query "principalId" --output tsv)
APP_PRINCIPAL_Y1=$(az functionapp identity show --resource-group "$RG_Y1" --name "$APP_Y1" --query "principalId" --output tsv)
APP_PRINCIPAL_S1=$(az functionapp identity show --resource-group "$RG_S1" --name "$APP_S1" --query "principalId" --output tsv)
STORAGE_ID_FC1=$(az storage account show --resource-group "$RG_FC1" --name "$STORAGE_FC1" --query "id" --output tsv)
STORAGE_ID_EP1=$(az storage account show --resource-group "$RG_EP1" --name "$STORAGE_EP1" --query "id" --output tsv)
STORAGE_ID_Y1=$(az storage account show --resource-group "$RG_Y1" --name "$STORAGE_Y1" --query "id" --output tsv)
STORAGE_ID_S1=$(az storage account show --resource-group "$RG_S1" --name "$STORAGE_S1" --query "id" --output tsv)
az role assignment list --assignee "$MI_PRINCIPAL_FC1" --scope "$STORAGE_ID_FC1" --output table
az role assignment list --assignee "$APP_PRINCIPAL_EP1" --scope "$STORAGE_ID_EP1" --output table
az role assignment list --assignee "$APP_PRINCIPAL_Y1" --scope "$STORAGE_ID_Y1" --output table
az role assignment list --assignee "$APP_PRINCIPAL_S1" --scope "$STORAGE_ID_S1" --output table
3.3.4 Baseline KQL pack
let startTime = ago(30m);
requests
| where timestamp > startTime
| where cloud_RoleName in ("lab-matrix-fc1-func","lab-matrix-ep1-func","lab-matrix-y1-func","lab-matrix-s1-func")
| summarize Calls=count(), Failures=countif(success == false), P95Ms=percentile(duration,95) by cloud_RoleName, operation_Name, resultCode
| order by cloud_RoleName asc, operation_Name asc
let startTime = ago(30m);
dependencies
| where timestamp > startTime
| where cloud_RoleName in ("lab-matrix-fc1-func","lab-matrix-ep1-func","lab-matrix-y1-func","lab-matrix-s1-func")
| where target has "core.windows.net"
| summarize Calls=count(), Failed=countif(success == false), P95Ms=percentile(duration,95) by cloud_RoleName, target, resultCode
| order by cloud_RoleName asc, target asc
3.4 Fault Injection 1: RBAC removal
3.4.1 Injection command
az role assignment delete \
--assignee "$MI_PRINCIPAL_FC1" \
--role "Storage Blob Data Owner" \
--scope "$STORAGE_ID_FC1"
az role assignment delete \
--assignee "$APP_PRINCIPAL_EP1" \
--role "Storage Blob Data Owner" \
--scope "$STORAGE_ID_EP1"
az role assignment delete \
--assignee "$APP_PRINCIPAL_Y1" \
--role "Storage Blob Data Owner" \
--scope "$STORAGE_ID_Y1"
az role assignment delete \
--assignee "$APP_PRINCIPAL_S1" \
--role "Storage Blob Data Owner" \
--scope "$STORAGE_ID_S1"
3.4.2 Expected observation by plan
| Plan | Expected immediate behavior | Typical emergence window |
| FC1 | Intermittent success then timeout/502 pattern | T+30s to T+210s |
| EP1 | Deterministic 503 Function host is not running | Around T+180s |
| Y1 | 502 generic server error page | Within 2-5 minutes |
| S1 | 503 Function host is not running | Around T+120s to T+240s |
3.4.3 Evidence collection commands
for APP_HOST in "$APP_FC1.azurewebsites.net" "$APP_EP1.azurewebsites.net" "$APP_Y1.azurewebsites.net" "$APP_S1.azurewebsites.net"; do
curl --silent --show-error --max-time 15 "https://$APP_HOST/api/blob/read" --output /dev/null --write-out "$APP_HOST read:%{http_code} total:%{time_total}\n"
done
let startTime = ago(20m);
requests
| where timestamp > startTime
| where cloud_RoleName in ("lab-matrix-fc1-func","lab-matrix-ep1-func","lab-matrix-y1-func","lab-matrix-s1-func")
| summarize Calls=count(), Failures=countif(success == false) by cloud_RoleName, resultCode
| order by cloud_RoleName asc
3.4.4 Recovery procedure
az role assignment create --assignee-object-id "$MI_PRINCIPAL_FC1" --role "Storage Blob Data Owner" --scope "$STORAGE_ID_FC1"
az role assignment create --assignee-object-id "$APP_PRINCIPAL_EP1" --role "Storage Blob Data Owner" --scope "$STORAGE_ID_EP1"
az role assignment create --assignee-object-id "$APP_PRINCIPAL_Y1" --role "Storage Blob Data Owner" --scope "$STORAGE_ID_Y1"
az role assignment create --assignee-object-id "$APP_PRINCIPAL_S1" --role "Storage Blob Data Owner" --scope "$STORAGE_ID_S1"
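Wait for the role assignments to propagate (this can take several minutes), then repeat the probes from 3.4.3 until /api/blob/read returns 200 on each plan.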
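Because propagation time varies, a bounded retry loop is usually more reliable than a fixed sleep. The following is a minimal sketch; `retry_until` and `blob_read_ok` are hypothetical helpers, and the attempt count and delay in the usage line are arbitrary.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt $n"
      return 0
    fi
    n=$((n + 1))
    # Sleep only when another attempt will follow.
    if [ "$n" -le "$attempts" ]; then sleep "$delay"; fi
  done
  echo "gave up after $attempts attempts" >&2
  return 1
}

# blob_read_ok HOST
# Succeeds only when /api/blob/read on HOST answers HTTP 200.
blob_read_ok() {
  code=$(curl --silent --max-time 15 --output /dev/null \
    --write-out '%{http_code}' "https://$1/api/blob/read")
  [ "$code" = "200" ]
}

# Possible usage: up to 20 probes, 30 seconds apart.
# retry_until 20 30 blob_read_ok "$APP_EP1.azurewebsites.net"
```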
3.5 Fault Injection 2: identity removal
3.5.1 Injection command
FC1 uses a user-assigned identity attachment, while EP1, Y1, and S1 use system-assigned identity.
# FC1: remove user-assigned identity
UAI_ID=$(az identity show --resource-group "$RG_FC1" --name "$IDENTITY_FC1" --query "id" --output tsv)
az functionapp identity remove \
--resource-group "$RG_FC1" \
--name "$APP_FC1" \
--identities "$UAI_ID"
# EP1, Y1, S1: remove system-assigned identity
az functionapp identity remove --resource-group "$RG_EP1" --name "$APP_EP1" --identities "[system]"
az functionapp identity remove --resource-group "$RG_Y1" --name "$APP_Y1" --identities "[system]"
az functionapp identity remove --resource-group "$RG_S1" --name "$APP_S1" --identities "[system]"
3.5.2 Expected observation by plan
| Plan | Expected immediate behavior | Important caveat |
| FC1 | Full timeout pattern within ~2 minutes | More severe than RBAC removal |
| EP1 | No immediate failure after identity removal | Failure appears after restart or token expiry window |
| Y1 | 502 within ~2 minutes | No restart required to observe |
| S1 | No immediate failure after removal | Restart required for deterministic exposure |
3.5.3 Evidence collection commands
for APP_HOST in "$APP_FC1.azurewebsites.net" "$APP_EP1.azurewebsites.net" "$APP_Y1.azurewebsites.net" "$APP_S1.azurewebsites.net"; do
curl --silent --show-error --max-time 15 "https://$APP_HOST/api/health" --output /dev/null --write-out "$APP_HOST health:%{http_code} total:%{time_total}\n"
curl --silent --show-error --max-time 15 "https://$APP_HOST/api/blob/read" --output /dev/null --write-out "$APP_HOST read:%{http_code} total:%{time_total}\n"
done
To expose cache-masked impact on EP1 and S1:
az functionapp restart --resource-group "$RG_EP1" --name "$APP_EP1"
az functionapp restart --resource-group "$RG_S1" --name "$APP_S1"
3.5.4 Recovery procedure
az functionapp identity assign --resource-group "$RG_EP1" --name "$APP_EP1"
az functionapp identity assign --resource-group "$RG_Y1" --name "$APP_Y1"
az functionapp identity assign --resource-group "$RG_S1" --name "$APP_S1"
For FC1 user-assigned identity:
UAI_ID=$(az identity show --resource-group "$RG_FC1" --name "$IDENTITY_FC1" --query "id" --output tsv)
az functionapp identity assign --resource-group "$RG_FC1" --name "$APP_FC1" --identities "$UAI_ID"
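Restore the required storage roles after reattaching the identity. Re-enabling a system-assigned identity creates a new principal ID, so re-read principalId for EP1, Y1, and S1 before recreating role assignments; the values captured in 3.3.3 no longer apply. The FC1 user-assigned identity keeps its principal ID and its existing role assignments when it is detached and reattached.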
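For the system-assigned plans, the role restore has to target whatever principal the app has now rather than a value saved earlier. The sketch below looks up the current principal and recreates the blob role used elsewhere in this guide; `restore_blob_role` is a hypothetical helper, and any additional roles your deployment needs are not covered.

```shell
# restore_blob_role RESOURCE_GROUP APP_NAME STORAGE_SCOPE
# Reads the app's current system-assigned principal ID and grants it
# "Storage Blob Data Owner" on the given storage account scope.
restore_blob_role() {
  principal=$(az functionapp identity show \
    --resource-group "$1" --name "$2" \
    --query "principalId" --output tsv) || return 1
  if [ -z "$principal" ]; then
    echo "no system-assigned identity on $2" >&2
    return 1
  fi
  az role assignment create \
    --assignee-object-id "$principal" \
    --assignee-principal-type ServicePrincipal \
    --role "Storage Blob Data Owner" \
    --scope "$3"
}

# Possible usage:
# restore_blob_role "$RG_EP1" "$APP_EP1" "$STORAGE_ID_EP1"
```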
3.6 Fault Injection 3: DNS break
3.6.1 Injection command
This fault applies to FC1, EP1, and S1 only.
Option A (preferred): break private DNS links for storage zones.
az network private-dns link vnet delete \
--resource-group "$RG_FC1" \
--zone-name "privatelink.blob.core.windows.net" \
--name "$BASE_FC1-blob-dns-link" \
--yes
Repeat for queue, table, and (where applicable) file zones, substituting the service name. DNS link names follow the pattern ${BASE}-${service}-dns-link.
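The naming pattern above lends itself to a small generator so the zone and link names are not retyped per service. This is a sketch that only prints the name pairs; `dns_link_targets` is a hypothetical helper, and the commented loop shows one way it might feed the delete command.

```shell
# dns_link_targets BASE SERVICE...
# Prints "<zone-name> <link-name>" for each storage service, one pair per line,
# following the ${BASE}-${service}-dns-link naming pattern.
dns_link_targets() {
  base="$1"; shift
  for service in "$@"; do
    printf 'privatelink.%s.core.windows.net %s-%s-dns-link\n' \
      "$service" "$base" "$service"
  done
}

# Possible usage for FC1 (four storage services):
# dns_link_targets "$BASE_FC1" blob queue table file | while read -r ZONE LINK; do
#   az network private-dns link vnet delete \
#     --resource-group "$RG_FC1" --zone-name "$ZONE" --name "$LINK" --yes
# done
```

For S1, pass only `blob queue table`, since that plan has no file private endpoint in this lab.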
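Option B: inject an incorrect A record into the private zone. Because add-record appends to the existing record set, the original private endpoint address remains alongside the injected one, so resolution may alternate between the two.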
az network private-dns record-set a add-record \
--resource-group "$RG_EP1" \
--zone-name "privatelink.blob.core.windows.net" \
--record-set-name "$STORAGE_EP1" \
--ipv4-address "10.255.255.254"
3.6.2 Expected observation by plan
| Plan | Expected symptom |
| FC1 | /api/health can remain 200 while storage endpoints time out |
| EP1 | 503 :( Application Error after restart |
| Y1 | Not applicable |
| S1 | 503 Function host is not running (less differentiable vs RBAC) |
3.6.3 Evidence collection commands
nslookup "$STORAGE_FC1.blob.core.windows.net"
nslookup "$STORAGE_EP1.blob.core.windows.net"
nslookup "$STORAGE_S1.blob.core.windows.net"
curl --silent --show-error --max-time 15 "https://$APP_FC1.azurewebsites.net/api/health" --output /dev/null --write-out "fc1 health:%{http_code} total:%{time_total}\n"
curl --silent --show-error --max-time 15 "https://$APP_FC1.azurewebsites.net/api/blob/read" --output /dev/null --write-out "fc1 read:%{http_code} total:%{time_total}\n"
3.6.4 Recovery procedure
Recreate any deleted virtual network links (Option A) and remove any injected A records (Option B).
az network private-dns record-set a remove-record \
--resource-group "$RG_EP1" \
--zone-name "privatelink.blob.core.windows.net" \
--record-set-name "$STORAGE_EP1" \
--ipv4-address "10.255.255.254"
Restart EP1 and S1 to flush stale resolution state where needed.
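If Option A was used, the deleted virtual network links have to be recreated before a restart can help. The sketch below recreates one link using the link naming pattern from 3.6.1. It assumes the VNet is named `${BASE}-vnet` and lives in the same resource group, which matches the S1 example elsewhere in this guide but should be confirmed against your deployment; `restore_dns_link` is a hypothetical helper.

```shell
# restore_dns_link RESOURCE_GROUP BASE SERVICE
# Recreates the private DNS zone link for one storage service.
# Assumption: the VNet is named "<BASE>-vnet" in the same resource group.
restore_dns_link() {
  az network private-dns link vnet create \
    --resource-group "$1" \
    --zone-name "privatelink.$3.core.windows.net" \
    --name "$2-$3-dns-link" \
    --virtual-network "$2-vnet" \
    --registration-enabled false
}

# Possible usage:
# for SERVICE in blob queue table file; do
#   restore_dns_link "$RG_FC1" "$BASE_FC1" "$SERVICE"
# done
```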
3.7 Fault Injection 4: network path break
3.7.1 Injection command
This fault applies to FC1, EP1, and S1 only.
Option A: delete storage private endpoints.
az network private-endpoint delete --resource-group "$RG_FC1" --name "$BASE_FC1-pe-blob"
az network private-endpoint delete --resource-group "$RG_FC1" --name "$BASE_FC1-pe-queue"
az network private-endpoint delete --resource-group "$RG_FC1" --name "$BASE_FC1-pe-table"
az network private-endpoint delete --resource-group "$RG_FC1" --name "$BASE_FC1-pe-file"
Option B: restrict subnet NSG/UDR path used by integration subnet.
Prerequisite
The shipped Bicep templates do not create an NSG. You must create and associate one with the integration subnet before using this approach. For example:
az network nsg create \
--resource-group "$RG_S1" \
--name "nsg-s1-integration" \
--location "$LOCATION"
az network vnet subnet update \
--resource-group "$RG_S1" \
--vnet-name "$BASE_S1-vnet" \
--name "subnet-integration" \
--network-security-group "nsg-s1-integration"
az network nsg rule create \
--resource-group "$RG_S1" \
--nsg-name "nsg-s1-integration" \
--name "deny-storage-443" \
--priority 120 \
--direction Outbound \
--access Deny \
--protocol Tcp \
--source-address-prefixes "*" \
--source-port-ranges "*" \
--destination-address-prefixes "Storage" \
--destination-port-ranges "443"
3.7.2 Expected observation by plan
| Plan | Expected symptom |
| FC1 | Timeout first, then 502; health can remain briefly OK |
| EP1 | 503 :( Application Error |
| Y1 | Not applicable |
| S1 | 503 Function host is not running; DNS may revert to public IPs quickly |
3.7.3 Evidence collection commands
nslookup "$STORAGE_FC1.blob.core.windows.net"
nslookup "$STORAGE_EP1.blob.core.windows.net"
nslookup "$STORAGE_S1.blob.core.windows.net"
for APP_HOST in "$APP_FC1.azurewebsites.net" "$APP_EP1.azurewebsites.net" "$APP_S1.azurewebsites.net"; do
curl --silent --show-error --max-time 15 "https://$APP_HOST/api/blob/read" --output /dev/null --write-out "$APP_HOST read:%{http_code} total:%{time_total}\n"
done
3.7.4 Recovery procedure
Recreate the deleted private endpoints (Option A) or remove the blocking NSG rule (Option B).
az network nsg rule delete \
--resource-group "$RG_S1" \
--nsg-name "nsg-s1-integration" \
--name "deny-storage-443"
4) Experiment Log
This section contains the primary evidence from the full matrix run.
4.1 Error Signature Matrix (key finding)
| Plan | RBAC Removal | Identity Removal | DNS Break | Network Break |
| FC1 (Flex) | HTTP 000 timeout / 502 generic | HTTP 000 timeout (all down) | HTTP 000 timeout (health OK briefly) | HTTP 000 timeout -> 502 |
| EP1 (Premium) | 503 "Function host is not running" | HTTP 000 timeout (only after restart!) | 503 ":( Application Error" | 503 ":( Application Error" |
| Y1 (Consumption) | 502 generic "Server Error" | 502 generic "Server Error" | N/A (no VNet) | N/A (no VNet) |
| Dedicated (S1) | 503 "Function host is not running" | HTTP 000 timeout (after restart) | 503 "Function host is not running" | 503 "Function host is not running" |
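Read in reverse, the matrix maps an observed signature back to the fault classes that could have produced it on a given plan. The sketch below is a direct transcription of the table; `candidate_faults` and the page labels `host-not-running` and `application-error` are names introduced here for illustration. It reflects only what this run observed and is not a general diagnostic.

```shell
# candidate_faults PLAN HTTP_CODE [PAGE]
# PLAN: FC1, EP1, Y1, S1. HTTP_CODE: curl's %{http_code} (000 means no response).
# PAGE: "host-not-running", "application-error", or omitted.
# Prints the fault classes consistent with the error signature matrix.
candidate_faults() {
  case "$1:$2:$3" in
    FC1:000:*)                 echo "rbac identity dns network" ;;
    FC1:502:*)                 echo "rbac network" ;;
    EP1:503:host-not-running)  echo "rbac" ;;
    EP1:503:application-error) echo "dns network" ;;
    EP1:000:*|S1:000:*)        echo "identity" ;;  # seen only after a restart
    Y1:502:*)                  echo "rbac identity" ;;
    S1:503:host-not-running)   echo "rbac dns network" ;;
    *)                         echo "unmapped" ;;
  esac
}
```

For example, `candidate_faults EP1 503 application-error` prints `dns network`, while the same 503 with `host-not-running` on S1 prints `rbac dns network`, which is why S1 needs DNS and role checks together.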
4.2 Key Behavioral Discoveries
4.2.1 Token caching on always-on plans
| Finding | Evidence |
| EP1 and S1 cache identity tokens aggressively | Identity removal showed no immediate endpoint impact in first observation window |
| Failure appears after restart or once the cached token expires (on the order of 24 hours) | Restart created deterministic failure manifestation |
| FC1 and Y1 degrade quickly without restart | Failures observed within 2-5 minutes after identity removal |
4.2.2 DNS break can be invisible on FC1
| Finding | Evidence |
| /api/health remained 200 while storage operations timed out | FC1 DNS break run showed green health with storage path failures |
| Health-only probes can miss total workload outage | Storage-specific checks detected incident when synthetic health did not |
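A probe that combines the health code with the storage read and write codes can flag this state explicitly. The sketch below treats 200 on read and 200 or 201 on write as success, matching the baselines in this guide; `probe_verdict` and its three output labels are hypothetical.

```shell
# probe_verdict HEALTH_CODE READ_CODE WRITE_CODE
# Prints "healthy", "masked-outage" (health is green but storage is failing),
# or "visible-failure" (the health endpoint itself is failing).
probe_verdict() {
  storage_ok=no
  if [ "$2" = "200" ]; then
    case "$3" in
      200|201) storage_ok=yes ;;
    esac
  fi
  if [ "$1" != "200" ]; then
    echo "visible-failure"
  elif [ "$storage_ok" = "yes" ]; then
    echo "healthy"
  else
    echo "masked-outage"
  fi
}
```

For example, `probe_verdict 200 000 000` prints `masked-outage`, the FC1 DNS break shape described above.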
4.2.3 EP1 error page differentiation
| Fault type | EP1 external page | Interpretation |
| RBAC removal | 503 Function host is not running | Runtime-host layer dependency startup fault |
| DNS break | 503 :( Application Error | App/container-path failure presentation |
| Network break | 503 :( Application Error | Similar presentation to DNS break |
Dedicated behavior contrast:
| Fault type | S1 external page |
| RBAC removal | 503 Function host is not running |
| DNS break | 503 Function host is not running |
| Network break | 503 Function host is not running |
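Since the status code is 503 in every row, telling the variants apart means looking at the response body. The sketch below matches the two page strings recorded in the tables above; `page_variant` is a hypothetical helper, and the exact page markup may differ from these substrings in your environment.

```shell
# page_variant
# Reads a response body on stdin and prints which 503 page variant it contains:
# "host-not-running", "application-error", or "other".
page_variant() {
  body=$(cat)
  case "$body" in
    *"Function host is not running"*) echo "host-not-running" ;;
    *"Application Error"*)            echo "application-error" ;;
    *)                                echo "other" ;;
  esac
}

# Possible usage:
# curl --silent --max-time 15 "https://$APP_EP1.azurewebsites.net/api/blob/read" | page_variant
```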
4.2.4 Stale DNS after private endpoint deletion
| Plan | DNS behavior after PE deletion | Operational impact |
| FC1 | Private DNS retained stale private IPs | Looks healthy at DNS layer, data path still broken |
| EP1 | Similar stale private IP retention observed | Misleads operators who only validate name resolution |
| S1 | DNS reverted to public IPs quickly | Still fails due to storage firewall/public path block |
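The table suggests pairing every DNS check with a data-path probe, since neither result is conclusive alone. The sketch below combines the two; it assumes the resolution class was obtained from a host that uses the VNet's DNS view. `path_diagnosis` and its messages are hypothetical.

```shell
# path_diagnosis IP_CLASS READ_CODE
# IP_CLASS: "private" or "public", describing how the storage blob name resolved.
# READ_CODE: HTTP status from the /api/blob/read probe (000 means no response).
path_diagnosis() {
  if [ "$2" = "200" ]; then
    echo "data path ok"
  elif [ "$1" = "private" ]; then
    echo "resolves private but data path failing: check for stale record or deleted endpoint"
  else
    echo "resolves public and data path failing: check storage firewall and endpoint state"
  fi
}
```

A `private` resolution with a failing read corresponds to the FC1 and EP1 rows, and a `public` resolution with a failing read corresponds to the S1 row.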
4.2.5 Y1 boundary condition
| Constraint | Effect |
| No VNet support | DNS and private-endpoint network fault classes are not applicable |
| Public endpoint dependency | MI-based storage works, but deployment publish flow may need temporary connection string workaround |
4.2.6 Detection difficulty ranking
| Rank | Fault pattern | Why detection is difficult |
| 1 | Identity removal on EP1/S1 | Cache-masked and appears healthy until restart/expiry |
| 2 | DNS break on FC1 | Health endpoint remains green while storage calls fail |
| 3 | PE deletion with stale DNS | Resolution appears valid despite broken path |
| 4 | RBAC removal | Usually produces clear and fast observable failure signatures |
4.3 Per-plan detailed evidence
4.3.1 FC1 (Flex Consumption)
Baseline
| Signal | Observation |
| /api/health | 200 |
| /api/blob/read | 200 |
| /api/blob/write | 200/201 |
| DNS (blob,queue,table,file) | Private IP resolution |
| Overall state | Healthy baseline |
Fault A: RBAC removal
| Relative time | Observation |
| T+00s | RBAC role removed |
| T+30s | Some operations still succeed from cached credentials |
| T+90s | Increased latency and first intermittent failures |
| T+150s | Read/write increasingly time out |
| T+210s | Primary symptom: HTTP 000 timeout and occasional 502 |
| T+240s | Cold-start retries occasionally succeed, then fail again |
| Signature class | Evidence |
| External symptom | HTTP 000 timeout / generic 502 |
| Strength of signal | Moderate (intermittent window creates ambiguity) |
| Differentiation from identity removal | Less severe early phase due to cached authorization path |
Fault B: identity removal
| Relative time | Observation |
| T+00s | Identity detached |
| T+30s | Early request slowdowns |
| T+60s | Storage operations begin timing out |
| T+120s | All tested endpoints time out |
| T+180s | Persistent outage pattern |
| Signature class | Evidence |
| External symptom | HTTP 000 timeout across endpoints |
| Severity | Higher than RBAC removal |
| Why | No usable token path at all |
Fault C: DNS break
| Relative time | Observation |
| T+00s | DNS link/record broken |
| T+30s | /api/health remains 200 |
| T+60s | /api/blob/read timeout starts |
| T+120s | /api/blob/write timeout persists |
| T+180s | Semi-functional state continues |
| Signature class | Evidence |
| External symptom | Storage calls time out while health is green |
| Diagnostic trap | Health-only checks miss incident |
| Required check | Storage-specific probe endpoints |
Fault D: network break
| Relative time | Observation |
| T+00s | PE path deleted or blocked |
| T+30s | Health endpoint still may pass |
| T+90s | Storage calls shift to timeout |
| T+180s | Full degradation visible |
| T+240s | Timeout then 502 pattern |
| DNS observation | Result |
| Private zone records | Stale private IPs remained |
| Interpretation risk | Appears DNS-correct while path is broken |
FC1 recovery notes
| Recovery action | Expected result |
| Restore RBAC or identity | Function begins recovering after propagation |
| Restore DNS/network path | Storage endpoints recover first; health may have stayed green throughout |
| Final validation | /api/blob/read and /api/blob/write return to 200/201 |
4.3.2 EP1 (Premium)
Baseline
| Signal | Observation |
| /api/health | 200 |
| /api/blob/read | 200 |
| /api/blob/write | 200/201 |
| DNS private endpoints | Healthy private resolution |
| Overall state | Healthy baseline |
Fault A: RBAC removal
| Relative time | Observation |
| T+00s | RBAC role removed |
| T+60s | Growing failures under load |
| T+120s | Runtime instability visible |
| T+180s | 503 page: Function host is not running |
| T+240s | Stable failed state |
| Signature class | Evidence |
| External symptom | 503 Function host is not running |
| Signal clarity | High |
| Layer hint | Runtime-host startup dependency failure |
Fault B: identity removal
| Relative time | Observation |
| T+00s | Identity removed |
| T+120s | All endpoints still working |
| T+240s | Still operational in initial window |
| Post-restart | Failures manifest |
| Signature class | Evidence |
| External symptom pre-restart | Healthy |
| External symptom post-restart | Timeout / unavailable behavior |
| Diagnostic difficulty | Highest in matrix |
Fault C: DNS break
| Relative time | Observation |
| T+00s | DNS path broken |
| T+60s | Degraded startup behavior |
| Post-restart | 503 page: :( Application Error |
| T+180s | Persistent application error page |
| Signature class | Evidence |
| External symptom | 503 :( Application Error |
| Differentiation | Distinct from RBAC 503 page |
| Interpretation | DNS/network class likely, not pure RBAC |
Fault D: network break
| Relative time | Observation |
| T+00s | Private endpoint path broken |
| T+60s | Degradation starts |
| T+120s | Request failures increase |
| T+180s | 503 :( Application Error |
| Signature class | Evidence |
| External symptom | Same as DNS break |
| Disambiguation need | Must inspect DNS and endpoint topology state |
EP1 recovery notes
| Recovery action | Expected result |
| Restore RBAC | Function host is not running clears after propagation and recycle |
| Reattach identity + restart | Restores token acquisition path |
| Restore DNS/network and restart | Clears :( Application Error in this lab pattern |
4.3.3 Y1 (Consumption)
Baseline
| Signal | Observation |
| /api/health | 200 |
| /api/blob/read | 200 |
| /api/blob/write | 200/201 |
| Networking model | Public only (no VNet, no PE) |
| Overall state | Healthy baseline |
Fault A: RBAC removal
| Relative time | Observation |
| T+00s | RBAC role removed |
| T+60s | Read/write begin failing |
| T+120s | 502 generic Server Error page |
| T+240s | Persistent generic 502 |
| Signature class | Evidence |
| External symptom | 502 generic Server Error |
| Signal clarity | Lower than EP1/S1 explicit host message |
Fault B: identity removal
| Relative time | Observation |
| T+00s | Identity removed |
| T+60s | Degraded storage access |
| T+120s | 502 generic Server Error |
| T+180s | Persistent failure |
| Signature class | Evidence |
| External symptom | 502 generic Server Error |
| Restart requirement | Not required to expose in this run |
Fault C: DNS break
| Field | Observation |
| Applicability | Not applicable |
| Reason | No VNet/private DNS dependency path |
| Test result | Skipped by design |
Fault D: network break
| Field | Observation |
| Applicability | Not applicable |
| Reason | No PE/VNet data path in this architecture |
| Test result | Skipped by design |
Y1 recovery notes
| Recovery action | Expected result |
| Restore RBAC role | Endpoint success returns after propagation |
| Re-enable identity | 502 generic errors clear |
| Publish caveat | func publish may require temporary connection-string workaround |
4.3.4 Dedicated S1
Baseline
| Signal | Observation |
| /api/health | 200 |
| /api/blob/read | 200 |
| /api/blob/write | 200/201 |
| DNS private endpoints | Private resolution for blob/queue/table |
| Architecture note | No file PE in this run (run-from-package) |
Fault A: RBAC removal
| Relative time | Observation |
| T+00s | RBAC role removed |
| T+60s | Partial instability |
| T+120s | 503 Function host is not running |
| T+240s | Stable failed state |
| Signature class | Evidence |
| External symptom | 503 Function host is not running |
| Similarity | Matches EP1 for this fault |
Fault B: identity removal
| Relative time | Observation |
| T+00s | Identity removed |
| T+120s | Service still appears healthy |
| T+240s | Often still healthy in initial window |
| Post-restart | HTTP 000 timeout; outage appears |
| Signature class | Evidence |
| External symptom pre-restart | Healthy |
| External symptom post-restart | HTTP 000 timeout/failure |
| Diagnostic challenge | Cache-masked identity drift |
Fault C: DNS break
| Relative time | Observation |
| T+00s | DNS path broken |
| T+60s | Startup degradation |
| T+120s | 503 Function host is not running |
| T+240s | Persistent 503 host-not-running |
| Signature class | Evidence |
| External symptom | Same 503 host-not-running signature as RBAC |
| Disambiguation requirement | Must check DNS and role assignment together |
Fault D: network break
| Relative time | Observation |
| T+00s | PE deleted or blocked |
| T+60s | DNS begins changing |
| T+120s | DNS reverts to public IPs (20.150.x.x class observed) |
| T+180s | 503 Function host is not running |
| DNS observation | Result |
| Re-resolution target | Public storage endpoint IPs |
| Why still failing | Storage firewall blocked public path |
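The DNS reversion above can be detected by classifying each resolved address as private or public. The following is a minimal sketch, not part of the lab tooling: the function names are hypothetical, and the hostname you pass in is whatever storage endpoint your lab deployment uses.

```python
import ipaddress
import socket


def classify_ip(ip: str) -> str:
    """Return 'private' for RFC 1918 and other reserved non-global ranges, else 'public'."""
    return "private" if ipaddress.ip_address(ip).is_private else "public"


def resolve_and_classify(hostname: str) -> dict[str, str]:
    """Resolve a hostname (hypothetical helper) and classify each returned address."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    # info[4] is the socket address tuple; its first element is the IP string.
    return {info[4][0]: classify_ip(info[4][0]) for info in infos}
```

Run from inside the VNet, a `public` result for a storage hostname that resolved to a private address at baseline is consistent with the Fault D observation: the private endpoint record is gone and the storage firewall then blocks the public path.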
S1 recovery notes
| Recovery action | Expected result |
| Restore RBAC | Host resumes after propagation |
| Reattach identity + restart | Clears post-restart timeout behavior |
| Restore PE/network path | 503 host-not-running clears |
4.4 Matrix verdict
| Hypothesis component | Status | Why |
| Distinct failure signatures by plan | Supported | Signature matrix shows strong asymmetry |
| Distinct propagation timelines by plan | Supported | Identity removal timing diverges across recycle and always-on plans |
| Distinct diagnostic indicators by plan | Supported | EP1 page split, FC1 health blind spot, Y1 applicability boundary |
| Need for plan-specific runbooks | Supported | Generic path would miss at least one high-risk fault class |
Expected Evidence
Before Trigger (Baseline)
| Plan | Endpoint baseline | Storage baseline | DNS baseline |
| FC1 | /api/health, /api/blob/read, /api/blob/write return success | Storage operations succeed via MI | Private IP resolution for blob/queue/table/file |
| EP1 | Same success profile | Storage operations succeed via MI | Private IP resolution for blob/queue/table/file |
| Y1 | Same success profile | Storage operations succeed via MI | Public endpoint resolution only |
| S1 | Same success profile | Storage operations succeed via MI | Private IP resolution for blob/queue/table |
During Incident
During RBAC removal
| Plan | Expected incident evidence |
| FC1 | Timeout/502 mix appears after short cache window |
| EP1 | 503 Function host is not running around T+180s |
| Y1 | 502 generic Server Error |
| S1 | 503 Function host is not running |
During identity removal
| Plan | Expected incident evidence |
| FC1 | Full timeout within ~2 minutes |
| EP1 | Healthy initially; fails after restart |
| Y1 | 502 within ~2 minutes |
| S1 | Healthy initially; fails after restart |
During DNS break
| Plan | Expected incident evidence |
| FC1 | Health endpoint still 200 while storage operations timeout |
| EP1 | 503 :( Application Error |
| Y1 | Not applicable |
| S1 | 503 Function host is not running |
During network break
| Plan | Expected incident evidence |
| FC1 | HTTP 000 timeout then 502 |
| EP1 | 503 :( Application Error |
| Y1 | Not applicable |
| S1 | 503 host-not-running with rapid DNS reversion to public IPs |
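The incident tables above distinguish four external signatures. A probe script could label each response along these lines. This is a hedged sketch: the class labels and the function name are illustrative assumptions, and the substring checks simply mirror the page text quoted in this lab.

```python
from typing import Optional


def classify_signature(status: Optional[int], body: str = "") -> str:
    """Map one probe result to a failure signature class used in this lab.

    status is None (or 0) when the client received no HTTP response,
    which curl reports as HTTP 000.
    """
    if status is None or status == 0:
        return "timeout"
    if 200 <= status < 300:
        return "healthy"
    if status == 503 and "Function host is not running" in body:
        return "503-host-not-running"
    if status == 503 and "Application Error" in body:
        return "503-application-error"
    if status == 502:
        return "502-generic"
    return "unclassified"
```

The label alone does not identify the fault. On S1, for example, RBAC removal and DNS break produce the same host-not-running signature, so the result still has to be combined with the hosting plan and with DNS and role assignment checks.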
After Recovery
| Verification step | FC1 | EP1 | Y1 | S1 |
| Restore faulted control (RBAC/identity/DNS/network) | Required | Required | Required for applicable faults | Required |
| Restart recommendation | Optional by fault | Recommended for deterministic validation | Usually not needed | Recommended |
| /api/health success | Expected | Expected | Expected | Expected |
| /api/blob/read success | Expected | Expected | Expected | Expected |
| /api/blob/write success | Expected | Expected | Expected | Expected |
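The three endpoint checks in the table can be scripted as one verification pass per plan. The sketch below is an assumption-laden illustration: `verify_recovery` and `http_status` are hypothetical names, every endpoint is probed with a plain GET (adjust if your write endpoint expects another method), and any 2xx status counts as success.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional

ENDPOINTS = ("/api/health", "/api/blob/read", "/api/blob/write")


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with a 4xx/5xx status
    except OSError:
        return None  # connection failure or timeout (URLError is an OSError)


def verify_recovery(
    base_url: str,
    fetch: Callable[[str], Optional[int]] = http_status,
) -> Dict[str, bool]:
    """Probe each lab endpoint once and report whether it returned 2xx."""
    results = {}
    for path in ENDPOINTS:
        status = fetch(base_url.rstrip("/") + path)
        results[path] = status is not None and 200 <= status < 300
    return results
```

Passing `fetch` as a parameter keeps the check testable without network access. For EP1 and S1, run the pass again after a controlled restart, since those plans can appear healthy until the host recycles.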
Evidence Timeline
gantt
title Hosting Plan Security Matrix Fault Timeline
dateFormat X
axisFormat T+%Ss
section Baseline
All plans healthy :done, b1, 0, 30
section Fault Injection Window
RBAC removal :done, f1, 30, 30
Identity removal :done, f2, 90, 30
DNS break ("FC1/EP1/S1") :done, f3, 150, 30
Network break ("FC1/EP1/S1") :done, f4, 210, 30
section Distinct Emergence
FC1 fast timeout behavior :crit, e1, 120, 120
EP1/S1 cache-masked identity interval :active, e2, 120, 180
Y1 generic 502 behavior :crit, e3, 150, 90
EP1 page differentiation visible :active, e4, 180, 120
section Recovery
Controls restored + validation probes :done, r1, 360, 120
Evidence Chain: Why This Proves the Hypothesis
- The same four fault classes were injected across all four plans.
- The observable signatures diverged by plan (timeouts, 502 generic, host-not-running 503, application-error 503).
- Identity removal showed plan-specific propagation dynamics (cache-masked vs near-immediate).
- DNS and network classes were structurally non-applicable to Y1, confirming architecture boundaries.
- Recovery required different emphasis per plan (especially restart strategy and DNS/path checks), confirming plan-specific runbook necessity.
Operational Recommendations
- Implement storage-specific health checks in addition to /api/health so FC1 DNS-invisible failures are detected.
- Alert on storage dependency latency and timeout spikes, not just HTTP status counts.
- Treat identity changes on EP1 and S1 as incomplete until validated after controlled restart.
- Add continuous DNS resolution monitoring for private endpoint environments and detect stale private records.
- Maintain plan-specific incident runbooks and route triage by hosting plan before root-cause branching.
Suggested alert mapping
| Alert condition | Recommended severity | Plans |
| Storage dependency timeout rate > baseline threshold | High | FC1, EP1, S1, Y1 |
| /api/health success but storage endpoint failure | Critical | FC1 |
| EP1 503 page text contains Function host is not running | High | EP1 |
| EP1 503 page text contains :( Application Error | High | EP1 |
| S1 503 host-not-running + DNS switch to public IP | High | S1 |
| Y1 repeated 502 generic with RBAC drift event | High | Y1 |
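As one illustration of how the mapping could be encoded, the sketch below covers only two rows of the table: the FC1 health blind spot and the generic storage failure case. It is a simplification under stated assumptions: `alert_severity` is a hypothetical helper, and the boolean inputs stand in for whatever probe or metric signals your monitoring actually produces.

```python
from typing import Optional


def alert_severity(plan: str, health_ok: bool, storage_ok: bool) -> Optional[str]:
    """Return an alert severity for one probe cycle, or None if no alert applies."""
    if storage_ok:
        return None  # storage path works; nothing to raise from these two rules
    if plan == "FC1" and health_ok:
        # /api/health succeeds while storage fails: the FC1 blind spot.
        return "Critical"
    return "High"  # storage dependency failure on any plan
```

For example, `alert_severity("FC1", True, False)` yields `"Critical"`, while the same inputs for `"S1"` yield `"High"`. The page-text and DNS-switch rows would need additional inputs beyond these two flags.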
Clean Up
Delete all lab resource groups:
az group delete --name "$RG_FC1" --yes --no-wait
az group delete --name "$RG_EP1" --yes --no-wait
az group delete --name "$RG_Y1" --yes --no-wait
az group delete --name "$RG_S1" --yes --no-wait
See Also
Sources