Hands-on Labs¶
These labs let you practice incident response on reproducible Azure Functions failure scenarios. Run labs in a non-production environment and treat them like live incidents: detect, triage, diagnose, fix, and verify.
graph TD
A[Hands-on Labs] --> B[Performance]
A --> C["Storage / Identity"]
A --> D[Network]
A --> E["Execution / Runtime"]
A --> F[Event Processing]
B --> B1[Cold Start]
B --> B2[Queue Backlog Scaling]
C --> C1[Storage Access Failure]
C --> C2[Managed Identity Auth]
D --> D1["DNS / VNet Resolution"]
E --> E1[Out of Memory Crash]
E --> E2[Deployment Not Running]
E --> E3[Durable Replay Storm]
F --> F1[Timer Missed Schedules]
F --> F2[Event Hub Checkpoint Lag]
How Labs Work¶
Each lab includes:
- Lab infrastructure: Bicep templates and app source in the labs/ directory
- Documentation page: Step-by-step walkthrough with KQL queries and expected observations
- Expected evidence: Baseline, during-incident, and after-recovery evidence to validate your investigation
Available Labs¶
Performance¶
| Lab | Symptom | Related Playbook |
|---|---|---|
| Cold Start | Elevated first-request latency after idle periods | High Latency / Slow Responses |
| Queue Backlog Scaling | Queue depth grows faster than processing throughput | Queue Messages Piling Up |
Storage / Identity¶
| Lab | Symptom | Related Playbook |
|---|---|---|
| Storage Access Failure | Triggers stop processing due to storage auth or connectivity issues | Functions Not Executing |
| Managed Identity Auth | Managed identity calls fail after RBAC or scope changes | Functions Failing with Errors |
Network¶
| Lab | Symptom | Related Playbook |
|---|---|---|
| DNS / VNet Resolution | Function app cannot resolve or reach private dependencies | Blob Trigger Not Firing |
Execution / Runtime¶
| Lab | Symptom | Related Playbook |
|---|---|---|
| Out of Memory Crash | Workers crash under memory pressure with large payloads | Out of Memory / Worker Crash |
| Deployment Not Running | Deployment succeeds but functions never execute | Deployment Failures |
| Durable Replay Storm | Durable orchestrations replay excessively with growing latency | Durable Orchestration Stuck |
Event Processing¶
| Lab | Symptom | Related Playbook |
|---|---|---|
| Timer Missed Schedules | Timer triggers miss scheduled executions after idle | Timeout / Execution Limit |
| Event Hub Checkpoint Lag | Event Hub processing falls behind and checkpoint lag grows | Event Hub / Service Bus Lag |
Prerequisites¶
All labs require:
- Azure subscription with Contributor access
- Azure CLI installed and logged in (az login)
- Bash shell (Linux, macOS, or WSL)
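The CLI-related prerequisites can be checked before starting a lab. The following is a minimal sketch, not part of the lab repository: the function names `require_cmd` and `require_az_login` are hypothetical, and the check relies on `az account show` exiting with a non-zero status when no account is logged in. It does not verify that the account holds Contributor access.

```shell
# Preflight sketch for the lab prerequisites (function names are illustrative).

# Return 0 if the given command is on PATH, 1 otherwise.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing required command: $1" >&2
  return 1
}

# Return 0 if the Azure CLI is installed and has an active login, 1 otherwise.
require_az_login() {
  require_cmd az || return 1
  if az account show --output none 2>/dev/null; then
    return 0
  fi
  echo "not logged in: run 'az login' first" >&2
  return 1
}
```

Calling `require_az_login` at the top of a lab session fails fast with a readable message instead of a mid-deployment error.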
General Workflow¶
# 1. Create resource group
az group create --name rg-lab-<name> --location koreacentral
# 2. Deploy infrastructure
az deployment group create \
--resource-group rg-lab-<name> \
--template-file labs/<name>/main.bicep \
--parameters baseName=lab<short>
# 3. Deploy app code (zip deploy)
# 4. Trigger the failure scenario
# 5. Wait 2-5 minutes for logs to appear
# 6. Investigate using playbooks and KQL queries
# 7. Clean up
az group delete --name rg-lab-<name> --yes --no-wait
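Steps 1, 2, and 7 follow a fixed naming pattern, so they can be wrapped in small helpers. This is a hedged sketch under stated assumptions: the helper names (`lab_rg_name`, `lab_template_path`, `run`, `deploy_lab`, `cleanup_lab`) and the `DRY_RUN` switch are hypothetical, and the lab name and short name you pass in are placeholders you must take from the lab you are running. Step 3 (zip deploy) is left out because the function app name and package are specific to each lab.

```shell
# Sketch: wrap the create, deploy, and clean-up steps of the general workflow.
# All helper names here are illustrative, not part of the lab repository.

# Resource group and template path follow the pattern used in the workflow.
lab_rg_name() { printf 'rg-lab-%s\n' "$1"; }
lab_template_path() { printf 'labs/%s/main.bicep\n' "$1"; }

# Print the command; execute it only when DRY_RUN=0 (default is dry run).
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Usage: deploy_lab <name> <short> [location]
deploy_lab() {
  local name="$1" short="$2" location="${3:-koreacentral}"
  local rg
  rg="$(lab_rg_name "$name")"
  run az group create --name "$rg" --location "$location" || return 1
  run az deployment group create \
    --resource-group "$rg" \
    --template-file "$(lab_template_path "$name")" \
    --parameters "baseName=lab${short}"
}

# Usage: cleanup_lab <name>
cleanup_lab() {
  run az group delete --name "$(lab_rg_name "$1")" --yes --no-wait
}
```

With the default dry run, `deploy_lab <name> <short>` only prints the two `az` commands so you can review them; `DRY_RUN=0 deploy_lab <name> <short>` executes them.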
Cost
Each lab deploys Azure Functions resources. Delete the resource group after completing the lab to avoid ongoing charges.
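To confirm that no lab resource groups are left behind, you can list the groups in the subscription and keep only those that follow the `rg-lab-<name>` pattern. A minimal sketch, assuming that naming pattern was used for every lab; the function name `filter_lab_groups` is hypothetical.

```shell
# Sketch: keep only resource group names that follow the rg-lab-<name> pattern.
# Reads one name per line on stdin; prints nothing if there are no matches.
filter_lab_groups() {
  grep '^rg-lab-' || true
}
```

For example, `az group list --query "[].name" --output tsv | filter_lab_groups` prints any remaining lab groups. Because the workflow deletes with `--no-wait`, a group can still appear in the list for a while after the delete command returns.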
Recommended Learning Sequence¶
Start with broad reliability issues, then move into specialized scenarios:
- Cold Start — Understand cold start vs dependency latency
- Queue Backlog Scaling — Backlog triage and throughput analysis
- Storage Access Failure — Storage auth and host errors
- Managed Identity Auth — RBAC and identity troubleshooting
- DNS / VNet Resolution — Network path and DNS diagnosis
- Out of Memory Crash — Memory limits and worker recycling
- Deployment Not Running — Successful deploy with no function execution
- Durable Replay Storm — Orchestration replay performance
- Timer Missed Schedules — Missed timer executions
- Event Hub Checkpoint Lag — Checkpoint lag and throughput
Practice Checklist¶
For each lab, confirm your team can do all of the following without guesswork:
- Detect the issue from alerts or dashboard anomalies.
- Execute the First 10 Minutes checklist.
- Select the right Playbook and isolate likely causes.
- Run 2-3 focused KQL queries from KQL Library.
- Apply a minimal fix and verify recovery in telemetry.
- Document root cause and prevention tasks.
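KQL queries can also be run from the same shell used to deploy the lab. The sketch below is illustrative only: the query is a generic example and is not taken from the KQL Library, the function names `kql_error_traces` and `run_kql` are hypothetical, and it assumes the lab's telemetry lands in an Application Insights resource and that the `application-insights` Azure CLI extension is installed.

```shell
# Sketch: build and run an illustrative KQL query from the shell.

# Print a query that counts error-and-above traces in 5-minute bins.
# $1 is a KQL timespan literal such as 30m or 2h (default: 30m).
kql_error_traces() {
  local window="${1:-30m}"
  printf 'traces | where timestamp > ago(%s) | where severityLevel >= 3 | summarize count() by bin(timestamp, 5m)\n' "$window"
}

# Run a query against an Application Insights resource.
# Usage: run_kql <app-insights-name> <resource-group> <query>
run_kql() {
  local app="$1" rg="$2" query="$3"
  az monitor app-insights query \
    --app "$app" \
    --resource-group "$rg" \
    --analytics-query "$query"
}
```

A call such as `run_kql <app-insights-name> rg-lab-<name> "$(kql_error_traces 1h)"` returns the result as JSON, which is convenient for saving baseline, during-incident, and after-recovery evidence side by side.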
Evidence Collection Skills¶
Each lab trains specific diagnostic skills:
| Lab | Primary Skill | Secondary Skill |
|---|---|---|
| Cold Start | Correlating host startup with request latency | Reading trace timeline |
| Storage Access Failure | Identifying auth errors in host logs | Verifying RBAC with CLI |
| Queue Backlog Scaling | Reading queue metrics vs execution metrics | Identifying poison message loops |
| DNS / VNet Resolution | Diagnosing DNS errors in dependency calls | Verifying private DNS zone configuration |
| Managed Identity Auth | Tracing RBAC changes in activity log | Correlating exceptions with config changes |
| Out of Memory Crash | Detecting OOM exceptions and worker restarts | Correlating memory pressure with payload size |
| Deployment Not Running | Reading function discovery logs | Validating project structure and runtime config |
| Durable Replay Storm | Measuring replay duration growth | Identifying orchestration history bloat |
| Timer Missed Schedules | Verifying timer execution gaps | Using isPastDue and RunOnStartup parameters |
| Event Hub Checkpoint Lag | Measuring checkpoint offset vs partition tail | Tuning batch size and prefetch settings |
Mapping Labs to Common Production Incidents¶
| Incident type | Best lab |
|---|---|
| Latency regression after idle periods | Cold Start |
| Trigger pipeline stalls | Storage Access Failure |
| Event ingestion cannot keep up | Queue Backlog Scaling |
| Private endpoint dependency outages | DNS / VNet Resolution |
| RBAC / identity breakages | Managed Identity Auth |
| Worker crashes under load | Out of Memory Crash |
| Deploy succeeded but nothing runs | Deployment Not Running |
| Orchestration latency grows over time | Durable Replay Storm |
| Scheduled tasks not running on time | Timer Missed Schedules |
| Event stream processing falling behind | Event Hub Checkpoint Lag |