Hands-on Labs

These labs let you practice incident response on reproducible Azure Functions failure scenarios. Run labs in a non-production environment and treat them like live incidents: detect, triage, diagnose, fix, and verify.

graph TD
    A[Hands-on Labs] --> B[Performance]
    A --> C["Storage / Identity"]
    A --> D[Network]
    A --> E["Execution / Runtime"]
    A --> F[Event Processing]
    B --> B1[Cold Start]
    B --> B2[Queue Backlog Scaling]
    C --> C1[Storage Access Failure]
    C --> C2[Managed Identity Auth]
    D --> D1["DNS / VNet Resolution"]
    E --> E1[Out of Memory Crash]
    E --> E2[Deployment Not Running]
    E --> E3[Durable Replay Storm]
    F --> F1[Timer Missed Schedules]
    F --> F2[Event Hub Checkpoint Lag]

How Labs Work

Each lab includes:

Lab infrastructure: Bicep templates and app source in the labs/ directory
Documentation page: Step-by-step walkthrough with KQL queries and expected observations
Expected evidence: Baseline, during-incident, and after-recovery evidence to validate your investigation

Available Labs

Performance

Lab	Symptom	Related Playbook
Cold Start	Elevated first-request latency after idle periods	High Latency / Slow Responses
Queue Backlog Scaling	Queue depth grows faster than processing throughput	Queue Messages Piling Up

Storage / Identity

Lab	Symptom	Related Playbook
Storage Access Failure	Triggers stop processing due to storage auth or connectivity issues	Functions Not Executing
Managed Identity Auth	Managed identity calls fail after RBAC or scope changes	Functions Failing with Errors

Network

Lab	Symptom	Related Playbook
DNS / VNet Resolution	Function app cannot resolve or reach private dependencies	Blob Trigger Not Firing

Execution / Runtime

Lab	Symptom	Related Playbook
Out of Memory Crash	Workers crash under memory pressure with large payloads	Out of Memory / Worker Crash
Deployment Not Running	Deployment succeeds but functions never execute	Deployment Failures
Durable Replay Storm	Durable orchestrations replay excessively with growing latency	Durable Orchestration Stuck

Event Processing

Lab	Symptom	Related Playbook
Timer Missed Schedules	Timer triggers miss scheduled executions after idle periods	Timeout / Execution Limit
Event Hub Checkpoint Lag	Event Hub processing falls behind and checkpoint lag grows	Event Hub / Service Bus Lag

Prerequisites

All labs require:

Azure subscription with Contributor access
Azure CLI installed and logged in (az login)
Bash shell (Linux, macOS, or WSL)
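
The prerequisites above can be checked before deploying anything. The following is a minimal sketch, not part of the lab repository: the function names `require_cmd` and `preflight` are illustrative, and the check relies on `az account show` exiting non-zero when no account is logged in. It does not verify Contributor access, which still needs to be confirmed separately.

```bash
# Preflight sketch: verify the lab prerequisites before deploying.
# Function names are illustrative, not part of the lab repository.

# Succeed if the named command is available in the current shell.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Check that the Azure CLI is installed and has an active login.
preflight() {
  if ! require_cmd az; then
    echo "Azure CLI (az) not found on PATH" >&2
    return 1
  fi
  # `az account show` fails when no account is logged in.
  if ! az account show --query name --output tsv >/dev/null 2>&1; then
    echo "Not logged in; run: az login" >&2
    return 1
  fi
  echo "Prerequisites look OK"
}
```

Call `preflight` at the start of a lab session and stop if it returns a non-zero status.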

General Workflow

# 1. Create resource group
az group create --name rg-lab-<name> --location koreacentral

# 2. Deploy infrastructure
az deployment group create \
  --resource-group rg-lab-<name> \
  --template-file labs/<name>/main.bicep \
  --parameters baseName=lab<short>

# 3. Deploy app code (zip deploy)
# 4. Trigger the failure scenario
# 5. Wait 2-5 minutes for logs to appear
# 6. Investigate using playbooks and KQL queries

# 7. Clean up
az group delete --name rg-lab-<name> --yes --no-wait
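
The same workflow can be wrapped in small functions so the resource group name is derived once and reused. This is a sketch under stated assumptions: the function names are illustrative, and the function app name and zip path passed to `deploy_code` are hypothetical inputs because the workflow above does not specify them. The zip deploy step uses `az functionapp deployment source config-zip`, which is one common way to do a zip deploy and may differ from what an individual lab documents.

```bash
# Workflow sketch: wraps the commands above in reusable functions.
# Function names and the app/zip arguments are illustrative assumptions.

# Build the resource group name used throughout the workflow.
lab_rg() {
  printf 'rg-lab-%s\n' "$1"
}

# Steps 1-2: create the resource group and deploy the Bicep template.
deploy_infra() {
  local name="$1" short="$2" location="${3:-koreacentral}"
  local rg
  rg="$(lab_rg "$name")"
  az group create --name "$rg" --location "$location" || return 1
  az deployment group create \
    --resource-group "$rg" \
    --template-file "labs/$name/main.bicep" \
    --parameters "baseName=lab$short"
}

# Step 3: zip deploy the app code (app name and zip path are assumptions).
deploy_code() {
  local name="$1" app="$2" zip="$3"
  az functionapp deployment source config-zip \
    --resource-group "$(lab_rg "$name")" \
    --name "$app" \
    --src "$zip"
}

# Step 7: delete the resource group without waiting for completion.
cleanup_lab() {
  az group delete --name "$(lab_rg "$1")" --yes --no-wait
}
```

For example, `deploy_infra <name> <short>` followed later by `cleanup_lab <name>` reproduces steps 1, 2, and 7 with the same naming convention.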

Cost

Each lab deploys Azure Functions resources. Delete the resource group after completing the lab to avoid ongoing charges.
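
To confirm nothing is left running, it can help to list resource groups that follow the `rg-lab-<name>` convention from the workflow. The sketch below assumes that convention was used; the function name `list_lab_groups` is illustrative.

```bash
# Cost check sketch: list resource groups that follow the rg-lab-<name>
# naming convention. The function name is illustrative.
list_lab_groups() {
  az group list \
    --query "[?starts_with(name, 'rg-lab-')].name" \
    --output tsv
}
```

An empty result suggests no lab resource groups remain in the current subscription. Because deletion with `--no-wait` is asynchronous, a group may still appear for a few minutes after cleanup.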

Recommended Learning Sequence

Start with broad reliability issues, then move into specialized scenarios:

Cold Start — Understand cold start vs dependency latency
Queue Backlog Scaling — Backlog triage and throughput analysis
Storage Access Failure — Storage auth and host errors
Managed Identity Auth — RBAC and identity troubleshooting
DNS / VNet Resolution — Network path and DNS diagnosis
Out of Memory Crash — Memory limits and worker recycling
Deployment Not Running — Successful deploy with no function execution
Durable Replay Storm — Orchestration replay performance
Timer Missed Schedules — Missed timer executions
Event Hub Checkpoint Lag — Checkpoint lag and throughput

Practice Checklist

For each lab, confirm your team can do all of the following without guesswork:

Detect the issue from alerts or dashboard anomalies.
Execute the First 10 Minutes checklist.
Select the right Playbook and isolate likely causes.
Run 2-3 focused KQL queries from the KQL Library.
Apply a minimal fix and verify recovery in telemetry.
Document root cause and prevention tasks.
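
For the KQL step in the checklist, queries can also be run from the shell rather than the portal. The sketch below is an illustration only: the query is a generic execution summary over the Application Insights `requests` table, not an entry from the KQL Library, and the function names are hypothetical. It assumes the `application-insights` Azure CLI extension is installed, which provides `az monitor app-insights query`.

```bash
# KQL sketch: run a query against Application Insights from the shell.
# The query and function names are illustrative, not from the KQL Library.

# Summarize function executions over the last 30 minutes by name and outcome.
execution_summary_kql() {
  cat <<'KQL'
requests
| where timestamp > ago(30m)
| summarize executions = count() by name, success
| order by name asc
KQL
}

# Run a KQL query; arguments are the Application Insights resource name,
# its resource group, and the query text.
run_kql() {
  local app="$1" rg="$2" query="$3"
  az monitor app-insights query \
    --app "$app" \
    --resource-group "$rg" \
    --analytics-query "$query"
}
```

A call such as `run_kql <app-insights-name> rg-lab-<name> "$(execution_summary_kql)"` before, during, and after the incident gives three comparable snapshots for the evidence trail.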

Evidence Collection Skills

Each lab trains specific diagnostic skills.

Mapping Labs to Common Production Incidents

Incident type	Best lab
Latency regression after idle periods	Cold Start
Trigger pipeline stalls	Storage Access Failure
Event ingestion cannot keep up	Queue Backlog Scaling
Private endpoint dependency outages	DNS / VNet Resolution
RBAC / identity breakages	Managed Identity Auth
Worker crashes under load	Out of Memory Crash
Deploy succeeded but nothing runs	Deployment Not Running
Orchestration latency grows over time	Durable Replay Storm
Scheduled tasks not running on time	Timer Missed Schedules
Event stream processing falling behind	Event Hub Checkpoint Lag