Hands-on Labs

These labs let you practice incident response on reproducible Azure Functions failure scenarios. Run labs in a non-production environment and treat them like live incidents: detect, triage, diagnose, fix, and verify.

graph TD
    A[Hands-on Labs] --> B[Performance]
    A --> C["Storage / Identity"]
    A --> D[Network]
    A --> E["Execution / Runtime"]
    A --> F[Event Processing]
    B --> B1[Cold Start]
    B --> B2[Queue Backlog Scaling]
    C --> C1[Storage Access Failure]
    C --> C2[Managed Identity Auth]
    D --> D1["DNS / VNet Resolution"]
    E --> E1[Out of Memory Crash]
    E --> E2[Deployment Not Running]
    E --> E3[Durable Replay Storm]
    F --> F1[Timer Missed Schedules]
    F --> F2[Event Hub Checkpoint Lag]

How Labs Work

Each lab includes:

  1. Lab infrastructure: Bicep templates and app source in the labs/ directory
  2. Documentation page: step-by-step walkthrough with KQL queries and expected observations
  3. Expected evidence: baseline, during-incident, and after-recovery evidence to validate your investigation
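One way to keep the three evidence phases separate while you work is a folder per phase. The sketch below is only an illustration: the helper name `init_evidence_dirs`, the `evidence/` root, and the folder names are assumptions, not part of the lab repository.

```shell
# Hypothetical helper: create one folder per evidence phase for a lab.
# Arguments: lab name, optional root directory (defaults to ./evidence).
init_evidence_dirs() {
  local lab="$1"
  local root="${2:-evidence}"
  local phase
  for phase in baseline during-incident after-recovery; do
    mkdir -p "${root}/${lab}/${phase}"
  done
}

# Example (lab name is a placeholder): init_evidence_dirs cold-start
```

Saving query output and screenshots into the matching folder makes it easier to compare what you observed against the expected evidence on the lab page.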

Available Labs

Performance

| Lab | Symptom | Related Playbook |
| --- | --- | --- |
| Cold Start | Elevated first-request latency after idle periods | High Latency / Slow Responses |
| Queue Backlog Scaling | Queue depth grows faster than processing throughput | Queue Messages Piling Up |

Storage / Identity

| Lab | Symptom | Related Playbook |
| --- | --- | --- |
| Storage Access Failure | Triggers stop processing due to storage auth or connectivity issues | Functions Not Executing |
| Managed Identity Auth | Managed identity calls fail after RBAC or scope changes | Functions Failing with Errors |

Network

| Lab | Symptom | Related Playbook |
| --- | --- | --- |
| DNS / VNet Resolution | Function app cannot resolve or reach private dependencies | Blob Trigger Not Firing |

Execution / Runtime

| Lab | Symptom | Related Playbook |
| --- | --- | --- |
| Out of Memory Crash | Workers crash under memory pressure with large payloads | Out of Memory / Worker Crash |
| Deployment Not Running | Deployment succeeds but functions never execute | Deployment Failures |
| Durable Replay Storm | Durable orchestrations replay excessively with growing latency | Durable Orchestration Stuck |

Event Processing

| Lab | Symptom | Related Playbook |
| --- | --- | --- |
| Timer Missed Schedules | Timer triggers miss scheduled executions after idle | Timeout / Execution Limit |
| Event Hub Checkpoint Lag | Event Hub processing falls behind and checkpoint lag grows | Event Hub / Service Bus Lag |

Prerequisites

All labs require:

  • Azure subscription with Contributor access
  • Azure CLI installed and logged in (az login)
  • Bash shell (Linux, macOS, or WSL)
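A quick preflight check can confirm the tooling prerequisites before you start a lab. This is a minimal sketch: the function names `require_cmd` and `preflight` are illustrative, and it does not verify Contributor access, which you still need to confirm separately.

```shell
# Return 0 if the named command is available, otherwise report it and return 1.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing required command: $1" >&2
    return 1
  fi
}

# Check the tooling prerequisites. Contributor access is NOT verified here.
preflight() {
  require_cmd bash || return 1
  require_cmd az || return 1
  # `az account show` fails when there is no active login session.
  if ! az account show --query name --output tsv >/dev/null 2>&1; then
    echo "not logged in; run: az login" >&2
    return 1
  fi
  echo "preflight ok"
}

# Example: preflight
```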

General Workflow

# 1. Create resource group
az group create --name rg-lab-<name> --location koreacentral

# 2. Deploy infrastructure
az deployment group create \
  --resource-group rg-lab-<name> \
  --template-file labs/<name>/main.bicep \
  --parameters baseName=lab<short>

# 3. Deploy app code (zip deploy)
# 4. Trigger the failure scenario
# 5. Wait 2-5 minutes for logs to appear
# 6. Investigate using playbooks and KQL queries

# 7. Clean up
az group delete --name rg-lab-<name> --yes --no-wait
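The workflow above can be wrapped in small functions so the same steps are repeatable across labs. The following is a sketch under stated assumptions: the function names, the `DRY_RUN` switch, the example lab name, and the function app name and zip path passed to `lab_deploy_code` are all hypothetical. Only the `az` commands themselves come from the workflow or the Azure CLI.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 1-2: create the resource group, then deploy the lab's Bicep template.
# Arguments: lab name, short base name, optional location.
lab_up() {
  local lab="$1"
  local short="$2"
  local location="${3:-koreacentral}"
  run az group create --name "rg-lab-${lab}" --location "$location" &&
  run az deployment group create \
    --resource-group "rg-lab-${lab}" \
    --template-file "labs/${lab}/main.bicep" \
    --parameters "baseName=lab${short}"
}

# Step 3: zip deploy. The app name and zip path are placeholders you supply.
lab_deploy_code() {
  local lab="$1"
  local app="$2"
  local zip="$3"
  run az functionapp deployment source config-zip \
    --resource-group "rg-lab-${lab}" --name "$app" --src "$zip"
}

# Step 7: delete the resource group without waiting for completion.
lab_down() {
  run az group delete --name "rg-lab-$1" --yes --no-wait
}

# Example (placeholder values): DRY_RUN=1 lab_up cold-start cs
```

Running with `DRY_RUN=1` first prints the commands without touching the subscription, which is a cheap way to check names and paths before deploying.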

Cost

Each lab deploys Azure Functions resources that accrue charges for as long as they exist. Delete the resource group as soon as you finish the lab to avoid ongoing charges.
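To confirm that nothing was left behind, you can list resource groups that follow the `rg-lab-` naming used in the workflow. This is a minimal sketch: the function names are illustrative, and it assumes every lab group was created with that prefix.

```shell
# List resource groups whose names start with the lab prefix.
list_lab_groups() {
  az group list --query "[?starts_with(name, 'rg-lab-')].name" --output tsv
}

# Print any remaining lab groups and return 1 if at least one exists.
warn_if_labs_remain() {
  local groups
  groups="$(list_lab_groups)"
  if [ -n "$groups" ]; then
    echo "lab resource groups still present:"
    echo "$groups"
    return 1
  fi
  echo "no lab resource groups found"
}

# Example: warn_if_labs_remain
```

Because the cleanup step uses `--no-wait`, a group may still be listed for a while after the delete command returns.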

Recommended Lab Order

Start with broad reliability issues, then move into specialized scenarios:

  1. Cold Start — Understand cold start vs dependency latency
  2. Queue Backlog Scaling — Backlog triage and throughput analysis
  3. Storage Access Failure — Storage auth and host errors
  4. Managed Identity Auth — RBAC and identity troubleshooting
  5. DNS / VNet Resolution — Network path and DNS diagnosis
  6. Out of Memory Crash — Memory limits and worker recycling
  7. Deployment Not Running — Successful deploy with no function execution
  8. Durable Replay Storm — Orchestration replay performance
  9. Timer Missed Schedules — Missed timer executions
  10. Event Hub Checkpoint Lag — Checkpoint lag and throughput

Practice Checklist

For each lab, confirm your team can do all of the following without guesswork:

  • Detect the issue from alerts or dashboard anomalies.
  • Execute the First 10 Minutes checklist.
  • Select the right Playbook and isolate likely causes.
  • Run 2-3 focused KQL queries from the KQL Library.
  • Apply a minimal fix and verify recovery in telemetry.
  • Document root cause and prevention tasks.
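KQL queries can also be run from the shell, which helps when you want to save output as evidence. The sketch below assumes the Azure CLI `application-insights` extension is installed and that the lab sends telemetry to an Application Insights resource whose name you know. The helper name, the `DRY_RUN` switch, and the query itself are illustrative and not taken from the KQL Library.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: request count and average duration in 5-minute bins
# over the last hour, split by success.
# Arguments: resource group, Application Insights resource name (placeholders).
query_requests() {
  local rg="$1"
  local app="$2"
  run az monitor app-insights query \
    --resource-group "$rg" \
    --app "$app" \
    --offset 1h \
    --analytics-query 'requests | summarize count(), avg(duration) by bin(timestamp, 5m), success | order by timestamp asc'
}

# Example (placeholder values): query_requests rg-lab-cold-start appi-labcs
```

Running the same query before triggering the failure, during the incident, and after the fix gives a simple baseline, incident, and recovery comparison.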

Evidence Collection Skills

Each lab trains specific diagnostic skills:

| Lab | Primary Skill | Secondary Skill |
| --- | --- | --- |
| Cold Start | Correlating host startup with request latency | Reading trace timeline |
| Storage Access Failure | Identifying auth errors in host logs | Verifying RBAC with CLI |
| Queue Backlog Scaling | Reading queue metrics vs execution metrics | Identifying poison message loops |
| DNS / VNet Resolution | Diagnosing DNS errors in dependency calls | Verifying private DNS zone configuration |
| Managed Identity Auth | Tracing RBAC changes in activity log | Correlating exceptions with config changes |
| Out of Memory Crash | Detecting OOM exceptions and worker restarts | Correlating memory pressure with payload size |
| Deployment Not Running | Reading function discovery logs | Validating project structure and runtime config |
| Durable Replay Storm | Measuring replay duration growth | Identifying orchestration history bloat |
| Timer Missed Schedules | Verifying timer execution gaps | Using isPastDue and RunOnStartup parameters |
| Event Hub Checkpoint Lag | Measuring checkpoint offset vs partition tail | Tuning batch size and prefetch settings |
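As an illustration of the "Verifying RBAC with CLI" skill, the sketch below looks up a function app's system-assigned identity and lists its role assignments at a given scope. The helper name and all argument values are hypothetical, and it assumes the lab uses a system-assigned identity. A user-assigned identity would need a different lookup.

```shell
# Hypothetical helper: list role assignments for a function app's
# system-assigned managed identity at the given scope.
# Arguments: resource group, function app name, scope (a resource ID).
show_identity_roles() {
  local rg="$1"
  local app="$2"
  local scope="$3"
  local principal
  principal="$(az functionapp identity show \
    --resource-group "$rg" --name "$app" \
    --query principalId --output tsv)" || return 1
  if [ -z "$principal" ]; then
    echo "no system-assigned identity found on $app" >&2
    return 1
  fi
  az role assignment list --assignee "$principal" --scope "$scope" --output table
}

# Example (placeholder values):
# show_identity_roles rg-lab-<name> <function-app-name> <storage-account-resource-id>
```

An empty result at the expected scope is a strong hint that the auth failure comes from a missing or mis-scoped role assignment rather than from application code.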

Mapping Labs to Common Production Incidents

| Incident type | Best lab |
| --- | --- |
| Latency regression after idle periods | Cold Start |
| Trigger pipeline stalls | Storage Access Failure |
| Event ingestion cannot keep up | Queue Backlog Scaling |
| Private endpoint dependency outages | DNS / VNet Resolution |
| RBAC / identity breakages | Managed Identity Auth |
| Worker crashes under load | Out of Memory Crash |
| Deploy succeeded but nothing runs | Deployment Not Running |
| Orchestration latency grows over time | Durable Replay Storm |
| Scheduled tasks not running on time | Timer Missed Schedules |
| Event stream processing falling behind | Event Hub Checkpoint Lag |
