Health Check Eviction on Partial Dependency Failure
Status: Published
Experiment completed with real data collected on 2026-04-10 from Azure App Service P1v3 (koreacentral). Four test scenarios were executed across 2 instances over ~40 minutes. All four hypotheses were confirmed, and a critical nuance was discovered along the way.
1. Question
When an App Service health check endpoint returns unhealthy because a single downstream dependency (e.g., database) is unreachable, does the platform evict the instance even though the application itself is running and could serve requests that don't require that dependency?
2. Why this matters
Customers implement health check endpoints that validate all dependencies. When one dependency fails, the health check returns unhealthy, and the platform removes the instance from the load balancer rotation. This can cascade — if the unhealthy dependency affects all instances equally, every instance gets evicted, causing a full outage for a partial dependency failure.
Background: How Health Check Eviction Works
Azure App Service Health Check probes each instance every 1 minute at a configured path. When an instance returns a status code outside the 200-299 range for a threshold number of consecutive checks (10 by default), the platform marks the instance as unhealthy and removes it from load balancer rotation.
┌─────────────────────────────────────────────────────┐
│              App Service Load Balancer              │
│   ┌───────────────┐         ┌───────────────┐       │
│   │  Instance A   │         │  Instance B   │       │
│   │ /healthz→503  │         │ /healthz→200  │       │
│   │  (UNHEALTHY)  │         │   (HEALTHY)   │       │
│   └───────┬───────┘         └───────┬───────┘       │
│           │                         │               │
│    ✗ Evicted after           ✓ Receives 100%        │
│      ~10 minutes               of traffic           │
└─────────────────────────────────────────────────────┘
Critical design constraint: If ALL instances are unhealthy, the platform does NOT evict any instance. This prevents cascading eviction from turning a partial failure into a total outage.
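The eviction rule and the protection constraint described above can be captured in a toy model. The sketch below is only an illustration of the behavior this section describes, not the platform's implementation; `EVICTION_THRESHOLD`, `Instance`, and `apply_probe_round` are hypothetical names, and the threshold of 10 is taken from the ~10-minute figure in the diagram.

```python
from dataclasses import dataclass

EVICTION_THRESHOLD = 10  # assumed: ~10 consecutive failed probes, one per minute


@dataclass
class Instance:
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True


def apply_probe_round(instances, probe_passed):
    """Apply one probe round; probe_passed maps instance name -> bool."""
    # Update the consecutive-failure counters first.
    for inst in instances:
        if probe_passed[inst.name]:
            inst.consecutive_failures = 0
        else:
            inst.consecutive_failures += 1

    # Then decide evictions for instances that crossed the threshold.
    for inst in instances:
        if not inst.in_rotation or inst.consecutive_failures < EVICTION_THRESHOLD:
            continue
        # Protection rule: evict only if another in-rotation instance is
        # still below the failure threshold; otherwise keep it serving.
        another_healthy = any(
            other is not inst
            and other.in_rotation
            and other.consecutive_failures < EVICTION_THRESHOLD
            for other in instances
        )
        if another_healthy:
            inst.in_rotation = False
    return instances
```

Feeding this model ten rounds where only Instance A fails removes A from rotation, while ten rounds where every instance fails removes nothing.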
3. Customer symptom
- "Our app went completely down, but only the database was unreachable for 2 minutes."
- "Health check keeps failing and instances keep getting removed and re-added."
- "We see instance cycling in the health check blade even though the app is fine."
- "After we fixed the dependency, it still took 10 minutes for the instance to come back."
4. Hypothesis
- H1 — Partial eviction: When a single instance fails health checks while others are healthy, the platform evicts only the unhealthy instance after ~10 consecutive failures (~10 minutes).
- H2 — Total failure protection: When ALL instances fail health checks simultaneously, the platform does NOT evict any instance — all instances continue receiving traffic.
- H3 — Cascading amplification: If one instance is evicted and the last remaining instance also becomes unhealthy, the platform keeps the last instance in rotation (never reduces to zero).
- H4 — Recovery: After restart, evicted instances re-enter rotation within 1-2 minutes.
5. Environment
| Parameter | Value |
|---|---|
| Service | Azure App Service |
| SKU / Plan | P1v3 (2 instances) |
| Region | Korea Central |
| Runtime | Python 3.11 |
| OS | Linux |
| Deployment method | ZIP Deploy |
| ARR Affinity | Disabled |
| Health Check Path | /healthz |
| Health Check Interval | 1 minute (platform default) |
| Date tested | 2026-04-10 |
Instances:
| Instance | Short ID | Hostname | Availability Zone |
|---|---|---|---|
| A | 4b6100b2a00e | c7cfde03186d | koreacentral-az3 |
| B | ed56515e4ed9 | defc2b4c67d1 | koreacentral-az2 |
6. Variables
Experiment type: Config
Controlled:
- Health check path (/healthz) and response logic (200 if dependency healthy, 503 if not)
- Dependency failure simulation (in-memory per-instance flag)
- Instance count (2, fixed P1v3 plan)
- Which instance(s) have failed dependency
Observed:
- Traffic distribution across instances (30 requests per measurement window)
- Instance state via az webapp list-instances (READY, STOPPED, UNKNOWN)
- Time from first health check failure to eviction
- Behavior when all instances are unhealthy
- Behavior when last healthy instance fails (cascading scenario)
- Recovery time after az webapp restart
7. Instrumentation
- Test application: Custom Flask app with simulated dependency and health check logging
- Traffic measurement: 30 sequential HTTP requests to /status per check, counting instance distribution
- Instance state: az webapp list-instances --query "[].{name:name, state:state}" — reports READY, STOPPED, or UNKNOWN
- Monitoring interval: Every 2 minutes for up to 16 minutes per test
- Failure control: POST to /fail-dependency targets a specific instance; if wrong instance is hit, immediately POST /recover-dependency and retry
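The traffic measurement described above can be sketched as a small counting loop. This is a minimal illustration rather than the script used in the experiment: `fetch_status` and `measure_distribution` are hypothetical helpers, the base URL is a placeholder you would supply, and the 12-character prefix is chosen to match the Short ID column in the Environment table.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_status(base_url):
    """GET <base_url>/status and return the decoded JSON body."""
    with urlopen(f"{base_url}/status", timeout=10) as resp:
        return json.load(resp)


def measure_distribution(fetch, requests_per_window=30):
    """Call fetch() sequentially and count responses per short instance ID.

    fetch is any zero-argument callable returning a dict with an
    "instance_id" key, as the /status endpoint of the test app does.
    """
    counts = Counter()
    for _ in range(requests_per_window):
        body = fetch()
        counts[body["instance_id"][:12]] += 1
    return counts
```

A measurement window would then look like `measure_distribution(lambda: fetch_status(base_url))`, where `base_url` is the app's HTTPS address. Passing the fetch function in as a parameter keeps the counting logic testable without network access.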
8. Procedure
8.1 Application Code
app.py
"""
Health Check Eviction Test App for Azure App Service.

This app simulates a health check endpoint that depends on an external dependency.
The dependency can be toggled to simulate failure, triggering health check eviction.
"""

import os
import socket
import time
import threading
from datetime import datetime, timezone

from flask import Flask, jsonify, request

app = Flask(__name__)

# Simulated dependency state - shared across requests on this instance
dependency_state = {
    "healthy": True,
    "failed_since": None,
    "failure_count": 0,
    "recovery_time": None,
}

# Health check call log - track each health check probe
health_check_log = []
MAX_LOG_SIZE = 500

# Request log - track all requests
request_log = []
MAX_REQUEST_LOG = 500

# Lock for thread safety
state_lock = threading.Lock()

def _get_instance_id():
    return os.environ.get(
        "WEBSITE_INSTANCE_ID",
        os.environ.get("COMPUTERNAME", socket.gethostname()),
    )


def _log_health_check(status_code, reason):
    with state_lock:
        entry = {
            "timestamp_utc": datetime.now(timezone.utc).isoformat(),
            "instance_id": _get_instance_id()[:16],
            "hostname": socket.gethostname(),
            "status_code": status_code,
            "reason": reason,
            "dependency_healthy": dependency_state["healthy"],
        }
        health_check_log.append(entry)
        if len(health_check_log) > MAX_LOG_SIZE:
            health_check_log.pop(0)


def _log_request(endpoint, status_code):
    with state_lock:
        entry = {
            "timestamp_utc": datetime.now(timezone.utc).isoformat(),
            "instance_id": _get_instance_id()[:16],
            "hostname": socket.gethostname(),
            "endpoint": endpoint,
            "status_code": status_code,
            "pid": os.getpid(),
        }
        request_log.append(entry)
        if len(request_log) > MAX_REQUEST_LOG:
            request_log.pop(0)

@app.route("/")
def index():
    """Root endpoint - always responds (not tied to dependency)."""
    _log_request("/", 200)
    return jsonify({
        "status": "ok",
        "instance_id": _get_instance_id(),
        "hostname": socket.gethostname(),
        "pid": os.getpid(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "dependency_healthy": dependency_state["healthy"],
    })


@app.route("/healthz")
def health_check():
    """Health check endpoint that depends on simulated external service.

    Returns 200 if dependency is healthy, 503 if dependency is down.
    This is the endpoint configured in App Service Health Check.
    """
    if dependency_state["healthy"]:
        _log_health_check(200, "dependency_healthy")
        return jsonify({
            "status": "healthy",
            "instance_id": _get_instance_id()[:16],
            "hostname": socket.gethostname(),
            "dependency": "connected",
            "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        }), 200
    else:
        _log_health_check(503, "dependency_unavailable")
        return jsonify({
            "status": "unhealthy",
            "instance_id": _get_instance_id()[:16],
            "hostname": socket.gethostname(),
            "dependency": "unreachable",
            "failed_since": dependency_state["failed_since"],
            "failure_count": dependency_state["failure_count"],
            "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        }), 503

@app.route("/api/data")
def api_data():
    """Normal API endpoint that doesn't need the dependency."""
    _log_request("/api/data", 200)
    return jsonify({
        "data": "This endpoint works regardless of dependency status",
        "instance_id": _get_instance_id()[:16],
        "hostname": socket.gethostname(),
        "dependency_healthy": dependency_state["healthy"],
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    })


@app.route("/fail-dependency", methods=["POST"])
def fail_dependency():
    """Simulate dependency failure."""
    with state_lock:
        dependency_state["healthy"] = False
        dependency_state["failed_since"] = datetime.now(timezone.utc).isoformat()
        dependency_state["failure_count"] = 0
    return jsonify({
        "action": "dependency_failed",
        "instance_id": _get_instance_id()[:16],
        "hostname": socket.gethostname(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    })


@app.route("/recover-dependency", methods=["POST"])
def recover_dependency():
    """Simulate dependency recovery."""
    with state_lock:
        dependency_state["healthy"] = True
        dependency_state["recovery_time"] = datetime.now(timezone.utc).isoformat()
    return jsonify({
        "action": "dependency_recovered",
        "instance_id": _get_instance_id()[:16],
        "hostname": socket.gethostname(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    })

@app.route("/status")
def status():
    """Get current instance status and health check log."""
    return jsonify({
        "instance_id": _get_instance_id(),
        "hostname": socket.gethostname(),
        "pid": os.getpid(),
        "dependency_state": dependency_state,
        "health_check_log_count": len(health_check_log),
        "health_check_log_last_10": health_check_log[-10:],
        "request_log_count": len(request_log),
        "request_log_last_10": request_log[-10:],
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    })


@app.route("/logs/healthcheck")
def healthcheck_logs():
    """Get full health check log."""
    return jsonify({
        "instance_id": _get_instance_id()[:16],
        "hostname": socket.gethostname(),
        "total_entries": len(health_check_log),
        "entries": health_check_log,
    })


@app.route("/logs/requests")
def request_logs():
    """Get full request log."""
    return jsonify({
        "instance_id": _get_instance_id()[:16],
        "hostname": socket.gethostname(),
        "total_entries": len(request_log),
        "entries": request_log,
    })


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
requirements.txt
Design Notes
- In-memory dependency simulation: Uses a dictionary (dependency_state) guarded by threading.Lock on writes to simulate an external dependency that can be toggled healthy/unhealthy per-instance. This avoids needing a real database while allowing per-instance control.
- Per-instance state isolation: Each App Service instance runs its own container with its own process - dependency_state is process-local. This means POST /fail-dependency on Instance A does not affect Instance B, which is exactly what we need to test partial failure scenarios.
- Health check logging: Every /healthz probe is logged with timestamp, instance ID, hostname, and result. This creates an audit trail to correlate with platform health check decisions (eviction timing).
- Truncated instance ID: _get_instance_id()[:16] is used because Azure's WEBSITE_INSTANCE_ID is a 64-character hex string; the first 16 chars are sufficient for visual differentiation in logs.
- Request logging: The / and /api/data endpoints log requests to an in-memory list (capped at 500 entries) for post-test forensic analysis without needing Application Insights.
- gunicorn single-worker: The startup command uses gunicorn --bind=0.0.0.0 --timeout 600 app:app with the default 1 worker because thread safety of the shared dependency_state dict is simpler with a single process.
Endpoint Map
| Endpoint | Method | Purpose | Hypothesis Link | Response |
|---|---|---|---|---|
| / | GET | Root - always returns 200 regardless of dependency state | Baseline - confirms app is running | {"status": "ok", "instance_id": "...", "dependency_healthy": true/false} |
| /healthz | GET | Health check endpoint configured in App Service | Tests H1-H3 - platform probes this every 1 minute; returns 503 when dependency is failed | 200 + {"status": "healthy"} or 503 + {"status": "unhealthy"} |
| /api/data | GET | Normal API endpoint that works without dependency | Shows that the app can serve requests even when health check fails - illustrates the cascading eviction problem | {"data": "This endpoint works regardless..."} |
| /fail-dependency | POST | Toggles dependency to unhealthy state | Triggers health check failure on the targeted instance | {"action": "dependency_failed", "instance_id": "..."} |
| /recover-dependency | POST | Restores dependency to healthy state | Ends the failure simulation for recovery testing (H4) | {"action": "dependency_recovered", "instance_id": "..."} |
| /status | GET | Returns full instance state including health check log | Forensic analysis - shows dependency state, last 10 health checks and requests | Full JSON with dependency_state, logs |
| /logs/healthcheck | GET | Returns complete health check probe log | Correlates platform probe timing with eviction decisions | {"entries": [...]} |
| /logs/requests | GET | Returns complete request log | Tracks traffic distribution across instances | {"entries": [...]} |
8.2 Deploy test infrastructure
# Create resource group and P1v3 plan (2 instances for eviction testing)
az group create --name rg-healthcheck-lab --location koreacentral
az appservice plan create --name plan-healthcheck \
--resource-group rg-healthcheck-lab --sku P1v3 --is-linux \
--number-of-workers 2
# Create Python 3.11 web app with health check
az webapp create --name app-healthcheck-lab \
--resource-group rg-healthcheck-lab \
--plan plan-healthcheck --runtime "PYTHON:3.11"
# Configure health check and disable ARR affinity
az webapp config set --name app-healthcheck-lab \
--resource-group rg-healthcheck-lab \
--generic-configurations '{"healthCheckPath": "/healthz"}'
az webapp update --name app-healthcheck-lab \
--resource-group rg-healthcheck-lab \
--client-affinity-enabled false
# Set startup command and deploy
az webapp config set --name app-healthcheck-lab \
--resource-group rg-healthcheck-lab \
--startup-file "gunicorn --bind=0.0.0.0 --timeout 600 app:app"
az webapp deploy --name app-healthcheck-lab \
--resource-group rg-healthcheck-lab \
--src-path healthcheck-app.zip --type zip
8.3 Verify baseline — both instances healthy
- Send 20 requests to /status, verify both instances appear with healthy=True
- Run az webapp list-instances — both should show READY
8.4 Test 1 — All instances unhealthy simultaneously
- POST /fail-dependency repeatedly until both instances report unhealthy
- Monitor traffic distribution every 2 minutes for 12+ minutes
- Verify no eviction occurs (both instances continue receiving traffic)
- POST /recover-dependency to restore both instances
8.5 Test 2 — Partial failure (one instance unhealthy)
- POST /fail-dependency selectively to Instance A only
- Verify Instance A returns 503 on /healthz, Instance B returns 200
- Monitor traffic distribution every 2 minutes
- Observe eviction event (Instance A stops receiving traffic)
- Verify via az webapp list-instances — Instance A state changes
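Because ARR Affinity is disabled, a POST lands on whichever instance the load balancer picks, so failing only Instance A takes a retry loop. The sketch below is one possible way to do that, assuming a hypothetical `post(path)` callable that sends the POST and returns the decoded JSON body (which, in the test app, carries the 16-character `instance_id`). It is not the script used in the experiment.

```python
def _recover_instance(post, instance_id, max_attempts):
    """POST /recover-dependency until the given instance answers."""
    for _ in range(max_attempts):
        if post("/recover-dependency")["instance_id"] == instance_id:
            return
    raise RuntimeError(f"could not reach {instance_id} to recover it")


def fail_only_target(post, target_prefix, max_attempts=50):
    """Fail the dependency on the target instance and on no other.

    Returns the number of /fail-dependency attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        hit = post("/fail-dependency")["instance_id"]
        if hit.startswith(target_prefix):
            return attempt
        # The wrong instance was failed: undo that before retrying.
        _recover_instance(post, hit, max_attempts)
    raise RuntimeError(f"target {target_prefix} not reached")
```

Recovery requests that happen to land on the still-healthy target are harmless, since they only set an already-healthy flag back to healthy.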
8.6 Test 3 — Cascading failure
- Start from partial failure state (Instance A evicted)
- POST /fail-dependency to Instance B (the only remaining instance)
- Monitor whether Instance B gets evicted or stays in rotation
- Check instance states via az webapp list-instances
8.7 Recovery test
- Execute az webapp restart from cascading failure state
- Measure time until both instances appear in traffic distribution
- Verify via az webapp list-instances — both return to READY
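The "time until both instances appear" measurement can be sketched as a polling loop. This is an illustrative helper under stated assumptions, not the tooling used in the experiment: `wait_for_all_instances` is a hypothetical name, `sample` is any callable returning the instance IDs seen in one measurement window, and the 15-second interval and 300-second timeout are arbitrary defaults.

```python
import time


def wait_for_all_instances(sample, expected_ids, timeout_s=300, interval_s=15,
                           clock=time.monotonic, sleep=time.sleep):
    """Poll sample() until every expected instance ID has been seen.

    Returns the elapsed seconds at the first window containing all
    expected IDs, or None if the timeout is reached first.
    """
    expected = set(expected_ids)
    start = clock()
    while True:
        seen = set(sample())
        elapsed = clock() - start
        if expected <= seen:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)
```

The resolution of the result is bounded by `interval_s` plus the time one sample takes, so a reported value is an upper bound on the true recovery time. The `clock` and `sleep` parameters exist so the loop can be exercised with a fake clock.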
8.8 Clean up
9. Expected signal
- Health check returns 503 after /fail-dependency on target instance
- Instance removed from rotation after ~10 consecutive failures (~10 minutes)
- When ALL instances fail simultaneously, no eviction occurs
- When the last healthy instance fails, it remains in rotation
- After az webapp restart, recovery within 1-2 minutes
10. Results
10.1 Test 1: All Instances Unhealthy Simultaneously
Both instances' /healthz endpoints returned 503 continuously for 12+ minutes.
| Time | Instance A | Instance B | Traffic Split | Eviction |
|---|---|---|---|---|
| T+0min | 503 | 503 | ~50/50 | No |
| T+2min | 503 | 503 | ~50/50 | No |
| T+4min | 503 | 503 | ~50/50 | No |
| T+6min | 503 | 503 | ~50/50 | No |
| T+8min | 503 | 503 | ~50/50 | No |
| T+10min | 503 | 503 | ~50/50 | No |
| T+12min | 503 | 503 | ~50/50 | No |
How to read this
After 12 minutes of continuous health check failures on BOTH instances, neither was evicted. The platform continued routing traffic to both instances equally. This confirms the protection mechanism: when all instances are unhealthy, App Service preserves the existing state to prevent total outage.
10.2 Test 2: Partial Failure (One Instance Unhealthy)
Instance A /healthz → 503. Instance B /healthz → 200.
| Time | Instance A Traffic | Instance B Traffic | Instance A State |
|---|---|---|---|
| T+0min | 50% (15/30) | 50% (15/30) | READY |
| T+2min | 40% (12/30) | 60% (18/30) | UNHEALTHY |
| T+4min | 43% (13/30) | 57% (17/30) | UNHEALTHY |
| T+6min | 60% (18/30) | 40% (12/30) | UNHEALTHY |
| T+8min | 47% (14/30) | 53% (16/30) | UNHEALTHY |
| T+10min | 0% (0/30) | 100% (30/30) | UNKNOWN |
Traffic to Instance A (UNHEALTHY):
T+0min ████████████████████████████████████████████████ 50%
T+2min ████████████████████████████████ 40%
T+4min ██████████████████████████████████ 43%
T+6min ████████████████████████████████████████████████ 60%
T+8min █████████████████████████████████████ 47%
T+10min 0% ← EVICTED
How to read this
For the first 8 minutes, traffic was distributed roughly equally to both instances — the platform did NOT gradually reduce traffic to the unhealthy instance. Then at ~10 minutes, eviction was instant and complete: Instance A went from receiving ~47% of traffic to exactly 0%. This is a binary on/off switch, not a gradual drain.
Post-eviction verification (50 additional requests): 50/50 (100%) went to Instance B.
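The binary nature of the eviction can be checked directly against the Test 2 table. The sketch below re-enters the Instance A request counts from that table and locates the sampling interval in which traffic dropped to zero; `eviction_window` and `pre_eviction_mean_share` are hypothetical helper names introduced for illustration.

```python
# (minutes since failure, requests served by Instance A out of 30),
# copied from the Test 2 table above.
SAMPLES = [(0, 15), (2, 12), (4, 13), (6, 18), (8, 14), (10, 0)]
WINDOW = 30


def eviction_window(samples):
    """Return (last minute with traffic, first minute without), or None."""
    for (t_prev, hits_prev), (t_next, hits_next) in zip(samples, samples[1:]):
        if hits_prev > 0 and hits_next == 0:
            return (t_prev, t_next)
    return None


def pre_eviction_mean_share(samples, window=WINDOW):
    """Mean traffic share of the instance over samples that saw traffic."""
    served = [hits for _, hits in samples if hits > 0]
    return sum(served) / (len(served) * window)


print(eviction_window(SAMPLES))          # (8, 10)
print(pre_eviction_mean_share(SAMPLES))  # 0.48
```

Across the five pre-eviction samples Instance A served 72 of 150 requests (48%), with no downward trend before the drop to 0 at T+10min.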
Instance state via Azure API:
| Instance | State |
|---|---|
| A (4b6100b2a00e) | UNKNOWN |
| B (ed56515e4ed9) | READY |
ARRAffinity bypass attempt: Attempted to route to Instance A using ARRAffinity cookie with Instance A's full ID — failed. Traffic was routed to Instance B regardless. Evicted instances are completely removed from the load balancer; no client-side routing can reach them.
10.3 Test 3: Cascading Failure
Starting state: Instance A already evicted (UNKNOWN), Instance B serving 100% of traffic.
Step 1: Fail Instance B's dependency (POST /fail-dependency).
| Time After B Failure | Instance B Traffic | Instance B State | Notes |
|---|---|---|---|
| T+0min | 100% (30/30) | READY → UNHEALTHY | B is last instance |
| T+2min | 100% (30/30) | UNHEALTHY | Still serving |
| T+4min | 100% (30/30) | UNHEALTHY | Still serving |
| T+6min | 100% (30/30) | STOPPED | Still serving despite STOPPED state |
How to read this
Instance B's health check was returning 503 for 6+ minutes, yet it continued receiving 100% of traffic. The instance state transitioned from READY → STOPPED, but the platform kept routing to it because it was the last instance in rotation. App Service will never reduce to zero instances — even if the only remaining instance is unhealthy.
Instance states during cascading failure:
| Instance | State | In Rotation | Notes |
|---|---|---|---|
| A (4b6100b2a00e) | UNKNOWN | No | Evicted in Test 2 |
| B (ed56515e4ed9) | STOPPED | Yes | Last instance protection |
10.4 Recovery After Restart
az webapp restart issued at 2026-04-10T10:06:24Z with both instances in degraded state.
| Time After Restart | Instance A Traffic | Instance B Traffic | A State | B State |
|---|---|---|---|---|
| T+0s | — | — | UNKNOWN | STOPPED |
| T+15s | 60% | 40% | STOPPED | STOPPED |
| T+90s | 53% | 47% | STOPPED | READY |
| T+150s | 50% | 50% | READY | READY |
How to read this
Recovery was nearly instant — within 15 seconds of az webapp restart, both instances were serving traffic again. The instance state API lagged behind: Instance A transitioned through UNKNOWN → STOPPED → READY over ~150 seconds, even though it was already receiving and responding to requests.
10.5 Summary: Eviction Behavior Matrix
| Scenario | Eviction Occurs? | Time to Evict | Traffic During Eviction | Instance State |
|---|---|---|---|---|
| All instances unhealthy | No | N/A | 50/50 (unchanged) | Both remain in rotation |
| One instance unhealthy | Yes | ~10 minutes | Instant 0% → 100% shift | UNKNOWN |
| Last instance becomes unhealthy | No | N/A | 100% to last instance | STOPPED (but still serving) |
| Recovery via az webapp restart | N/A | ~15 seconds | Both serve immediately | READY after ~150s |
11. Interpretation
H1 — Partial eviction: CONFIRMED. When Instance A failed health checks while Instance B remained healthy, Instance A was removed from load balancer rotation [Observed] after approximately 10 minutes [Measured] (~10 consecutive failed health probes at 1-minute intervals; with samples taken every 2 minutes, the eviction fell between the T+8min and T+10min measurements). The eviction was binary — traffic shifted from ~50% to 0% instantly [Measured], with no gradual drain period [Observed].
H2 — Total failure protection: CONFIRMED. When both instances failed health checks simultaneously, neither was evicted [Observed] even after 12+ minutes of continuous failures [Measured]. The platform maintained the existing traffic distribution [Measured].
H3 — Cascading amplification: CONFIRMED with nuance. When the last healthy instance also became unhealthy, it was NOT evicted [Observed] — it remained in rotation with 100% of traffic [Measured]. The platform's protection mechanism prevents reducing to zero healthy instances [Inferred]. However, the instance state changed to STOPPED [Observed], which could mislead monitoring dashboards.
H4 — Recovery: CONFIRMED. az webapp restart restored both instances to active rotation within 15 seconds [Measured], though the Azure API's instance state lagged behind by ~150 seconds [Measured].
Key Discovery: Binary Eviction, Not Gradual Drain
The most significant finding is that health check eviction is an all-or-nothing switch [Inferred]:
- Before eviction: the unhealthy instance receives a normal share of traffic (~50%) [Measured]
- After eviction: the unhealthy instance receives exactly 0% of traffic [Measured]
- There is no "draining" period where traffic is gradually shifted [Observed]
This means that for the first ~10 minutes after a health check starts failing, users hitting the unhealthy instance will experience degraded service [Inferred]. The platform does not reduce traffic to unhealthy instances — it either routes normally or stops routing entirely [Observed].
Key Discovery: State API Lag
The az webapp list-instances state does not reflect real-time routing decisions [Observed]:
| Actual Behavior | Reported State |
|---|---|
| Instance evicted from LB | UNKNOWN |
| Instance receiving traffic but unhealthy | STOPPED |
| Instance just restored, serving traffic | STOPPED (for ~90-150s) |
| Instance fully operational | READY |
Monitoring systems that rely on instance state will report misleading status during transitions [Inferred].
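One way to guard a monitoring system against this lag is to cross-check the reported state against observed traffic. The sketch below assumes two inputs that are not part of the original tooling: a mapping of instance to reported state (as produced from the az webapp list-instances query) and a mapping of instance to requests observed in the latest measurement window. `find_state_mismatches` is a hypothetical helper name.

```python
def find_state_mismatches(reported_states, traffic_counts):
    """Flag instances whose reported state disagrees with observed routing.

    reported_states: {instance: "READY" | "STOPPED" | "UNKNOWN"}
    traffic_counts:  {instance: requests observed in the last window}
    """
    mismatches = []
    for instance, state in reported_states.items():
        serving = traffic_counts.get(instance, 0) > 0
        if serving and state != "READY":
            mismatches.append((instance, state, "serving traffic but not READY"))
        elif not serving and state == "READY":
            mismatches.append((instance, state, "READY but received no traffic"))
    return mismatches
```

Applied to the cascading-failure snapshot from Test 3 (A: UNKNOWN with no traffic, B: STOPPED with 30/30 requests), this flags only Instance B, which matches the "STOPPED but still serving" observation.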
12. What this proves
Evidence-based conclusions
- Health check eviction occurs after ~10 consecutive failed probes [Measured] (~10 minutes at 1-minute intervals).
- Eviction is binary: traffic shifts from ~50% to 0% instantly [Measured] — no gradual drain [Observed].
- When all instances are unhealthy, the platform does NOT evict any instance [Observed] — protects against total outage [Inferred].
- When the last remaining instance becomes unhealthy, it stays in rotation [Observed] (never reduces to zero).
- ARRAffinity cookies cannot route to evicted instances [Observed] — they are fully removed from the load balancer.
- az webapp restart recovers evicted instances to active routing within ~15 seconds [Measured].
- Instance state API (az webapp list-instances) lags behind actual routing decisions by 90-150 seconds [Measured].
13. What this does NOT prove
- Custom health check threshold: We tested only the default threshold. The WEBSITE_HEALTHCHECK_MAXPINGFAILURES app setting (which accepts 2 to 10 failed probes) may alter the eviction timing.
- Health check with custom interval: We used the default 1-minute interval. Custom intervals may affect the eviction timeline proportionally.
- Instance replacement: We did not verify whether App Service replaces evicted instances with new ones, or simply removes them from rotation. P1v3 does not auto-scale — this may behave differently on Consumption or Elastic Premium plans.
- Long-term eviction behavior: We observed eviction for ~10 minutes. It's unclear whether the platform eventually terminates (kills) a long-evicted instance or just keeps it running indefinitely.
- Health check with authentication: If the health check path requires authentication, the behavior may differ.
- Scale-in during eviction: We did not test whether the platform counts evicted instances toward the instance count or treats them as "missing."
14. Support takeaway
For support engineers
When a customer reports "app went down but only one dependency failed":
- Check if health check validates ALL dependencies — this is the most common cause of cascading eviction
- Ask how many instances were running — partial eviction only happens when at least one instance is healthy
- Check timing — health check eviction takes ~10 minutes, so a 2-minute database blip should NOT cause eviction
Key guidance:
- Health check endpoints should validate only critical dependencies that are required for ALL request paths
- If a dependency affects only some API endpoints, consider a shallow health check that returns 200 if the app process is alive, regardless of downstream health
- Design health checks with circuit breaker awareness: if a dependency is known to be down but expected to recover, the health check should not immediately return unhealthy
- After fixing the root cause, az webapp restart is the fastest way to restore evicted instances (~15 seconds vs waiting for health check to pass ~10 consecutive times)
- Do NOT rely on az webapp list-instances state for real-time routing status — it lags by 90-150 seconds
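The circuit-breaker guidance above can be sketched as a grace period: the health check keeps reporting healthy until a dependency has been down longer than a tolerated window. This is a minimal illustration, not a recommended production implementation; `health_status` is a hypothetical helper and the 5-minute `GRACE_PERIOD` is an arbitrary assumption to tune per application.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(minutes=5)  # hypothetical value; tune per application


def health_status(dependency_failed_since, now=None, grace=GRACE_PERIOD):
    """Return the HTTP status a grace-period-aware health check would send.

    dependency_failed_since: timezone-aware datetime of the first observed
    failure, or None when the dependency is currently healthy.
    """
    if dependency_failed_since is None:
        return 200
    now = now or datetime.now(timezone.utc)
    # Within the grace window the outage is treated as transient.
    if now - dependency_failed_since < grace:
        return 200
    return 503
```

With this shape, a 2-minute dependency blip never produces a failed probe, while a sustained outage starts failing probes after the grace window and can still lead to eviction roughly 10 probes later.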
Anti-pattern: "kitchen sink" health check
# ❌ BAD: Validates everything — any single failure triggers eviction
@app.route("/healthz")
def health():
    check_database()  # ← DB blip evicts the instance
    check_redis()     # ← Redis maintenance evicts the instance
    check_storage()   # ← Storage throttling evicts the instance
    return "OK", 200

# ✅ GOOD: Validates only that the app can serve requests
@app.route("/healthz")
def health():
    return "OK", 200

# ✅ BETTER: Separate liveness from readiness
@app.route("/healthz")
def health():
    check_app_process_alive()  # Lightweight check
    return "OK", 200

@app.route("/ready")  # NOT configured as health check path
def ready():
    check_database()
    check_redis()
    return "OK", 200
15. Reproduction notes
- Health check interval is 1 minute by default; eviction happens after ~10 consecutive failures
- The /healthz path must return an HTTP status in the 200-299 range to be considered healthy; any other status code (including 3xx redirects) counts as failure
- Test with 2+ instances to observe differential eviction behavior
- ARR Affinity should be disabled to observe load balancer distribution clearly
- P1v3 plan was used for this experiment; Consumption and Premium plans may have different eviction thresholds
- The in-memory dependency simulation means each instance has independent failure state — this accurately models scenarios where a downstream dependency is unavailable from some instances but not others (e.g., regional DNS issues, network partitions)
- az webapp restart performs a soft restart (process restart, not container recreation) — this is sufficient to re-register the instance with the health check system
- The test application source code is available in the data/app-service/health-check-eviction/ directory