ELB Health Check Failures

1. Summary

Application Load Balancer or Classic Load Balancer health checks fail against Elastic Beanstalk instances, so targets are marked unhealthy and serving capacity drops. The failure is easy to misread: the application may be reachable from a browser or from the instance itself while the specific health contract used by the load balancer is failing.

flowchart TD
    A[Health checks fail] --> B{Why probes fail}
    B --> C[Wrong path, code, or redirect]
    B --> D[App not ready or port mismatch]
    B --> E[Probe blocked by SG or network]
    B --> F[Probe too strict for startup time]
    C --> G[Inspect path protocol and status expectations]
    D --> H[Inspect local health endpoint and port binding]
    E --> I[Inspect SGs routes and target reasons]
    F --> J[Inspect startup time and health timeout]
    G --> K[Validate hypotheses]
    H --> K
    I --> K
    J --> K

Limitations

Some environments use ALB target groups while older ones may use CLB health settings.
Exact reason codes differ slightly by load balancer type.
Use the configured health path, not an assumed default, when validating.

Quick Conclusion

Ask what exact path, code, port, and protocol the load balancer expects.
Most health-check failures are bad health endpoint behavior, slow startup, or blocked traffic.

2. Common Misreadings

"The site loads in a browser, so health checks should pass." The health path may differ.
"Health checks only test availability." Status code and timeout rules matter.
"A redirect is fine for health checks." The ALB does not follow redirects; a 301 or 302 counts as a failure unless the configured success codes include it.
"If one target passes, all should pass." Startup timing and host-local issues can differ.
"Probe failures mean the app is completely down." Some routes can still work while the health contract fails.

3. Competing Hypotheses

ID	Hypothesis	Mechanism	Predictive Signal
H1	Wrong health path or status behavior	Configured path redirects, authenticates, or returns non-success code	Target-health reason reports response-code mismatch
H2	App not ready or wrong port binding	Process is not listening where/when the probe expects	Local health call or port probe fails
H3	Probe path is blocked by network controls	ALB cannot reach target on the probe path	Timeout or unreachable reason plus SG evidence
H4	Probe timing is too strict for startup behavior	App becomes healthy eventually but not within grace/timeout	Failures cluster during startup or deployments
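
The hypotheses above can be tied to the reason codes that `describe-target-health` reports for ALB target groups. The following is a minimal triage sketch, not an official mapping: the reason codes are ones the ALB reports, but the hypothesis assignments and the function name `hypothesis_for_reason` are assumptions made for illustration.

```shell
# Map an ALB target-health reason code to the hypothesis to test first.
# The assignments are a starting point for triage, not a diagnosis.
hypothesis_for_reason() {
    case "$1" in
        Target.ResponseCodeMismatch)
            echo "H1" ;;  # path answered, but with a code outside the matcher
        Target.FailedHealthChecks)
            echo "H2" ;;  # connection failed or the response was malformed
        Target.Timeout)
            echo "H3" ;;  # no answer in time; check SGs first, then H2 and H4
        Elb.InitialHealthChecking|Elb.RegistrationInProgress)
            echo "H4" ;;  # still in the initial check window; relevant to timing
        *)
            echo "unknown" ;;
    esac
}
```

For example, `hypothesis_for_reason Target.Timeout` prints `H3`. A timeout can also come from a slow or unready application, so treat the result as the first hypothesis to test rather than the only one.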

4. What to Check First

Inspect target-health reason codes.

aws elbv2 describe-target-health --target-group-arn $TARGET_GROUP_ARN
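
The full JSON output can be noisy when many targets are registered. As a sketch, the command can be narrowed with `--query` to one row per target and then filtered to the unhealthy ones. The helper names `list_target_health` and `only_unhealthy` are hypothetical, and the filter assumes the tab-separated layout produced by `--output text`.

```shell
# Print one tab-separated row per target: id, port, state, reason, description.
list_target_health() {
    aws elbv2 describe-target-health \
        --target-group-arn "$1" \
        --query 'TargetHealthDescriptions[].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]' \
        --output text
}

# Keep only rows whose state (third column) is not "healthy".
only_unhealthy() {
    awk -F '\t' '$3 != "healthy"'
}
```

Usage: `list_target_health "$TARGET_GROUP_ARN" | only_unhealthy`. Fields that are absent for a target are printed as `None` by the CLI text formatter.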

Confirm target-group health settings.

aws elbv2 describe-target-groups --target-group-arns $TARGET_GROUP_ARN

Pull application and proxy logs.

eb logs --environment-name $ENV_NAME --all
sudo less /var/log/web.stdout.log
sudo less /var/log/nginx/error.log

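Verify local health path behavior from the instance. Print the status code and do not follow redirects, because the load balancer does not follow them either.

curl --silent --show-error --output /dev/null --write-out '%{http_code}\n' --max-time 5 http://127.0.0.1/health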

5. Evidence to Collect

Required Evidence

Target-group health configuration and reason codes.
The exact application health path or default path in use.
Local response of that path from an instance.
App/proxy logs during probe failures.

Useful Context

Recent route, auth, redirect, or middleware changes.
Startup duration and deployment timing.
Security group changes between ALB and instances.

CLI Investigation Commands

aws elbv2 describe-target-groups --target-group-arns $TARGET_GROUP_ARN
aws elbv2 describe-target-health --target-group-arn $TARGET_GROUP_ARN
aws elasticbeanstalk describe-configuration-settings \
    --application-name $APP_NAME \
    --environment-name $ENV_NAME
aws ec2 describe-security-groups --group-ids $ALB_SECURITY_GROUP_ID $INSTANCE_SECURITY_GROUP_ID

CloudWatch Logs Insights Queries

fields @timestamp, @message
| filter @message like /health|upstream|301|302|401|403|404|502|504/
| sort @timestamp asc
| limit 100

6. Validation and Disproof by Hypothesis

H1. Wrong health path or status behavior

Evidence that SUPPORTS

Evidence	Why it supports H1
Target-health reason reports code mismatch	Path responds, but not acceptably
Local call to the configured path returns redirect or auth challenge	Health contract is wrong

Evidence that DISPROVES

Evidence	Why it disproves H1
Configured path returns fast success code	Path behavior is correct
Target-health failure is timeout, not code mismatch	Another issue dominates

Validation Commands

aws elbv2 describe-target-health --target-group-arn $TARGET_GROUP_ARN
curl --silent --show-error --output /dev/null --write-out '%{http_code}\n' --max-time 5 http://127.0.0.1/health

Normal vs Abnormal

Signal	Normal	Abnormal
Health path response	Deterministic success code	Redirect, auth, 4xx, or 5xx
Target-health reason	Healthy	Response-code mismatch
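
To make the "response-code mismatch" check concrete, a local status code can be compared against the target group's success-code matcher. This is a minimal sketch under stated assumptions: the function name `status_matches` is hypothetical, and it handles only the ALB matcher forms of single codes, comma-separated lists, and ranges, with no whitespace inside the matcher string.

```shell
# Return success (exit 0) if HTTP status $1 is accepted by matcher $2.
# The matcher uses the ALB syntax: single codes, comma lists, and ranges,
# for example "200", "200,202", or "200-299".
status_matches() {
    code=$1
    old_ifs=$IFS
    IFS=,
    set -- $2          # split the matcher on commas into positional parameters
    IFS=$old_ifs
    for part in "$@"; do
        case "$part" in
            *-*)
                low=${part%-*}
                high=${part#*-}
                if [ "$code" -ge "$low" ] && [ "$code" -le "$high" ]; then
                    return 0
                fi ;;
            *)
                if [ "$code" = "$part" ]; then
                    return 0
                fi ;;
        esac
    done
    return 1
}
```

For instance, `status_matches 302 "200"` fails while `status_matches 302 "200,302"` succeeds, which mirrors why a redirecting health path is reported as a mismatch under a matcher of `200`. The matcher itself can be read from the `Matcher.HttpCode` field of `describe-target-groups`.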

H2. App not ready or wrong port binding

Evidence that SUPPORTS

Evidence	Why it supports H2
Local health path fails or times out	App is not ready
Logs show bind failure or slow startup	Probe arrives before stable readiness

Evidence that DISPROVES

Evidence	Why it disproves H2
App responds locally on the expected path and port	Startup and binding are fine
Failures continue even after warm steady state	Another issue is primary

Validation Commands

curl --silent --show-error --output /dev/null --write-out '%{http_code}\n' --max-time 5 http://127.0.0.1/health
sudo less /var/log/web.stdout.log

Normal vs Abnormal

Signal	Normal	Abnormal
Local readiness	Fast success	Timeout, refusal, or bind error
Startup profile	Reaches ready in expected window	Slow or failed startup
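
The port-binding half of H2 can be checked directly on the instance by looking for a listening socket. The sketch below assumes the column layout of `ss -ltn` (state first, local address fourth); the function name `port_is_listening` is hypothetical, and the port to check depends on how the platform's proxy and application are configured.

```shell
# Read `ss -ltn` output on stdin and succeed if any socket listens on port $1.
# The local address is the fourth column, e.g. "0.0.0.0:80" or "[::]:80".
port_is_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            n = split($4, parts, ":")      # the port is the last ":" segment
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}
```

Usage: `ss -ltn | port_is_listening 80 && echo listening || echo "not listening"`. A listening socket shows only that something is bound to the port; the local health call is still needed to confirm readiness.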

H3. Probe path is blocked by network controls

Evidence that SUPPORTS

Evidence	Why it supports H3
SG review shows missing ALB-to-instance rule	Probe cannot reach target
Target-health reason is timeout or unreachable	Network path fits the symptom

Evidence that DISPROVES

Evidence	Why it disproves H3
Required SG path exists and local health works	Network block is less likely
Code mismatch rather than timeout is reported	Health contract is more relevant

Validation Commands

aws ec2 describe-security-groups --group-ids $ALB_SECURITY_GROUP_ID $INSTANCE_SECURITY_GROUP_ID
aws elbv2 describe-target-health --target-group-arn $TARGET_GROUP_ARN

Normal vs Abnormal

Signal	Normal	Abnormal
Probe network path	Allowed by SGs and routes	Timeout or unreachable due to blocked path
Target-health reason	Healthy or code-based reason	Timeout/unreachable with network evidence
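
The security group review can be narrowed to the one question that matters for H3: does the instance security group allow the health check port from the ALB security group? The following is a sketch under stated assumptions. The helper names are hypothetical, it considers only rules that reference the ALB security group as a source (not CIDR-based rules), and it does not inspect network ACLs or routes.

```shell
# List ingress rules on the instance SG ($1) that name the ALB SG ($2) as source.
# Output rows are: protocol, from-port, to-port.
ingress_from_alb() {
    aws ec2 describe-security-groups \
        --group-ids "$1" \
        --query "SecurityGroups[].IpPermissions[?UserIdGroupPairs[?GroupId=='$2']][].[IpProtocol,FromPort,ToPort]" \
        --output text
}

# Read those rows on stdin and succeed if TCP port $1 is covered.
# A protocol of "-1" means all traffic, so it covers every port.
rules_cover_port() {
    awk -v port="$1" '
        $1 == "-1" { ok = 1 }
        $1 == "tcp" && $2 + 0 <= port + 0 && port + 0 <= $3 + 0 { ok = 1 }
        END { exit !ok }
    '
}
```

Usage: `ingress_from_alb "$INSTANCE_SECURITY_GROUP_ID" "$ALB_SECURITY_GROUP_ID" | rules_cover_port 80`, replacing `80` with the health check port configured on the target group.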

H4. Probe timing is too strict for startup behavior

Evidence that SUPPORTS

Evidence	Why it supports H4
Failures cluster only during deployments or restarts	App eventually becomes healthy
Startup logs show successful readiness after the probe budget expires	Timing is too tight

Evidence that DISPROVES

Evidence	Why it disproves H4
Target remains unhealthy indefinitely	Not just a timing problem
App is ready quickly yet probes fail	Another issue dominates

Validation Commands

aws elbv2 describe-target-groups --target-group-arns $TARGET_GROUP_ARN
sudo less /var/log/web.stdout.log

Normal vs Abnormal

Signal	Normal	Abnormal
Startup-to-healthy time	Within grace and timeout budget	App recovers only after probe deadline
Incident timing	No special deploy-only failures	Failures concentrated on rollout or restart
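
The "probe budget" in H4 can be estimated with simple arithmetic: a target changes state after a threshold number of consecutive results, spaced one interval apart. The sketch below is an approximation that ignores the per-probe timeout and scheduling jitter, and the function names are hypothetical.

```shell
# Approximate seconds of consecutive results needed to flip a target's state:
# interval ($1) multiplied by the relevant threshold count ($2).
seconds_to_flip() {
    echo $(( $1 * $2 ))
}

# Succeed if the measured startup time ($1, seconds) fits inside the window
# before a target is marked unhealthy (interval $2, unhealthy threshold $3).
startup_fits_budget() {
    [ "$1" -le "$(seconds_to_flip "$2" "$3")" ]
}
```

With an illustrative interval of 15 seconds and an unhealthy threshold of 5, `seconds_to_flip 15 5` prints `75`, so an application that needs 90 seconds to start would not fit. The real values can be read with `aws elbv2 describe-target-groups --target-group-arns "$TARGET_GROUP_ARN" --query 'TargetGroups[0].[HealthCheckIntervalSeconds,HealthCheckTimeoutSeconds,HealthyThresholdCount,UnhealthyThresholdCount]' --output text`.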

7. Likely Root Cause Patterns

Trigger	Root Cause	Evidence	Fix
Route/auth change	Wrong health endpoint semantics	Response-code mismatch	Restore lightweight unauthenticated path
Runtime change	Port or readiness mismatch	Local health failure or bind issue	Align app startup with EB expectations
Security hardening	Blocked ALB-to-target path	Timeout health reason and SG gap	Restore required probe path
Slower startup	Probe too strict	Deploy-only failures and late readiness	Tune startup and health timing carefully

8. Immediate Mitigations

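Restore the last known-good health path if it changed. The option name contains spaces, so the whole setting must be quoted as a single shell argument. In ALB environments the path is typically controlled by HealthCheckPath in the aws:elasticbeanstalk:environment:process:default namespace instead, so confirm which option your environment uses before applying this.

aws elasticbeanstalk update-environment \
    --environment-name "$ENV_NAME" \
    --option-settings "Namespace=aws:elasticbeanstalk:application,OptionName=Application Healthcheck URL,Value=/health"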

If a deployment introduced the issue, roll back to the last healthy application version.
Restore the ALB-to-instance probe path in security groups if it was removed.
If startup timing is the only problem, shorten startup work before relaxing health timing.

9. Prevention

Treat the health endpoint as an operational contract and test it explicitly.
Keep health checks unauthenticated and dependency-light.
Validate target-group health after every routing, auth, or proxy change.
Alert on target-health reason changes, not only environment color.
Benchmark startup duration before rollout-policy changes.
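
Benchmarking startup duration can be as simple as polling the health endpoint after a restart and recording how long it takes to return the expected code. This is a rough sketch with one-second resolution; the function names are hypothetical, and the URL and expected status are whatever the environment's health contract specifies.

```shell
# Fetch the HTTP status of URL $1; curl prints "000" when the connection fails.
probe_status() {
    curl --silent --output /dev/null --max-time 5 --write-out '%{http_code}' "$1"
}

# Poll URL $1 about once per second until it returns status $2 or $3 seconds pass.
# Prints the elapsed seconds on success; returns 1 on timeout.
seconds_until_ready() {
    start=$(date +%s)
    while :; do
        now=$(date +%s)
        elapsed=$(( now - start ))
        if [ "$(probe_status "$1")" = "$2" ]; then
            echo "$elapsed"
            return 0
        fi
        if [ "$elapsed" -ge "$3" ]; then
            return 1
        fi
        sleep 1
    done
}
```

Usage: run `seconds_until_ready http://127.0.0.1/health 200 300` on the instance right after the application process restarts, then compare the result with the health check interval and thresholds configured on the target group.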

Sources

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/health-enhanced.html
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.logging.html
