Health Turns Red After Successful Deploy

1. Summary

The deployment reports success, but the environment transitions from Ok or Warning to Severe shortly after traffic and health checks hit the new version.

flowchart TD
    A[Health Turns Red After Successful Deploy] --> B{Primary branch}
    B --> C1[Startup crash after traffic shift]
    C1 --> D1[Collect logs, metrics, and platform signals]
    B --> C2[Health endpoint contract changed]
    C2 --> D2[Collect logs, metrics, and platform signals]
    B --> C3[Host saturation on new version]
    C3 --> D3[Collect logs, metrics, and platform signals]
    B --> C4[Downstream dependency failure]
    C4 --> D4[Collect logs, metrics, and platform signals]

2. Common Misreadings

Rollback means the old version is broken.
The last event is always the root cause.
A successful upload proves the rollout is safe.
No user-facing 5xx means the deployment was healthy.
Local success makes platform differences irrelevant.

3. Competing Hypotheses

- H1: Startup crash after traffic shift. Primary evidence should show whether the application process exits or fails to start once the new version begins receiving traffic.
- H2: Health endpoint contract changed. Primary evidence should show whether the health check path, expected status code, or response changed in the new version.
- H3: Host saturation on new version. Primary evidence should show whether CPU, memory, or connection limits are exhausted on instances running the new version.
- H4: Downstream dependency failure. Primary evidence should show whether a database, cache, or external service the application calls is failing or timing out.

4. What to Check First

Metrics

Check EB health colors and instance counts during the rollout window.
Check deployment-policy behavior: All at Once, Rolling, Rolling with Additional Batch, Immutable, or Traffic Splitting.
Check whether replacement capacity converged or stalled.
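The deployment policy in effect can be read from the environment configuration rather than assumed. A minimal sketch, assuming the AWS CLI is configured and that the application and environment names are known; `deployment_policy` is a hypothetical helper name, not part of any AWS tooling.

```shell
# Hypothetical helper: print the environment's deployment policy
# (for example AllAtOnce, Rolling, RollingWithAdditionalBatch, or Immutable).
# Reads the DeploymentPolicy option from the aws:elasticbeanstalk:command namespace.
deployment_policy() {
  local app_name="$1" env_name="$2"
  aws elasticbeanstalk describe-configuration-settings \
    --application-name "$app_name" \
    --environment-name "$env_name" \
    --query "ConfigurationSettings[0].OptionSettings[?Namespace=='aws:elasticbeanstalk:command' && OptionName=='DeploymentPolicy'].Value | [0]" \
    --output text
}

# Usage (assumes APP_NAME and ENV_NAME are set):
#   deployment_policy "$APP_NAME" "$ENV_NAME"
```

Knowing the policy up front tells you whether a single bad batch or the whole fleet could have received the new version.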

Logs

Read eb-activity.log for the first failing lifecycle stage.
Read eb-engine.log for package, hook, and appdeploy details.
Read web.stdout.log and nginx/error.log only after confirming the issue reached process start or health checks.
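Once a log bundle is extracted locally, the first error in each file gives a quick ordering of which stage failed first. The sketch below assumes error lines carry an `[ERROR]` or `[error]` marker, as in this playbook's sample log patterns; `first_error_per_log` is a hypothetical helper name, and real platform logs may use other markers.

```shell
# Hypothetical helper: for each readable log file given, print the first line
# that carries an [ERROR] marker (case-insensitive), prefixed with the file
# name and line number. Files without a match print nothing.
first_error_per_log() {
  local f
  for f in "$@"; do
    [ -r "$f" ] || continue
    grep -H -n -m 1 -i -F '[error]' "$f" || true
  done
}

# Usage (paths inside the extracted bundle are an assumption):
#   first_error_per_log var/log/eb-engine.log var/log/nginx/error.log
```

If the platform log's first error predates anything in the application or proxy logs, the failure probably never reached process start.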

Platform Signals

Run eb status $ENV_NAME and save the output of eb events $ENV_NAME before retrying.
Capture the first Warning, Degraded, or Severe transition in UTC.
Record whether the issue is one batch, one instance, or the entire environment.
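Capturing the first unhealthy transition can be done mechanically from saved event output. A minimal sketch, assuming the events were saved as text with each line starting with a `YYYY-MM-DD HH:MM:SS` timestamp in UTC; `first_unhealthy_transition` is a hypothetical helper name.

```shell
# Hypothetical helper: read saved event lines on stdin and print the earliest
# one that reports a transition into Warning, Degraded, or Severe.
# Because every line starts with "YYYY-MM-DD HH:MM:SS", a byte-wise sort is
# also a chronological sort.
first_unhealthy_transition() {
  grep -E 'transitioned from [A-Za-z]+ to (Warning|Degraded|Severe)' \
    | LC_ALL=C sort \
    | head -n 1
}

# Usage (events.txt is an assumed file holding the saved event output):
#   first_unhealthy_transition < events.txt
```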

Signal	Normal	Abnormal	Why it matters
EB events	Clear start, deploy, and stable `Ok` state	Warnings, rollbacks, or unhealthy transitions appear	Separates clean rollouts from failing lifecycle stages
Enhanced health	Brief `Warning` is followed by `Ok`	`Warning`, `Degraded`, or `Severe` persists	Shows whether the failure reached runtime readiness
Platform logs	Ordered deploy steps complete successfully	A hook, deploy step, or bootstrap action exits non-zero	Reveals the first failing stage
Replacement capacity	Desired and in-service counts converge quickly	Launches stall or batches never stabilize	Distinguishes app failure from capacity failure

5. Evidence to Collect

Required Evidence

First symptom timestamp in UTC.
One healthy comparison sample if available.
Relevant EB health status transitions (Ok, Warning, Degraded, Severe).
Exact app version, platform branch, and environment name.

Useful Context

Whether the symptom started after deploy, config change, platform update, or traffic change.
Whether the issue is isolated to one instance, one batch, one subnet, or the full environment.
Any recent changes to health checks, listeners, routes, worker counts, dependencies, or deployment policy.

CLI Investigation Commands

1. Capture deployment state, health, and events

# The EB CLI takes the environment name as a positional argument.
eb status "$ENV_NAME"
eb events "$ENV_NAME"
aws elasticbeanstalk describe-environment-health --environment-name "$ENV_NAME" --attribute-names All

Example output:

Environment details for: $ENV_NAME
  Status: Ready
  Health: Warning
2026-04-07 09:14:11    INFO    Environment update is starting.
2026-04-07 09:15:08    WARN    Environment health has transitioned from Ok to Warning.

Tip

Use the first warning or severe transition as the anchor timestamp for every later query.
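One way to apply the anchor timestamp is to cut every saved log down to a window around it. The sketch below assumes each line of interest begins with a 19-character timestamp in either `YYYY-MM-DD HH:MM:SS` or `YYYY/MM/DD HH:MM:SS` form; `in_window` is a hypothetical helper name, and lines without a leading timestamp (such as stack trace continuations) are dropped.

```shell
# Hypothetical helper: print only lines whose leading timestamp falls inside
# [start, end]. Timestamps are compared as strings, which works because
# "YYYY-MM-DD HH:MM:SS" sorts lexically in time order. Slashes in the date
# are normalized to dashes before comparing.
in_window() {
  local start="$1" end="$2"
  shift 2
  awk -v s="$start" -v e="$end" '
    {
      ts = substr($0, 1, 19)
      gsub("/", "-", ts)
      if (ts >= s && ts <= e) print
    }
  ' "$@"
}

# Usage: keep lines from 09:15:00 to 09:16:00 in two saved files
#   in_window "2026-04-07 09:15:00" "2026-04-07 09:16:00" events.txt error.log
```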

2. Pull deployment and instance logs

eb logs "$ENV_NAME" --all
aws elasticbeanstalk request-environment-info --environment-name $ENV_NAME --info-type tail

Example output:

Logs were saved to /var/folders/.../logs-20260407.zip
INFO: Retrieved tail logs for i-xxxxxxxxxxxxxxxxx

Tip

If platform logs fail before application logs show normal traffic, stay on platform lifecycle hypotheses first.

3. Inspect configuration and replacement activity

aws elasticbeanstalk describe-configuration-settings --application-name $APP_NAME --environment-name $ENV_NAME
aws autoscaling describe-scaling-activities --auto-scaling-group-name $ASG_NAME --max-items 20

Example output:

OptionSettings:
  - Namespace: aws:autoscaling:updatepolicy:rollingupdate
    OptionName: RollingUpdateType
    Value: Health
Activities:
  - Description: Launching a new EC2 instance. StatusCode: Failed

Tip

Read this output against the exact incident window; stale success from another window can mislead you.
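To compare scaling activity against the incident window, it helps to list only failed activities with their start times. A minimal sketch, assuming the AWS CLI is configured and the Auto Scaling group name is known; `failed_scaling_activities` is a hypothetical helper name.

```shell
# Hypothetical helper: list recent scaling activities that ended in Failed,
# one per line, as start time, status, description, and status message.
failed_scaling_activities() {
  local asg_name="$1"
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$asg_name" \
    --max-items 20 \
    --query "Activities[?StatusCode=='Failed'].[StartTime, StatusCode, Description, StatusMessage]" \
    --output text
}

# Usage (assumes ASG_NAME is set):
#   failed_scaling_activities "$ASG_NAME"
```

Any row whose start time falls outside the incident window is likely unrelated and can be set aside.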

Evidence Timeline

sequenceDiagram
    participant EB as Elastic Beanstalk
    participant EC2 as Instance or batch
    participant APP as App process
    EB->>EC2: Start deploy or update step
    EC2->>EC2: Run bootstrap, hooks, and appdeploy
    EC2->>APP: Start process and health checks
    APP-->>EB: Ok, Warning, Degraded, or Severe
    Note over EB,APP: The first failing transition is the primary evidence point

Sample Log Patterns

2026/04/07 09:15:06 [ERROR] Deployment step failed with non-zero exit status
2026/04/07 09:15:17 [ERROR] Environment rollback initiated for i-xxxxxxxxxxxxxxxxx
2026/04/07 09:15:21 [error] 4110#4110: *7 connect() failed (111: Connection refused) while connecting to upstream
2026/04/07 09:15:28 [WARN] Environment health has transitioned from Warning to Severe

CloudWatch Logs Insights Queries with Example Output

Query 1. Find the earliest incident evidence

fields @timestamp, @message
| filter @message like /ELB-HealthChecker|timeout/
| sort @timestamp asc
| limit 20

Example results:

@timestamp	@message
2026-04-07 09:15:06	ELB-HealthChecker
2026-04-07 09:15:17	timeout

Tip

How to Read This: The first row is usually the best root-cause anchor; later rows are often downstream consequences.

Query 2. Find the most visible failure signatures

fields @timestamp, @message
| filter @message like /Unhandled exception|302/
| sort @timestamp desc
| limit 20

Example results:

@timestamp	@message
2026-04-07 09:15:28	302
2026-04-07 09:15:21	Unhandled exception

Tip

How to Read This: Compare these rows with EB health color transitions and deployment or traffic timing before acting.
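Either query can also be run from the CLI so that the exact window and results are saved with the other evidence. The sketch below assumes log streaming to CloudWatch Logs is enabled for the environment and that the caller supplies the log group name and the window as epoch seconds; `run_insights_query` is a hypothetical helper name, and the 30-attempt polling limit is an arbitrary choice.

```shell
# Hypothetical helper: start a Logs Insights query, poll until it finishes,
# then print the results as JSON.
# Arguments: log group name, start (epoch seconds), end (epoch seconds), query string.
run_insights_query() {
  local log_group="$1" start_epoch="$2" end_epoch="$3" query="$4"
  local query_id status attempt=0
  query_id=$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start_epoch" \
    --end-time "$end_epoch" \
    --query-string "$query" \
    --query queryId --output text) || return 1
  while [ "$attempt" -lt 30 ]; do
    status=$(aws logs get-query-results --query-id "$query_id" --query status --output text)
    case "$status" in
      Complete)
        aws logs get-query-results --query-id "$query_id" --output json
        return 0 ;;
      Failed|Cancelled|Timeout)
        echo "query ended with status: $status" >&2
        return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep 2
  done
  echo "query did not complete in time" >&2
  return 1
}

# Usage (the log group name is an assumption; check the environment's actual groups):
#   run_insights_query "/aws/elasticbeanstalk/$ENV_NAME/var/log/nginx/error.log" \
#     1775553300 1775553600 'fields @timestamp, @message | sort @timestamp asc | limit 20'
```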

6. Validation and Disproof by Hypothesis

H1: Startup crash after traffic shift

Confirm:

- web.stdout.log shows the application process exiting or failing to start at or just after the first failing timestamp.
- nginx/error.log shows connect() failed (111: Connection refused) while connecting to upstream, which suggests nothing is listening on the application port.

Disprove:

- The process stays up and serves requests throughout the incident window.
- Another branch has earlier, stronger, and more direct evidence.

H2: Health endpoint contract changed

Confirm:

- Requests from ELB-HealthChecker receive a response outside the expected status code, such as a 302 redirect, while ordinary requests still succeed.
- The release changed the health check path, routing, authentication, or redirect behavior.

Disprove:

- Health check requests return the expected status code within the timeout on the affected instances.
- Another branch has earlier, stronger, and more direct evidence.

H3: Host saturation on new version

Confirm:

- CPU utilization, load average, or request latency on instances running the new version rises sharply at the first failing timestamp.
- Health checks time out while the application process is still running.

Disprove:

- Instance resource metrics stay close to the healthy comparison sample.
- Another branch has earlier, stronger, and more direct evidence.

H4: Downstream dependency failure

Confirm:

- Application logs show timeouts or connection errors to a database, cache, or external service starting at the first failing timestamp.
- The same errors appear across instances and batches at once, including any instances still running the old version.

Disprove:

- Dependency calls succeed from the affected instances during the incident window.
- Another branch has earlier, stronger, and more direct evidence.
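Judging which branch has the earliest evidence is easier with a first-seen timestamp per failure signature. The sketch below is a rough heuristic, not a classifier: the signature strings are taken from this playbook's sample log patterns and queries, `first_seen_by_signature` is a hypothetical helper name, and each input line is assumed to begin with a 19-character timestamp.

```shell
# Hypothetical helper: scan log lines (files or stdin) and print, for each
# known signature that appears, the earliest timestamp at which it was seen.
# Output is "timestamp<TAB>signature", sorted oldest first.
first_seen_by_signature() {
  awk '
    BEGIN {
      n = split("Connection refused,Unhandled exception,ELB-HealthChecker,timeout,non-zero exit status", sig, ",")
    }
    {
      ts = substr($0, 1, 19)
      gsub("/", "-", ts)   # normalize YYYY/MM/DD to YYYY-MM-DD
      for (i = 1; i <= n; i++) {
        if (index($0, sig[i]) > 0 && (!(sig[i] in first) || ts < first[sig[i]])) {
          first[sig[i]] = ts
        }
      }
    }
    END {
      for (i = 1; i <= n; i++) {
        if (sig[i] in first) print first[sig[i]] "\t" sig[i]
      }
    }
  ' "$@" | LC_ALL=C sort
}

# Usage:
#   first_seen_by_signature eb-engine.log error.log web.stdout.log
```

A signature that appears first is a candidate for the primary branch, but it still needs to be checked against the confirm and disprove criteria for that hypothesis.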

7. Likely Root Cause Patterns

A new app version, configuration change, or platform update introduced a failure that only appears once real traffic and health checks reach it.
The earliest warning was ignored and later symptoms obscured the first cause.
A platform, configuration, or dependency assumption drifted from the known-good state.
The environment had too little safety margin for rollout, load, or path changes.

8. Immediate Mitigations

Preserve the first-failure evidence before retrying or restarting anything.
Contain user impact with the smallest safe rollback, scale, or routing change.
Change only one suspected variable at a time and re-check health colors, logs, and metrics.
Confirm that the symptom, not just the dashboard noise, has improved.
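Preserving first-failure evidence can be scripted so it happens before any retry. A minimal sketch, assuming the AWS CLI is configured; `preserve_evidence` and the output file names are hypothetical, and the snapshot covers only events and health, not instance logs.

```shell
# Hypothetical helper: snapshot events and health for an environment into a
# directory named with the current UTC time, then print the directory path.
# Arguments: environment name, optional output directory.
preserve_evidence() {
  local env_name="$1"
  local out_dir="${2:-evidence-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$out_dir" || return 1
  aws elasticbeanstalk describe-events \
    --environment-name "$env_name" --output json > "$out_dir/events.json"
  aws elasticbeanstalk describe-environment-health \
    --environment-name "$env_name" --attribute-names All --output json \
    > "$out_dir/environment-health.json"
  aws elasticbeanstalk describe-instances-health \
    --environment-name "$env_name" --attribute-names All --output json \
    > "$out_dir/instances-health.json"
  echo "$out_dir"
}

# Usage (assumes ENV_NAME is set):
#   preserve_evidence "$ENV_NAME"
```

Running this before a rollback or restart keeps the pre-mitigation state available for later comparison.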

9. Prevention

Keep environment configuration, health checks, and rollout assumptions under version control.
Test the same path in staging with the same platform branch and deployment policy.
Alert on the earliest signal for this failure mode, not only the final outage symptom.
Review baselines regularly so abnormal behavior is obvious during incidents.
