Lab: Instance Degraded Health¶

Reproduce a partial-fleet problem where one Elastic Beanstalk instance remains Degraded while the environment still has enough healthy capacity to serve traffic.

Lab Metadata¶

Attribute	Value
Difficulty	Intermediate
Duration	35 minutes
Tier	Load-balanced web server environment
Failure Mode	One instance reports degraded health because of repeated local failures while peers stay healthy
Skills Practiced	Instance-level health cause analysis, per-instance log comparison, replacement decision making

1) Background¶

1.1 Why this lab exists¶

Fleet-wide dashboards can hide single-instance issues. This lab builds the habit of drilling from environment health to instance health before scaling or redeploying unnecessarily.

1.2 Platform behavior model¶

Enhanced health tracks both environment and individual instance conditions. If one node has repeated local failures, EB can mark that node Degraded or Severe while the environment remains partially available.

1.3 Diagram (Mermaid)¶

flowchart LR
    A[Environment] --> B[Instance A Healthy]
    A --> C[Instance B Degraded]
    A --> D[Instance C Healthy]
    C --> E[Local app or dependency failure]
    E --> F[Per-instance health cause]

2) Hypothesis¶

2.1 Original hypothesis¶

One instance is degraded because its local application behavior differs from the rest of the fleet.

2.2 Causal chain¶

Localized fault on one instance -> repeated health check or app failures on that node -> instance health becomes Degraded -> environment remains available because other nodes stay healthy.

2.3 Proof criteria¶

describe-environment-health shows per-instance health divergence.
The degraded instance has different logs or health cause details than healthy peers.
Replacing the bad instance clears the symptom.

2.4 Disproof criteria¶

All instances exhibit the same failure, meaning the issue is environment-wide rather than instance-local.

3) Runbook¶

Deploy the baseline lab environment.

eb deploy "$ENV_NAME" --staged

Trigger a failure path on only one instance.

bash "trigger.sh"

Review instance-level health.

aws elasticbeanstalk describe-environment-health \
    --environment-name "$ENV_NAME" \
    --attribute-names Status Color Causes InstancesHealth

Compare logs from degraded and healthy nodes.

eb logs --environment-name "$ENV_NAME" --all
sudo less /var/log/eb-activity.log
sudo less /var/log/web.stdout.log

Force instance replacement and confirm recovery.

aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$INSTANCE_ID" \
    --should-decrement-desired-capacity false

4) Experiment Log¶

Time (UTC)	Observation	Evidence
15:00	Fleet starts healthy	`describe-environment-health`
15:06	Single-instance trigger applied	`trigger.sh` output
15:10	One instance moves to `Degraded`, peers remain healthy	`InstancesHealth` output
15:13	Degraded node logs differ from healthy nodes	`eb logs` bundle
15:20	Replacing the bad instance restores full `Ok` state	Auto Scaling replacement evidence

Expected Evidence¶

Before Trigger (Baseline)¶

All instances report healthy.
No divergent health causes.

During Incident¶

One node shows Degraded or Severe.
Its logs contain unique errors or failing requests.
Other nodes continue to serve healthy responses.

After Recovery¶

Replacement node joins and becomes healthy.
Environment health returns to all-healthy capacity.

Evidence Timeline (Mermaid sequence diagram)¶

sequenceDiagram
    participant User
    participant EB as Elastic Beanstalk
    participant Bad as Bad Instance
    participant Good as Healthy Instance
    User->>Bad: Trigger local fault
    Bad-->>EB: Failing health data
    Good-->>EB: Healthy health data
    EB-->>User: One instance degraded
    User->>EB: Replace degraded instance
    EB-->>User: New healthy instance joins

Evidence Chain: Why This Proves the Hypothesis¶

The environment remains partially healthy while one instance diverges. That split, plus the unique logs on the failing node and recovery after replacement, demonstrates a localized instance problem instead of a platform-wide issue.

Clean Up¶

eb terminate "$ENV_NAME"
aws cloudformation delete-stack --stack-name "$STACK_NAME" --region "$AWS_REGION"

Instance Degraded Health

Lab: Instance Degraded Health¶

Lab Metadata¶

1) Background¶

1.1 Why this lab exists¶

1.2 Platform behavior model¶

1.3 Diagram (Mermaid)¶

2) Hypothesis¶

2.1 Original hypothesis¶

2.2 Causal chain¶

2.3 Proof criteria¶

2.4 Disproof criteria¶

3) Runbook¶

4) Experiment Log¶

Expected Evidence¶

Before Trigger (Baseline)¶

During Incident¶

After Recovery¶

Evidence Timeline (Mermaid sequence diagram)¶

Evidence Chain: Why This Proves the Hypothesis¶

Clean Up¶

See Also¶

Sources¶

Lab: Instance Degraded Health¶

Lab Metadata¶

1) Background¶

1.1 Why this lab exists¶

1.2 Platform behavior model¶

1.3 Diagram (Mermaid)¶

2) Hypothesis¶

2.1 Original hypothesis¶

2.2 Causal chain¶

2.3 Proof criteria¶

2.4 Disproof criteria¶

3) Runbook¶

4) Experiment Log¶

Expected Evidence¶

Before Trigger (Baseline)¶

During Incident¶

After Recovery¶

Evidence Timeline (Mermaid sequence diagram)¶

Evidence Chain: Why This Proves the Hypothesis¶

Clean Up¶

Related Playbook¶

See Also¶

Sources¶