Troubleshooting Method¶

Use a repeatable method to avoid random changes during incidents: Observe -> Hypothesize -> Test -> Conclude.

Core Method¶

flowchart LR
    A[Observe] --> B[Hypothesize]
    B --> C[Test]
    C --> D{Evidence Supports?}
    D -- Yes --> E[Conclude and Remediate]
    D -- No --> F[Refine Hypothesis]
    F --> C

Step 1: Observe¶

Collect objective signals before acting:

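Elastic Beanstalk events (`eb events`, or `aws elasticbeanstalk describe-events`).
Health state and causes (`eb health`, or the enhanced health APIs `DescribeEnvironmentHealth` and `DescribeInstancesHealth`).
Logs (`eb logs` tail output or full bundles, plus key log files on the instance).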
CloudWatch metrics and alarms.
CloudFormation stack events for environment lifecycle failures.
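
The event-related signals above can be gathered without changing anything in the environment. The following is a minimal sketch, not a definitive tool: the function names are hypothetical, the `eb-ops` profile is the placeholder used in the Command Reference, and the CloudFormation stack name pattern `awseb-<environment-id>-stack` is an assumption to verify in your account.

```shell
# Read-only observation helpers (hypothetical names).

# Events at WARN severity or higher since a UTC timestamp.
recent_events() {
    local env_name="$1" region="$2" since_utc="$3"
    aws elasticbeanstalk describe-events \
        --environment-name "$env_name" \
        --start-time "$since_utc" \
        --severity WARN \
        --region "$region" --profile "eb-ops" \
        --query 'Events[].[EventDate,Severity,Message]' \
        --output text
}

# Failed resources in the environment's CloudFormation stack.
# Assumption: the stack is named awseb-<environment-id>-stack.
failed_stack_events() {
    local env_name="$1" region="$2" env_id
    env_id="$(aws elasticbeanstalk describe-environments \
        --environment-names "$env_name" \
        --region "$region" --profile "eb-ops" \
        --query 'Environments[0].EnvironmentId' --output text)"
    aws cloudformation describe-stack-events \
        --stack-name "awseb-${env_id}-stack" \
        --region "$region" --profile "eb-ops" \
        --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[Timestamp,LogicalResourceId,ResourceStatusReason]" \
        --output text
}
```

A call such as `recent_events "$ENV_NAME" "$REGION" "<incident start in UTC>"` keeps the first pass focused on warnings and errors inside the incident window.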

Observation checklist:

Exact symptom and first timestamp (UTC).
Blast radius (single endpoint, single instance, whole environment, multiple environments).
Last known-good timestamp.
Recent changes (deploy, config update, scaling activity, platform update).
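
Part of this checklist can be answered from the environment description itself. The sketch below (hypothetical function names, placeholder `eb-ops` profile) prints the current status, health, deployed version label, and last update time, and provides a helper for writing UTC timestamps into incident notes.

```shell
# Current status, health, deployed version, and last update time
# for one environment (function name is hypothetical).
env_snapshot() {
    local env_name="$1" region="$2"
    aws elasticbeanstalk describe-environments \
        --environment-names "$env_name" \
        --region "$region" --profile "eb-ops" \
        --query 'Environments[0].{Status:Status,Health:Health,HealthStatus:HealthStatus,Version:VersionLabel,Updated:DateUpdated}' \
        --output table
}

# UTC timestamp in ISO 8601 form for incident notes.
utc_now() {
    date -u +%Y-%m-%dT%H:%M:%SZ
}
```

If `Updated` falls shortly before the first symptom timestamp, a recent deploy or configuration update is a reasonable first hypothesis to test.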

Step 2: Hypothesize¶

Create one testable hypothesis at a time.

Good hypothesis examples:

"Deploy failed because a postdeploy hook exits non-zero when a file path is missing."
"503 responses occur because the load balancer health check uses the wrong path, so targets are marked unhealthy."
"Latency spike is caused by dependency timeout, not CPU saturation."

Bad hypothesis examples:

"AWS is down."
"The app is probably broken."

Step 3: Test¶

Design low-risk tests that isolate one variable.

Prefer read-only checks first.
Apply one controlled change if required.
Measure immediate impact using health/events/metrics.
Stop and revert if blast radius increases.

Common test patterns:

Validate startup command separately from full deployment.
Compare health check endpoint behavior from instance and via load balancer.
Temporarily scale out to determine whether capacity is the bottleneck.
Re-run deployment with known-good artifact to isolate artifact regressions.
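
As an illustration of the health check comparison pattern, the sketch below requests the same path directly on the instance and through the load balancer, then reports whether the status codes agree. It assumes `curl` is available on the instance; the function names and the URLs in the usage note are hypothetical placeholders.

```shell
# Print only the HTTP status code for a URL ("000" if the request fails).
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Compare the same health path locally and through the load balancer.
# Returns non-zero when the two status codes differ.
compare_health_paths() {
    local local_url="$1" lb_url="$2" local_code lb_code
    local_code="$(http_status "$local_url")"
    lb_code="$(http_status "$lb_url")"
    echo "instance=$local_code load_balancer=$lb_code"
    if [ "$local_code" = "$lb_code" ]; then
        echo "MATCH"
    else
        echo "MISMATCH"
        return 1
    fi
}
```

Run from an instance session, for example `compare_health_paths "http://localhost/health" "https://<environment-cname>/health"`, where `/health` stands in for your configured health check path. A healthy local response with a failing load balancer response points toward the health check configuration or network path rather than the application.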

Step 4: Conclude¶

Conclude only when the evidence chain is consistent: the symptom, the tested hypothesis, and the test outcome all point to the same cause.

Record symptom, hypothesis, test, outcome, and final remediation.
Capture preventive follow-up tasks (monitoring, guardrails, automation).
Link to the playbook used or created.
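
One lightweight way to keep these records is an append-only text log. The sketch below is only an illustration: the function name, field names, and record layout are hypothetical, not a format prescribed by Elastic Beanstalk.

```shell
# Append one incident record to a plain-text log (hypothetical format).
# Arguments: log file, symptom, hypothesis, test, outcome, remediation.
record_conclusion() {
    local log_file="$1"
    {
        printf 'recorded_utc: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        printf 'symptom: %s\n' "$2"
        printf 'hypothesis: %s\n' "$3"
        printf 'test: %s\n' "$4"
        printf 'outcome: %s\n' "$5"
        printf 'remediation: %s\n' "$6"
        printf -- '---\n'
    } >> "$log_file"
}
```

Writing the record while the incident is still fresh keeps rejected hypotheses visible, which shortens the next investigation of a similar symptom.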

EB-Specific Tool Usage Guide¶

`eb events`¶

Best for control-plane chronology and immediate failure clues.
Use first when deployment/update behavior is involved.

`eb health`¶

Best for real-time environment and instance health causes.
Use during runtime incidents and after every remediation action.
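
When the health output needs to be captured for a script or an evidence file rather than watched interactively, the enhanced health APIs return the same causes as structured data. This sketch assumes enhanced health reporting is enabled on the environment; the function names are hypothetical and `eb-ops` is a placeholder profile.

```shell
# Environment-level status and causes (requires enhanced health reporting).
env_health_causes() {
    local env_name="$1" region="$2"
    aws elasticbeanstalk describe-environment-health \
        --environment-name "$env_name" \
        --attribute-names All \
        --region "$region" --profile "eb-ops" \
        --query '{Status:HealthStatus,Color:Color,Causes:Causes}' \
        --output json
}

# Per-instance status and causes.
instance_health_causes() {
    local env_name="$1" region="$2"
    aws elasticbeanstalk describe-instances-health \
        --environment-name "$env_name" \
        --attribute-names All \
        --region "$region" --profile "eb-ops" \
        --query 'InstanceHealthList[].{Id:InstanceId,Status:HealthStatus,Causes:Causes}' \
        --output json
}
```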

`eb logs`¶

Best for pulling instance-side and application logs quickly.
Use after events indicate host, process, or hook-level errors.

CloudWatch Metrics and Alarms¶

Best for trend analysis and correlation with load/latency/error rates.
Use to distinguish transient spikes from persistent degradation.
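
To make the transient-versus-persistent distinction concrete, the sketch below pulls the `EnvironmentHealth` metric for a time window and counts how many one-minute datapoints reach a chosen threshold. It assumes enhanced health reporting is publishing that metric for the environment; the function names and the threshold in the usage note are illustrative choices.

```shell
# EnvironmentHealth datapoints (1-minute maximum) between two UTC timestamps,
# printed as "<timestamp> <value>" lines sorted by time.
env_health_series() {
    local env_name="$1" region="$2" start_utc="$3" end_utc="$4"
    aws cloudwatch get-metric-statistics \
        --namespace "AWS/ElasticBeanstalk" \
        --metric-name "EnvironmentHealth" \
        --dimensions "Name=EnvironmentName,Value=$env_name" \
        --start-time "$start_utc" --end-time "$end_utc" \
        --period 60 --statistics Maximum \
        --region "$region" --profile "eb-ops" \
        --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Maximum]' \
        --output text
}

# Count datapoints at or above a threshold; reads "<timestamp> <value>" lines
# on standard input and prints the count.
count_at_or_above() {
    awk -v t="$1" '$2 + 0 >= t + 0 { n++ } END { print n + 0 }'
}
```

`EnvironmentHealth` uses numeric codes where larger values are worse (0 is OK, 25 is Severe), so `env_health_series "$ENV_NAME" "$REGION" "$START" "$END" | count_at_or_above 15` counts minutes at Warning or worse. One or two hits suggest a transient spike; a long run suggests persistent degradation.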

Evidence Collection Guide¶

flowchart TD
    A[Incident Start Time] --> B[Collect Events]
    B --> C[Collect Health Causes]
    C --> D[Collect Logs]
    D --> E[Collect Metrics and Alarm States]
    E --> F[Correlate by UTC Timestamp]
    F --> G[Build Causal Narrative]
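
The correlation step in the flow above can be sketched as a merge of evidence files into one timeline. This assumes every line in every input file starts with a UTC timestamp in the same ISO 8601 format, so that plain text sorting matches time order; the function name is hypothetical.

```shell
# Merge evidence files into one timeline. Each input line must start with an
# ISO 8601 UTC timestamp; each output line is "<timestamp> [<file>] <rest>".
merge_timeline() {
    local f
    for f in "$@"; do
        awk -v src="$(basename "$f")" \
            '{ ts = $1; $1 = ""; print ts " [" src "]" $0 }' "$f"
    done | LC_ALL=C sort -s -k1,1
}
```

Logs that use local time or a different timestamp layout need to be normalized to UTC first, otherwise the merged order is misleading.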

Minimum evidence bundle:

Environment metadata: name, region, platform branch, tier.
Event timeline excerpt.
Health cause messages and per-instance state.
Key log excerpts from proxy, app, and deployment engine.
CloudWatch metric snapshots around the incident window.
Remediation attempts and outcomes.
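
Several items in this bundle can be captured with read-only AWS CLI calls. The following is a rough sketch under stated assumptions: the function name, output directory layout, and file names are hypothetical, `eb-ops` is a placeholder profile, and log excerpts, metric snapshots, and remediation notes still need to be added separately (for example with `eb logs --all`).

```shell
# Write a partial evidence bundle into a timestamped directory and print its path.
# Arguments: environment name, region, optional base directory (default: current).
collect_evidence() {
    local env_name="$1" region="$2" base_dir="${3:-.}" out_dir
    out_dir="$base_dir/evidence-$env_name-$(date -u +%Y%m%dT%H%M%SZ)"
    mkdir -p "$out_dir" || return 1

    # Environment metadata: name, platform, tier, version label.
    aws elasticbeanstalk describe-environments \
        --environment-names "$env_name" \
        --region "$region" --profile "eb-ops" --output json \
        > "$out_dir/environment.json"

    # Event timeline excerpt (most recent 100 events).
    aws elasticbeanstalk describe-events \
        --environment-name "$env_name" --max-items 100 \
        --region "$region" --profile "eb-ops" --output json \
        > "$out_dir/events.json"

    # Per-instance health state and causes (requires enhanced health).
    aws elasticbeanstalk describe-instances-health \
        --environment-name "$env_name" --attribute-names All \
        --region "$region" --profile "eb-ops" --output json \
        > "$out_dir/instances-health.json"

    # Alarm states for alarms whose names start with "awseb-".
    aws cloudwatch describe-alarms --alarm-name-prefix "awseb-" \
        --region "$region" --profile "eb-ops" --output json \
        > "$out_dir/alarms.json"

    echo "$out_dir"
}
```

Capturing the bundle before the first remediation attempt preserves the pre-change state for later comparison.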

Command Reference¶

eb events "$ENV_NAME" --profile "eb-ops"

eb health "$ENV_NAME" --profile "eb-ops" --refresh

eb logs "$ENV_NAME" --profile "eb-ops"

aws cloudwatch describe-alarms \
    --alarm-name-prefix "awseb-" \
    --profile "eb-ops" \
    --region "$REGION"

Sources¶

https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/troubleshooting.html
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/eb-cli3.html
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/health-enhanced.html
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.logging.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html