RDP and SSH Connection Failures

Symptoms

  • Administrative or workload impact is visible to users or operators.
  • The VM is deployed, but one part of the expected control plane or data path is failing.
  • You need a fast way to narrow the problem before making a risky change.
flowchart TD
    A[RDP and SSH Connection Failures] --> B[Confirm current symptom and blast radius]
    B --> C[Collect platform evidence first]
    C --> D[Collect guest or workload evidence]
    D --> E[Map findings to the most likely hypothesis]
    E --> F[Apply the smallest safe fix]
    F --> G[Validate recovery and prevention actions]

1. Summary

Use this playbook when administrators cannot sign in to a VM over RDP or SSH: connections time out, authentication prompts always fail, or the service is listening but unreachable.

Scope: public or private admin path, Azure Bastion, just-in-time (JIT) access, guest firewall, credential reset, and service health.
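The symptom classes above (timeouts, failing authentication, a listening service that cannot be reached) can often be told apart with a single TCP probe from the admin workstation. The following is a minimal sketch, not part of the original playbook. The function names `classify_probe_status` and `probe_admin_port` are hypothetical, and the sketch assumes bash with `/dev/tcp` support and GNU `timeout` on the probing machine.

```shell
# Map the exit status of a TCP probe to a rough symptom class.
classify_probe_status() {
    case "$1" in
        0)   echo "open" ;;              # handshake completed: look at credentials or the guest service
        124) echo "timeout" ;;           # GNU timeout fired: packets are probably dropped on the path
        *)   echo "refused-or-error" ;;  # fast failure: nothing listening, active reject, or a name error
    esac
}

# Probe host:port with a deadline (default 5 seconds) and print the class.
probe_admin_port() {
    host=$1
    port=$2
    secs=${3:-5}
    timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
    classify_probe_status $?
}

# Example with a placeholder address: probe_admin_port 203.0.113.10 22
```

A `timeout` result usually points toward NSG, route, JIT, or firewall checks, while `open` with failing sign-in points toward credentials or the guest service. Treat the mapping as a first hint, not as proof.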

2. Common Misreadings

| Observation | Often misread as | Actually means |
| --- | --- | --- |
| One failed probe or one stale metric | Total VM outage | The issue may be scoped to one path or one recovery dependency. |
| A successful extension deployment | Guest health is good | Extensions can succeed while the underlying guest service still fails. |
| A recent change record | Guaranteed root cause | Recent changes are strong leads, but they still need proof from evidence. |
| A restart fixes the issue | Permanent resolution | Recovery after restart may only hide the real structural cause. |

3. Competing Hypotheses

| Hypothesis | Likelihood | Key discriminator |
| --- | --- | --- |
| Control-plane or configuration drift | High | Azure resource state no longer matches the intended pattern. |
| Guest OS or agent issue | High | Guest or serial evidence shows service, boot, or firewall failure. |
| Capacity or platform dependency bottleneck | Medium | Metrics or SKU limits explain the symptom better than configuration drift. |
| Security control blocked expected behavior | Medium | NSG, ASG, JIT, or policy state changed before the incident. |
| External dependency issue | Low | VM appears healthy, but a downstream service path is broken. |

4. What to Check First

  1. Review the VM instance view

    az vm get-instance-view \
        --resource-group "$RG" \
        --name "$VM_NAME" \
        --output json

  2. Review the boot diagnostics log

    az vm boot-diagnostics get-boot-log \
        --resource-group "$RG" \
        --name "$VM_NAME"

  3. Review NIC effective security rules

    az network nic list-effective-nsg \
        --resource-group "$RG" \
        --name "$NIC_NAME" \
        --output json

  4. Review recent platform changes

    az monitor activity-log list \
        --resource-group "$RG" \
        --offset 24h \
        --output table
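The effective security rules output is long, so it helps to narrow it to inbound rules and then check whether the admin port falls inside each rule's port range. The following is a hedged sketch. The function names are hypothetical, and the JSON shape (`value[].effectiveSecurityRules[]` with `destinationPortRange` values such as `22-22` or `0-65535`) is an assumption to confirm against your own output.

```shell
# List inbound effective rules on a NIC as a table.
# Assumption: the output has the shape value[].effectiveSecurityRules[].
list_inbound_effective_rules() {
    az network nic list-effective-nsg \
        --resource-group "$1" \
        --name "$2" \
        --query "value[].effectiveSecurityRules[] | [?direction=='Inbound'].{name:name, priority:priority, access:access, ports:destinationPortRange, source:sourceAddressPrefix}" \
        --output table
}

# Succeed if a port falls inside a range string such as "22", "22-22", "0-65535", or "*".
port_in_range() {
    port=$1
    range=$2
    case "$range" in
        '*') return 0 ;;
        *-*) lo=${range%-*}; hi=${range#*-} ;;
        *)   lo=$range; hi=$range ;;
    esac
    [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ]
}

# Example: list_inbound_effective_rules "$RG" "$NIC_NAME"
# Example: port_in_range 3389 "0-65535" && echo "rule covers RDP"
```

Lower priority numbers are evaluated first, so the first matching rule for the admin port and source decides whether the connection is allowed.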
    

5. Evidence to Collect

5.1 KQL Queries

// Control-plane activity on compute resources, for correlation with the incident start
AzureActivity
| where TimeGenerated > ago(24h)
| where ResourceProviderValue =~ "MICROSOFT.COMPUTE"
| project TimeGenerated, OperationNameValue, Caller, ActivityStatusValue, ResourceId
| order by TimeGenerated desc

| Field | Interpretation |
| --- | --- |
| TimeGenerated | Incident sequence and correlation window. |
| Resource identifier | Confirms the signal belongs to the affected VM. |
| Operation or metric value | Explains whether the failure is change-driven, capacity-driven, or guest-driven. |

How to read this

Compare these results with the last known healthy window. A change in shape matters more than a single absolute value.

// Most recent heartbeat per VM (agent liveness)
Heartbeat
| where TimeGenerated > ago(24h)
| summarize LastHeartbeat=max(TimeGenerated) by Computer, _ResourceId
| order by LastHeartbeat asc

| Field | Interpretation |
| --- | --- |
| TimeGenerated | Incident sequence and correlation window. |
| Resource identifier (_ResourceId) | Confirms the signal belongs to the affected VM. |
| LastHeartbeat | Most recent agent report per VM. A gap that begins near the incident start suggests the guest, the agent, or the VM itself stopped, not only the admin path. |

How to read this

Compare these results with the last known healthy window. A change in shape matters more than a single absolute value.

5.2 CLI Investigation

az vm show \
    --resource-group "$RG" \
    --name "$VM_NAME" \
    --show-details \
    --query "{powerState:powerState,vmSize:hardwareProfile.vmSize,priority:priority,provisioningState:provisioningState}" \
    --output json

az network nic show \
    --resource-group "$RG" \
    --name "$NIC_NAME" \
    --query "{ipConfigs:ipConfigurations[].privateIPAddress,acceleratedNetworking:enableAcceleratedNetworking,networkSecurityGroup:networkSecurityGroup.id}" \
    --output json

Interpretation:

  • If the VM is not in a usable power state, fix power and boot issues before guest remediation.
  • If the NIC or NSG binding is wrong, repair the control plane before changing guest settings.
  • If accelerated networking or disk attachment changed after a resize, validate that the new size still supports the intended feature set.
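The first interpretation rule (fix power and boot issues before guest remediation) can be sketched as a small triage helper. This is an illustrative sketch only. The function names and the returned labels are hypothetical, and the display strings such as `VM running` are the values the instance view normally reports, which you should confirm in your environment.

```shell
# Print the power state display string, for example "VM running" or "VM deallocated".
get_power_state() {
    az vm get-instance-view \
        --resource-group "$1" \
        --name "$2" \
        --query "instanceView.statuses[?starts_with(code, 'PowerState/')].displayStatus | [0]" \
        --output tsv
}

# Map a power state to the next triage step (labels are hypothetical).
triage_power_state() {
    case "$1" in
        "VM running")
            echo "check-network-then-guest" ;;   # platform is up: move on to NIC, NSG, then guest
        "VM deallocated"|"VM stopped")
            echo "start-vm-first" ;;             # no guest fix can help until the VM is started
        "VM starting"|"VM stopping"|"VM deallocating")
            echo "wait-and-recheck" ;;           # a transition is in progress
        *)
            echo "review-boot-diagnostics" ;;    # unknown or empty state
    esac
}

# Example: triage_power_state "$(get_power_state "$RG" "$VM_NAME")"
```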

6. Validation and Disproof by Hypothesis

Hypothesis 1: Configuration drift or recent change

Proves if: Evidence clearly shows the current state changed from the last healthy pattern in a way that explains the symptom.

Disproves if: Control-plane, guest, and capacity evidence all remain healthy and consistent with baseline.

Recommended validation steps:

  1. Compare Azure Activity Log against the incident start time.
  2. Compare current configuration with the approved landing zone pattern.
  3. Review the most recent guest evidence, boot log, or connection output.
  4. Re-test the original symptom after the smallest safe correction.

Hypothesis 2: Guest OS or service failure

Proves if: Guest or serial console evidence shows a service, boot, or guest firewall failure while the control plane reports the VM as running and the effective security rules allow the admin port.

Disproves if: The guest service is running and listening on the admin port, and the failure reproduces before traffic reaches the guest.

Recommended validation steps:

  1. Confirm the VM power state and review the boot diagnostics log for boot or service errors.
  2. Use Run Command or the serial console to check that the RDP or SSH service is running and listening.
  3. Review the guest firewall for rules that block the admin port.
  4. Re-test the original symptom after the smallest safe correction.
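For a Linux guest, the service check can be run through Run Command when the SSH path itself is down. This is a sketch under several assumptions: a systemd-based guest, an SSH unit named `sshd` or `ssh`, and the default port 22. The function name `check_guest_ssh` is hypothetical. On Windows guests the equivalent command ID is `RunPowerShellScript`, with a different script.

```shell
# Ask the guest whether the SSH service is active and whether anything listens on port 22.
# Arguments: resource group, VM name.
check_guest_ssh() {
    az vm run-command invoke \
        --resource-group "$1" \
        --name "$2" \
        --command-id RunShellScript \
        --scripts "systemctl is-active sshd || systemctl is-active ssh; ss -tln | grep ':22 ' || echo 'nothing listening on port 22'" \
        --query "value[0].message" \
        --output tsv
}

# Example: check_guest_ssh "$RG" "$VM_NAME"
```

Run Command depends on a working VM agent, so an error or a hang here is itself evidence for a guest or agent problem.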

Hypothesis 3: Capacity limit or SKU mismatch

Proves if: CPU, memory, or disk metrics saturate before the user-visible failure, or the VM size no longer supports a feature the configuration relies on (for example, after a resize).

Disproves if: Metrics stay within the healthy baseline through the incident window, and the current size supports every configured feature.

Recommended validation steps:

  1. Compare CPU, memory, and disk metrics for the incident window against the last known healthy window.
  2. Check the Azure Activity Log for a resize or redeploy before the incident start time.
  3. Validate that the current VM size supports the intended features, such as accelerated networking.
  4. Re-test the original symptom after the smallest safe correction.

Hypothesis 4: Security control blocking the expected path

Proves if: NSG, ASG, JIT, or policy state changed before the incident, and the effective rules deny or omit the admin path.

Disproves if: Effective security rules and JIT state allow the admin source and port, and traffic reaches the guest.

Recommended validation steps:

  1. Review the NIC effective security rules for the admin port and source address.
  2. Check whether a JIT access request is active for the admin source or has expired.
  3. Compare NSG, ASG, JIT, or policy changes in the Azure Activity Log against the incident start time.
  4. Re-test the original symptom after the smallest safe correction.
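Network Watcher IP flow verify can confirm which rule allows or denies a specific admin flow. The following is a hedged sketch: the function name `test_admin_flow` is hypothetical, the addresses in the example are placeholders, and it assumes Network Watcher is enabled in the VM's region.

```shell
# Check whether inbound TCP traffic to the VM's private IP and port is allowed.
# Arguments: resource group, VM name, VM private IP, local port, remote (admin source) IP.
test_admin_flow() {
    az network watcher test-ip-flow \
        --resource-group "$1" \
        --vm "$2" \
        --direction Inbound \
        --protocol TCP \
        --local "$3:$4" \
        --remote "$5:*" \
        --query "{access:access, rule:ruleName}" \
        --output json
}

# Example: test_admin_flow "$RG" "$VM_NAME" 10.0.0.4 22 198.51.100.7
```

A `Deny` result names the rule that blocks the flow, which is stronger evidence than reading the rule list by eye. An `Allow` result moves the investigation toward routing or the guest.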

7. Likely Root Cause Patterns

| Pattern | Evidence | Resolution |
| --- | --- | --- |
| Unsupported feature after resize or redeploy | Settings drift, feature not enabled, or SKU capability mismatch | Move back to a supported size or re-enable the feature with validation. |
| NSG, route, or JIT drift | Effective rules do not match the intended admin or workload path | Repair the policy and document the expected flow. |
| Guest service stopped or corrupted | Serial, extension, or guest evidence points to OS-level failure | Repair the guest service, driver, boot loader, or firewall configuration. |
| Performance bottleneck blamed as an outage | CPU, memory, or disk metrics saturate before the user-visible failure | Resize, retier, or redistribute workload pressure. |

8. Immediate Mitigations

  1. Confirm whether the issue is path-related, policy-related, or credential-related before resetting accounts.
  2. Prefer Azure Bastion and JIT over permanent public RDP or SSH exposure.
  3. Use Run Command or VMAccess extensions only when the platform path works and you need guest-side repair.

Step-by-step resolution:

  1. Stabilize the VM or admin path without erasing forensic evidence.
  2. Correct the control-plane configuration first when Azure intent is clearly wrong.
  3. Apply guest-side repair only after confirming the platform path is healthy.
  4. Re-run the original command, probe, or sign-in flow to verify recovery.
  5. Record the exact evidence that proved the fix, not just that the symptom disappeared.
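Step 4 above often needs a few attempts, because a VM restart or rule change takes time to settle. The helper below is a minimal sketch of a bounded retry. The name `retry_until_ok` is hypothetical, and `$ADMIN_USER` and `$VM_IP` in the example are placeholder variables.

```shell
# Run a command up to N times, sleeping between attempts; succeed on the first success.
# Arguments: attempts, delay in seconds, then the command and its arguments.
retry_until_ok() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        n=$((n + 1))
        # Sleep only if another attempt will follow.
        [ "$n" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Example: retry the original sign-in flow 10 times, 15 seconds apart.
# retry_until_ok 10 15 ssh -o BatchMode=yes -o ConnectTimeout=5 "$ADMIN_USER@$VM_IP" true
```

Keep the attempt count bounded so that a fix that did not work is reported as a failure and does not turn into an indefinite wait.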

CLI commands commonly used during fixes:

az vm run-command invoke \
    --resource-group "$RG" \
    --name "$VM_NAME" \
    --command-id RunShellScript \
    --scripts "sudo systemctl status walinuxagent"

az vm restart \
    --resource-group "$RG" \
    --name "$VM_NAME"

9. Prevention

Prevention checklist

  • [ ] Keep a documented healthy baseline for VM size, NIC, disk, and admin-path settings
  • [ ] Alert on drift in critical VM security and connectivity controls
  • [ ] Test boot diagnostics, Bastion, serial console, and backup restore before production go-live
  • [ ] Review SKU feature compatibility before resize operations
  • [ ] Capture post-incident evidence and turn it into a reusable guardrail
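The first two checklist items can be approximated by saving a small settings snapshot while the VM is healthy and comparing against it later. This is an illustrative sketch, not a full drift-detection design. The function names and the choice of fields are assumptions, and file names in the example are placeholders.

```shell
# Print a compact JSON snapshot of the settings that matter for the admin path.
# Arguments: resource group, VM name.
snapshot_vm_settings() {
    az vm show \
        --resource-group "$1" \
        --name "$2" \
        --query "{vmSize:hardwareProfile.vmSize, nics:networkProfile.networkInterfaces[].id, osDisk:storageProfile.osDisk.name, dataDisks:storageProfile.dataDisks[].name}" \
        --output json
}

# Compare a baseline snapshot file with a current one; return non-zero on drift.
report_drift() {
    if diff -u "$1" "$2"; then
        echo "no drift"
    else
        echo "drift detected"
        return 1
    fi
}

# Example:
#   snapshot_vm_settings "$RG" "$VM_NAME" > baseline.json   # while healthy
#   snapshot_vm_settings "$RG" "$VM_NAME" > current.json    # during an incident
#   report_drift baseline.json current.json
```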
