Skip to content

Scaling Best Practices

Scaling guidance in App Service is a design decision, not only an operational toggle. This document helps you choose the right scaling model based on workload characteristics, architecture constraints, and cost boundaries.

Scaling Objectives

A production scaling strategy should balance:

  • Performance under variable traffic
  • Availability during instance loss or maintenance
  • Cost efficiency at steady state and peak
  • Predictable behavior during sudden load changes

Design judgment layer

Platform documentation explains what scaling options exist. This guide explains when and why to choose each option.

Prerequisites

Before tuning scaling behavior:

  • Establish baseline traffic and latency metrics
  • Define SLOs (availability, p95 latency, error rate)
  • Configure health checks and application telemetry
  • Validate statelessness assumptions for horizontal scaling

Vertical vs Horizontal Scaling Decision

Both scale-up and scale-out are valid. The right choice depends on bottlenecks and application architecture.

Vertical Scaling (Scale Up)

Scale up increases CPU/RAM resources per instance by changing App Service plan SKU.

Use when:

  • CPU or memory pressure is constant even at low instance counts
  • Application has limited horizontal concurrency capability
  • Dependencies impose per-instance connection or session constraints

Limitations:

  • Bigger instances can still become single bottlenecks
  • Some outages affect all workload on the plan
  • Unit cost can rise quickly at higher tiers

Horizontal Scaling (Scale Out)

Scale out increases instance count.

Use when:

  • Workload is mostly stateless
  • Throughput needs are bursty or seasonal
  • You need fault tolerance across multiple instances

Limitations:

  • Session state and cache locality can complicate behavior
  • Downstream systems must tolerate increased parallel calls

Portal view: Scale out (App Service plan) blade

Scale out (App Service plan) blade for a Linux plan with a Refresh button and a Send us your feedback action. The Pricing plan section lists Current plan (Premium v3 P0V3 with a Change link), Price (instance) 0.091 USD/hour (66.065 USD/month), Memory (GB) 4, Maximum scale (instance) 30, Active instance count 1, Maximum available zones 3 with a Get more info link, and a Zone Redundancy checkbox that is unchecked alongside an info banner reading "Minimum of two instances required." The Scaling section explains manual versus autoscale behavior and presents three Scale out method radio buttons — Manual (selected) with the subtitle "Maintain a constant instance count for your application", Automatic with "Platform managed scale out and in based on traffic", and Rules Based with "User defined rules to scale on a schedule or based on any app metric". An Instance count slider is set to 1, an Async scaling checkbox below it is unchecked, and Save and Discard buttons are at the bottom of the blade.

The Scale out blade is the operational expression of the horizontal-scaling decision tree below. The three Scale out method radio buttons separate the three valid strategies in practice: Manual for fixed capacity, Automatic for platform-managed scaling against HTTP traffic, and Rules Based for the custom autoscale rules this guide recommends for CPU and queue-length triggers. The visible defaults here are also the most common single-region anti-patterns called out elsewhere in this document: Active instance count: 1 is a single failure domain, and Zone Redundancy is unchecked because zone redundancy requires a minimum of two instances. Validate this blade matches the design intent — Maximum scale (instance): 30 is the SKU ceiling, but the autoscale max instance count you choose must sit below it with headroom for emergency manual scaling.

Scaling Decision Tree

flowchart TD
    A[Workload latency or saturation issue] --> B{Primary bottleneck identified?}
    B -- CPU or memory per instance --> C[Scale up first]
    B -- Request concurrency or burst traffic --> D[Scale out first]
    C --> E{Still violating SLO?}
    E -- Yes --> F[Combine scale up and scale out]
    E -- No --> G[Keep and monitor]
    D --> H{App stateless and dependency-safe?}
    H -- Yes --> I[Enable autoscale rules]
    H -- No --> J[Refactor state handling or use ARR affinity carefully]
    I --> K[Set min/max and cool-down policies]

Auto-Scale Rules Design

Autoscale rules should be intentional and measurable. Avoid reactive chaos by using small, tested rule sets.

CPU-Based Rules

CPU is useful as a broad saturation indicator.

Example baseline:

  • Scale out when average CPU > 70% for 10 minutes
  • Scale in when average CPU < 35% for 20 minutes

HTTP Queue Length Rules

Queue length is a strong signal when request concurrency exceeds instance capacity.

Use when:

  • CPU is not saturated but response times rise
  • Workloads have blocking I/O and thread-pool pressure

Custom Metrics Rules

Custom metrics are valuable for domain-specific bottlenecks.

Examples:

  • Active background jobs
  • Queue backlog size
  • Domain transaction latency percentile

Avoid conflicting rules

Multiple aggressive rules on different metrics can cause oscillation (thrashing). Use cool-down windows and clear priority logic.

Scale-Out Limits per SKU

App Service scaling limits vary by tier and region capability. Plan limits should be checked before setting autoscale ceilings.

Design recommendations:

  • Set max instance count below hard platform limits
  • Reserve headroom for emergency manual scaling
  • Re-check limits before seasonal traffic periods

Treat limits as design inputs

Do not discover scaling limits during an incident. Validate limits in advance and document expected capacity.

Per-App Scaling

Multiple apps can share one App Service plan. Per-app scaling allows each app to scale independently within plan constraints.

Use per-app scaling when:

  • Shared plan contains workloads with different traffic patterns
  • One app should not over-consume instances needed by others
  • Cost optimization requires plan sharing with controlled isolation

Trade-offs:

  • Capacity planning becomes more complex
  • Noisy-neighbor risk still exists at plan resource level

Local Cache and ARR Affinity Considerations

Scaling strategy is tightly coupled to state behavior.

ARR Affinity

ARR affinity (sticky sessions) pins clients to instances.

  • Helpful for legacy session-in-memory patterns
  • Can hurt even load distribution during scale-out
  • Can cause uneven utilization and tail latency issues

Recommendation:

  • Prefer external distributed session stores over ARR affinity
  • Disable ARR affinity for truly stateless workloads

Local Cache

Local cache can improve read performance for some workloads, but do not treat it as a shared persistence layer.

  • Instance-local cache is not durable
  • Cache warm-up behavior affects scale-out events
  • Ensure cache miss paths are dependency-safe

Stateless first

If scale-out reliability is a goal, design for stateless request processing and externalize mutable state.

Practical Autoscale Configuration Pattern

  1. Set minimum instance count based on baseline traffic and HA needs
  2. Set maximum instance count from cost and limit analysis
  3. Add one primary scale-out metric and one scale-in metric
  4. Test with controlled load profile
  5. Tune thresholds after observing production behavior
# Example: set always-on for production workload
az webapp config set \
    --resource-group $RG \
    --name $APP_NAME \
    --always-on true

# Example: show current App Service plan tier and capacity context
az appservice plan show \
    --resource-group $RG \
    --name $PLAN_NAME

Common Scaling Failure Modes

Mode 1: Scaling Does Not Improve Latency

Likely cause: downstream dependency bottleneck (database, external API).

Action:

  • Add dependency latency telemetry
  • Add retry with backoff and circuit breaker controls
  • Consider dependency-side scaling or caching strategy

Mode 2: Instance Thrashing

Likely cause: overly sensitive thresholds and short cool-down periods.

Action:

  • Increase evaluation period
  • Add scale-in delay
  • Remove duplicate or conflicting rules

Mode 3: Uneven Load Across Instances

Likely cause: sticky sessions and cached state pinned to hot instances.

Action:

  • Reduce ARR affinity usage
  • Externalize session state
  • Validate load-balancer behavior with synthetic traffic

Capacity Planning Baseline

For each app, maintain a simple capacity sheet with:

  • Requests per second at steady and peak windows
  • Per-instance throughput estimate at acceptable latency
  • Safety factor for unexpected demand spikes
  • Maximum expected dependency concurrency

Capacity formula example:

required_instances = (peak_rps / per_instance_rps) * safety_factor

Governance and Review Cadence

  • Monthly review of autoscale metrics and incidents
  • Pre-event scale rehearsal for known peak periods
  • Post-incident scaling retrospective with updated runbook

See Also

Sources