Scaling Best Practices¶

Scaling in App Service is a design decision, not only an operational toggle. This document helps you choose the right scaling model based on workload characteristics, architecture constraints, and cost boundaries.

Scaling Objectives¶

A production scaling strategy should balance:

Performance under variable traffic
Availability during instance loss or maintenance
Cost efficiency at steady state and peak
Predictable behavior during sudden load changes

Design judgment layer

Platform documentation explains what scaling options exist. This guide explains when and why to choose each option.

Prerequisites¶

Before tuning scaling behavior:

Establish baseline traffic and latency metrics
Define SLOs (availability, p95 latency, error rate)
Configure health checks and application telemetry
Validate statelessness assumptions for horizontal scaling
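One way to establish the baseline is to pull plan-level CPU from Azure Monitor before changing anything. The sketch below assumes the Azure CLI is installed and authenticated. The helper name `plan_cpu_baseline`, the one-day window, and the five-minute grain are illustrative choices, not recommendations from the platform.

```shell
# Hypothetical helper: print average CPU for an App Service plan
# over the last day at 5-minute granularity.
# Arguments: resource group, plan name.
plan_cpu_baseline() {
    local rg="$1" plan="$2"
    az monitor metrics list \
        --resource-group "$rg" \
        --resource "$plan" \
        --resource-type "Microsoft.Web/serverfarms" \
        --metric "CpuPercentage" \
        --aggregation Average \
        --interval 5m \
        --offset 1d \
        --output table
}

# Usage (requires an authenticated Azure CLI session):
# plan_cpu_baseline "$RG" "$PLAN_NAME"
```

Latency and error-rate baselines usually come from application telemetry rather than plan metrics, so treat this as only one input to the SLO definition.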

Vertical vs Horizontal Scaling Decision¶

Both scale-up and scale-out are valid. The right choice depends on bottlenecks and application architecture.

Vertical Scaling (Scale Up)¶

Scale up increases CPU/RAM resources per instance by changing App Service plan SKU.

Use when:

CPU or memory pressure is constant even at low instance counts
Application has limited horizontal concurrency capability
Dependencies impose per-instance connection or session constraints

Limitations:

Bigger instances can still become single bottlenecks
Some outages affect all workloads on the plan
Unit cost can rise quickly at higher tiers

Horizontal Scaling (Scale Out)¶

Scale out increases instance count.

Use when:

Workload is mostly stateless
Throughput needs are bursty or seasonal
You need fault tolerance across multiple instances

Limitations:

Session state and cache locality can complicate behavior
Downstream systems must tolerate increased parallel calls
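Both directions map to the same Azure CLI command with different flags. The following is a minimal sketch, assuming an authenticated Azure CLI session. The helper names and the example SKU and instance count are illustrative.

```shell
# Hypothetical helpers wrapping `az appservice plan update`.

# Scale up: move the plan to a larger SKU (more CPU/RAM per instance).
# Arguments: resource group, plan name, target SKU.
scale_up_plan() {
    local rg="$1" plan="$2" sku="$3"
    az appservice plan update \
        --resource-group "$rg" \
        --name "$plan" \
        --sku "$sku"
}

# Scale out: set a fixed instance count for the plan.
# Arguments: resource group, plan name, instance count.
scale_out_plan() {
    local rg="$1" plan="$2" count="$3"
    az appservice plan update \
        --resource-group "$rg" \
        --name "$plan" \
        --number-of-workers "$count"
}

# Usage:
# scale_up_plan  "$RG" "$PLAN_NAME" P1V3
# scale_out_plan "$RG" "$PLAN_NAME" 3
```

A manual instance count set this way is fixed capacity. If an autoscale setting is attached to the plan, expect it to adjust the count again on its next evaluation.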

Portal view: Scale out (App Service plan) blade¶

The Scale out blade is where the horizontal-scaling decision tree below is put into practice. Its three Scale out method radio buttons correspond to the three valid strategies: Manual for fixed capacity, Automatic for platform-managed scaling against HTTP traffic, and Rules Based for the custom autoscale rules this guide recommends for CPU and queue-length triggers.

The visible defaults are also the most common single-region anti-patterns called out elsewhere in this document: Active instance count: 1 is a single failure domain, and Zone Redundancy is unchecked because zone redundancy requires a minimum of two instances.

Validate that this blade matches the design intent. Maximum scale (instance): 30 is the SKU ceiling, but the autoscale max instance count you choose must sit below it, with headroom for emergency manual scaling.

Scaling Decision Tree¶

flowchart TD
    A[Workload latency or saturation issue] --> B{Primary bottleneck identified?}
    B -- CPU or memory per instance --> C[Scale up first]
    B -- Request concurrency or burst traffic --> D[Scale out first]
    C --> E{Still violating SLO?}
    E -- Yes --> F[Combine scale up and scale out]
    E -- No --> G[Keep and monitor]
    D --> H{App stateless and dependency-safe?}
    H -- Yes --> I[Enable autoscale rules]
    H -- No --> J[Refactor state handling or use ARR affinity carefully]
    I --> K[Set min/max and cool-down policies]

Auto-Scale Rules Design¶

Autoscale rules should be intentional and measurable. Avoid reactive chaos by using small, tested rule sets.

CPU-Based Rules¶

CPU is useful as a broad saturation indicator.

Example baseline:

Scale out when average CPU > 70% for 10 minutes
Scale in when average CPU < 35% for 20 minutes
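The baseline above could be expressed with `az monitor autoscale` roughly as follows. This is a sketch under assumptions: the helper name, the `<plan>-autoscale` naming scheme, and the one-instance step size are illustrative and not taken from the original guidance.

```shell
# Hypothetical helper: create an autoscale setting on a plan and attach
# the CPU baseline rules (out at > 70% for 10m, in at < 35% for 20m).
# Arguments: resource group, plan name, min instances, max instances.
create_cpu_autoscale() {
    local rg="$1" plan="$2" min="$3" max="$4"
    local setting="${plan}-autoscale"   # illustrative naming choice

    # Autoscale setting with instance bounds; start at the minimum.
    az monitor autoscale create \
        --resource-group "$rg" \
        --resource "$plan" \
        --resource-type "Microsoft.Web/serverfarms" \
        --name "$setting" \
        --min-count "$min" \
        --max-count "$max" \
        --count "$min"

    # Scale out by one instance when average CPU > 70% for 10 minutes.
    az monitor autoscale rule create \
        --resource-group "$rg" \
        --autoscale-name "$setting" \
        --condition "CpuPercentage > 70 avg 10m" \
        --scale out 1

    # Scale in by one instance when average CPU < 35% for 20 minutes.
    az monitor autoscale rule create \
        --resource-group "$rg" \
        --autoscale-name "$setting" \
        --condition "CpuPercentage < 35 avg 20m" \
        --scale in 1
}

# Usage:
# create_cpu_autoscale "$RG" "$PLAN_NAME" 2 6
```

The longer scale-in window makes the setting quicker to add capacity than to remove it, which is usually the safer asymmetry.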

HTTP Queue Length Rules¶

Queue length is a strong signal when request concurrency exceeds instance capacity.

Use when:

CPU is not saturated but response times rise
Workloads have blocking I/O and thread-pool pressure
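A queue-length rule can be added to an existing autoscale setting in the same way. In this sketch the helper name, the five-minute window, and the ten-minute cool-down are assumptions for illustration. The threshold is passed in because a sensible value depends on measured per-instance concurrency.

```shell
# Hypothetical helper: add a scale-out rule driven by the plan's
# HttpQueueLength metric to an existing autoscale setting.
# Arguments: resource group, autoscale setting name, queue-length threshold.
add_queue_rule() {
    local rg="$1" setting="$2" threshold="$3"
    az monitor autoscale rule create \
        --resource-group "$rg" \
        --autoscale-name "$setting" \
        --condition "HttpQueueLength > ${threshold} avg 5m" \
        --scale out 1 \
        --cooldown 10   # minutes before another scale action; assumed value
}

# Usage (the threshold of 50 is a placeholder, not a recommendation):
# add_queue_rule "$RG" "${PLAN_NAME}-autoscale" 50
```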

Custom Metrics Rules¶

Custom metrics are valuable for domain-specific bottlenecks.

Examples:

Active background jobs
Queue backlog size
Domain transaction latency percentile

Avoid conflicting rules

Multiple aggressive rules on different metrics can cause oscillation (thrashing). Use cool-down windows and clear priority logic.
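A quick way to sanity-check a threshold pair for oscillation is to project what the average metric would be after removing one instance. The sketch below assumes load redistributes evenly across the remaining instances, which is a simplification, and works on whole-number percentages only. The function name is illustrative.

```shell
# would_flap CURRENT_AVG INSTANCES SCALE_OUT_THRESHOLD
# Exits 0 if removing one instance would push the projected average
# metric above the scale-out threshold (assuming even load spread).
would_flap() {
    local avg="$1" n="$2" out_threshold="$3"
    # With one instance there is nothing to scale in to.
    [ "$n" -gt 1 ] || return 1
    # projected = avg * n / (n - 1). Compare without floating point:
    # avg * n > out_threshold * (n - 1)
    [ $(( avg * n )) -gt $(( out_threshold * (n - 1) )) ]
}

# Usage:
# if would_flap 34 2 60; then
#     echo "scale-in at 34% on 2 instances would re-trigger scale-out at 60%"
# fi
```

With two instances at 34% average CPU, the projection after scale-in is 68%. That stays under a 70% scale-out threshold but would exceed a 60% one, so a wider gap between the scale-in and scale-out thresholds matters most at low instance counts.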

Scale-Out Limits per SKU¶

App Service scaling limits vary by tier and region capability. Plan limits should be checked before setting autoscale ceilings.

Design recommendations:

Set max instance count below hard platform limits
Reserve headroom for emergency manual scaling
Re-check limits before seasonal traffic periods

Treat limits as design inputs

Do not discover scaling limits during an incident. Validate limits in advance and document expected capacity.

Per-App Scaling¶

Multiple apps can share one App Service plan. Per-app scaling allows each app to scale independently within plan constraints.

Use per-app scaling when:

Shared plan contains workloads with different traffic patterns
One app should not over-consume instances needed by others
Cost optimization requires plan sharing with controlled isolation

Trade-offs:

Capacity planning becomes more complex
Noisy-neighbor risk still exists at plan resource level

Local Cache and ARR Affinity Considerations¶

Scaling strategy is tightly coupled to state behavior.

ARR Affinity¶

ARR affinity (sticky sessions) pins clients to instances.

Helpful for legacy session-in-memory patterns
Can prevent even load distribution during scale-out
Can cause uneven utilization and tail latency issues

Recommendation:

Prefer external distributed session stores over ARR affinity
Disable ARR affinity for truly stateless workloads
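For a stateless app, affinity can be turned off with `az webapp update`. This is a minimal sketch, assuming an authenticated Azure CLI session. The helper name is illustrative.

```shell
# Hypothetical helper: disable ARR affinity (sticky sessions) for an app.
# Arguments: resource group, app name.
disable_arr_affinity() {
    local rg="$1" app="$2"
    az webapp update \
        --resource-group "$rg" \
        --name "$app" \
        --client-affinity-enabled false
}

# Usage:
# disable_arr_affinity "$RG" "$APP_NAME"
```

Confirm first that no code path depends on in-memory session state, because clients may land on a different instance on every request once affinity is off.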

Local Cache¶

Local cache can improve read performance for some workloads, but do not treat it as a shared persistence layer.

Instance-local cache is not durable
Cache warm-up behavior affects scale-out events
Ensure cache miss paths are dependency-safe

Stateless first

If scale-out reliability is a goal, design for stateless request processing and externalize mutable state.

Practical Autoscale Configuration Pattern¶

Set minimum instance count based on baseline traffic and HA needs
Set maximum instance count from cost and limit analysis
Add one primary scale-out metric and one scale-in metric
Test with controlled load profile
Tune thresholds after observing production behavior

# Example: enable Always On for a production workload
az webapp config set \
    --resource-group "$RG" \
    --name "$APP_NAME" \
    --always-on true

# Example: show the current App Service plan SKU (tier, size, and instance capacity)
az appservice plan show \
    --resource-group "$RG" \
    --name "$PLAN_NAME" \
    --query sku

Common Scaling Failure Modes¶

Mode 1: Scaling Does Not Improve Latency¶

Likely cause: downstream dependency bottleneck (database, external API).

Action:

Add dependency latency telemetry
Add retry with backoff and circuit breaker controls
Consider dependency-side scaling or caching strategy

Mode 2: Instance Thrashing¶

Likely cause: overly sensitive thresholds and short cool-down periods.

Action:

Increase evaluation period
Add scale-in delay
Remove duplicate or conflicting rules

Mode 3: Uneven Load Across Instances¶

Likely cause: sticky sessions and cached state pinned to hot instances.

Action:

Reduce ARR affinity usage
Externalize session state
Validate load-balancer behavior with synthetic traffic

Capacity Planning Baseline¶

For each app, maintain a simple capacity sheet with:

Requests per second at steady and peak windows
Per-instance throughput estimate at acceptable latency
Safety factor for unexpected demand spikes
Maximum expected dependency concurrency

Capacity formula example:

required_instances = ceil((peak_rps / per_instance_rps) * safety_factor)
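The formula can be turned into a small helper that rounds up to a whole instance, since a fractional result still needs a full instance to serve it. The function name and the sample numbers below are illustrative. `awk` is used because POSIX shell arithmetic is integer-only.

```shell
# required_instances PEAK_RPS PER_INSTANCE_RPS SAFETY_FACTOR
# Prints the instance count needed at peak, rounded up.
required_instances() {
    local peak_rps="$1" per_instance_rps="$2" safety_factor="$3"
    awk -v p="$peak_rps" -v r="$per_instance_rps" -v s="$safety_factor" 'BEGIN {
        x = (p / r) * s
        n = int(x)
        if (n < x) n++        # round up to a whole instance
        print n
    }'
}

# Usage: 900 rps peak, 120 rps per instance, 30% safety margin
# required_instances 900 120 1.3   # prints 10
```

Here 900 / 120 = 7.5 instances of raw demand, and the 1.3 safety factor raises that to 9.75, which rounds up to 10. Compare the result against the minimum instance count needed for availability and use the larger of the two.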

Governance and Review Cadence¶

Monthly review of autoscale metrics and incidents
Pre-event scale rehearsal for known peak periods
Post-incident scaling retrospective with updated runbook
