Event Hub / Service Bus Trigger Lag

1. Summary

Trigger lag incidents happen when ingestion rate outpaces effective processing throughput, causing Event Hub partition backlog or Service Bus queue/subscription depth to grow over time. Lag is not only a scale problem; it is often a checkpoint progression problem where functions pull messages but fail to complete fast enough, fail to settle messages, or repeatedly reprocess poison payloads.

This playbook separates broker-side pressure from function-side bottlenecks, then validates whether lag originates from partition imbalance, lock renewal issues, dependency latency, host extension misconfiguration, or poison-message loops. It emphasizes evidence-driven triage so responders avoid over-scaling compute when the bottleneck is downstream I/O or message-level faults.
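As a rough illustration of the rate imbalance described above, the following sketch estimates how fast a backlog grows and how long it would take to drain once throughput recovers. The function names and the sample rates are hypothetical and not taken from any Azure SDK; the backlog figure reuses the sample queue depth shown later in this playbook.

```python
def backlog_growth_per_minute(ingress_per_sec: float, processed_per_sec: float) -> float:
    """Net messages added to the backlog each minute (negative means draining)."""
    return (ingress_per_sec - processed_per_sec) * 60


def minutes_to_drain(backlog: int, ingress_per_sec: float, processed_per_sec: float) -> float:
    """Minutes until the backlog reaches zero, or infinity if it cannot drain."""
    net_drain_per_sec = processed_per_sec - ingress_per_sec
    if net_drain_per_sec <= 0:
        return float("inf")
    return backlog / net_drain_per_sec / 60


if __name__ == "__main__":
    # Hypothetical rates: producers send 1200 msg/s, consumers complete 900 msg/s.
    print(backlog_growth_per_minute(1200, 900))
    # After mitigation the rates are assumed to flip.
    print(round(minutes_to_drain(187420, 900, 1200), 1))
```

An infinite drain time is the signal that scaling or tuning has not yet restored a positive net drain rate.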

Decision Flow

flowchart TD
    A[Alert: trigger lag increasing] --> B{Broker type?}
    B -->|Event Hub| C[Check partition lag distribution]
    B -->|Service Bus| D["Check active/dead-letter counts"]
    C --> E{Lag uniform across partitions?}
    E -->|No| F[Hot partition or partition-key skew]
    E -->|Yes| G[Global throughput bottleneck]
    D --> H{Dead-letter growth?}
    H -->|Yes| I[Poison message handling issue]
    H -->|No| J[Slow processing or lock churn]
    G --> K{Dependency latency elevated?}
    K -->|Yes| L[External bottleneck]
    K -->|No| M["Batch/concurrency misconfiguration"]
    J --> N{Max batch or prefetch tuned?}
    N -->|No| O[Adjust extension settings]
    N -->|Yes| P[Investigate code path and serialization]
    I --> Q[Quarantine poison payloads]
    L --> R[Fail fast and circuit-break slow dependency]
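
The flow above can also be written as a small lookup function, which some teams find handy for runbook automation or chat-ops bots. This is only a sketch: the function name, the parameter names, and the returned strings are illustrative and simply mirror the node labels in the diagram.

```python
def triage(
    broker: str,
    *,
    lag_uniform: bool = True,
    dead_letter_growing: bool = False,
    dependency_latency_elevated: bool = False,
    extension_tuned: bool = True,
) -> str:
    """Return the next step suggested by the decision flow."""
    if broker == "eventhub":
        if not lag_uniform:
            return "Hot partition or partition-key skew"
        if dependency_latency_elevated:
            return "External bottleneck: fail fast and circuit-break slow dependency"
        return "Batch/concurrency misconfiguration"
    if broker == "servicebus":
        if dead_letter_growing:
            return "Poison message handling issue: quarantine poison payloads"
        if not extension_tuned:
            return "Slow processing or lock churn: adjust extension settings"
        return "Slow processing or lock churn: investigate code path and serialization"
    raise ValueError(f"unknown broker: {broker!r}")


if __name__ == "__main__":
    print(triage("eventhub", lag_uniform=False))
    print(triage("servicebus", extension_tuned=False))
```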

Severity guidance

| Condition | Severity | Action priority |
| --- | --- | --- |
| Lag within warning threshold, no customer-visible delay | Sev3 | Investigate the same day and tune config safely |
| Sustained lag growth for more than 30 minutes, processing-delay SLO breached | Sev2 | Mitigate within 30 minutes and protect downstream |
| Critical workflow delay with contractual impact or message loss risk | Sev1 | Immediate containment, incident bridge, controlled throttle |

Signal snapshot

| Signal | Normal | Incident |
| --- | --- | --- |
| Event Hub partition lag | Bounded, oscillating around steady state | One or more partitions continuously increasing |
| Service Bus active message count | Drains between producer bursts | Queue/subscription depth increases monotonically |
| Function processing duration (`FunctionAppLogs` / `requests`) | Stable p95 and bounded max | p95/p99 climb with periodic timeout/retry |
| Dependency latency (`dependencies`) | Low variance, low failure | High tail latency, transient network errors |
| Dead-letter / poison count | Rare and isolated | Sharp increase with repeated same message pattern |

2. Common Misreadings

| Misreading | Why incorrect | Correct interpretation |
| --- | --- | --- |
| "Lag means we only need more instances" | Scaling out does not help if checkpoint does not advance or dependency is bottlenecked | Confirm checkpoint, lock, and dependency health before scaling |
| "No dead-letter means no bad messages" | Poison loops can occur before dead-letter thresholds are reached | Inspect delivery count distribution and repeated exception signatures |
| "Equal CPU usage means healthy consumers" | CPU can stay moderate while waiting on network/database calls | Combine compute metrics with `dependencies` latency and settlement time |
| "Large batch size always improves throughput" | Oversized batches increase per-invocation duration and lock-renew risk | Tune batch and concurrency to keep per-message latency bounded |
| "Partition lag is random noise" | Persistent skew usually indicates key distribution or partition hot spots | Compare partition-level lag trend and key cardinality |

3. Competing Hypotheses

| ID | Hypothesis | Confirming signal | Disproving signal |
| --- | --- | --- | --- |
| H1 | Consumer throughput too low due to batch-size (`maxEventBatchSize`, or `maxBatchSize` in older Event Hubs extension versions) or concurrency misconfiguration | Processing duration increases with low message completion per execution | Throughput remains high and stable despite lag growth |
| H2 | Poison messages repeatedly fail and block checkpoint progression | Repeated exceptions for same message IDs, rising delivery count/dead-letter | No repeated IDs and low exception recurrence |
| H3 | Dependency/network latency to broker or downstream service slows settlement | `dependencies` p95/p99 spikes coincide with lag increase | Dependency latency stable during lag window |
| H4 | Event Hub partition imbalance causes localized backlog growth | One partition dominates lag while others drain | Lag uniformly distributed with no hot partition |
| H5 | Service Bus lock renewal/settlement issues trigger redelivery | Lock lost or settlement errors in `traces`/`FunctionAppLogs` | No lock-related errors and successful completion ratio remains high |
| H6 | Function timeout or long-running handler prevents checkpoint progress | Timeout and cancellation logs coincide with lag acceleration | No timeout/cancellation and fast completion per batch |

4. What to Check First

Confirm whether lag increase is global or isolated (partition, queue, subscription, consumer group).
Check processing duration trend and exception rate in Application Insights over the same period.
Validate extension settings in host.json for Event Hub/Service Bus batch, prefetch, and concurrency.
Inspect dead-letter and delivery count indicators for poison-message behavior.

Quick portal checks

In Event Hub or Service Bus metrics, compare incoming rate versus completed/processed rate.
In Function App monitor view, identify top failing trigger functions and recent execution duration drift.
In Application Insights, inspect dependencies latency to broker and downstream services.

Quick CLI checks

az eventhubs eventhub show --name <event-hub-name> --namespace-name <event-hubs-namespace> --resource-group <resource-group> --output table
az servicebus queue show --name <queue-name> --namespace-name <service-bus-namespace> --resource-group <resource-group> --query "{name:name, activeMessageCount:countDetails.activeMessageCount, deadLetterMessageCount:countDetails.deadLetterMessageCount}" --output table
az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "FunctionAppLogs | where TimeGenerated > ago(30m) | where Message has_any ('lag','checkpoint','lock lost','dead-letter','timeout') | project TimeGenerated, Level, Message | take 30" --output table

Example output

Name                 PartitionCount    MessageRetentionInDays
-------------------  ----------------  ----------------------
orders-telemetry     8                 3

Name                 ActiveMessageCount  DeadLetterMessageCount
-------------------  ------------------  ----------------------
orders-processing    187420              821

TimeGenerated                 Level    Message
----------------------------  -------  ---------------------------------------------------------------------------
2026-04-05T03:25:12Z          Warning  EventHub trigger lag increasing for partition 5; checkpoint delay exceeded threshold
2026-04-05T03:25:20Z          Error    ServiceBusException: MessageLockLost while completing message xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2026-04-05T03:25:41Z          Warning  Processing duration exceeded expected envelope for batch size 512

5. Evidence to Collect

KQL Table Names

Most queries use Application Insights table names (traces, requests, dependencies) with classic columns (timestamp, duration). FunctionAppLogs and AppMetrics are Log Analytics tables and use TimeGenerated. When a query runs against a Log Analytics workspace (for example through az monitor log-analytics query), use the workspace table names AppTraces, AppRequests, and AppDependencies, which use TimeGenerated in place of timestamp and DurationMs in place of duration.

| Source | Query/Command | Purpose |
| --- | --- | --- |
| `FunctionAppLogs` | Trigger execution duration, batch size, checkpoint and lock messages | Detect processing slowdown and settlement failures |
| `dependencies` | Broker/downstream latency and result code trend | Confirm external latency bottleneck |
| `traces` | Exception signatures, lock lost, retry behavior | Identify poison loop and lock renewal patterns |
| `requests` | Invocation duration percentiles by function | Quantify runtime impact window |
| `AppMetrics` | Custom lag metrics by partition/subscription | Measure lag slope and hotspot distribution |
| Event Hub metrics | Incoming messages, outgoing messages, throttle indicators | Distinguish producer burst from consumer shortfall |
| Service Bus metrics | Active, dead-letter, scheduled, transfer dead-letter counts | Validate queue health and poison escalation |
| `host.json` in deployment artifact | Effective extension settings for batch/prefetch/concurrency | Validate tuning assumptions |

6. Validation and Disproof by Hypothesis

H1: Batch and concurrency settings limit effective throughput

Confirming KQL

FunctionAppLogs
| where TimeGenerated > ago(6h)
| where Message has_any ("EventHubTrigger", "ServiceBusTrigger", "Processed", "batch")
| extend BatchSize = toint(extract("batch.*?(\\d+)", 1, Message))
| extend DurationMs = toreal(extract("duration.*?(\\d+\\.?\\d*)", 1, Message))
| summarize AvgBatch=avg(BatchSize), P95Duration=percentile(DurationMs, 95), Runs=count() by FunctionName, bin(TimeGenerated, 15m)
| order by P95Duration desc

Expected output

FunctionName                 TimeGenerated            AvgBatch  P95Duration  Runs
---------------------------  -----------------------  --------  -----------  ----
ProcessOrdersEventHub        2026-04-05T03:15:00Z    512       187000       124
ProcessOrdersEventHub        2026-04-05T03:30:00Z    512       194500       118
ApplyPaymentsServiceBus      2026-04-05T03:30:00Z    256       142200       96

Disproving check

If duration stays low while lag grows, misconfiguration is less likely. Shift analysis to partition skew, poison loops, or upstream producer surge.
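
To judge whether the H1 numbers really indicate a throughput ceiling, it can help to convert batch size and duration into events per second. The sketch below assumes P95Duration is in milliseconds and that each parallel invocation processes one batch at a time; the function name and the parallelism value of 8 (one invocation per partition of the sample event hub) are illustrative assumptions.

```python
def events_per_second(avg_batch: float, duration_ms: float, parallel_invocations: int = 1) -> float:
    """Approximate sustained throughput for batches of avg_batch events."""
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    return avg_batch / (duration_ms / 1000) * parallel_invocations


if __name__ == "__main__":
    # Sample row from the expected output: 512 events in 187000 ms.
    per_invocation = events_per_second(512, 187000)
    print(f"{per_invocation:.2f} events/s per invocation")
    # Assumed parallelism: one concurrent invocation per partition, 8 partitions.
    print(f"{events_per_second(512, 187000, 8):.2f} events/s in total")
```

If the total falls below the incoming event rate reported by the broker metrics, lag will keep growing regardless of how many idle instances exist.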

H2: Poison message retries block forward progress

Confirming KQL

traces
| where timestamp > ago(6h)
| where message has_any ("dead-letter", "poison", "DeliveryCount", "MessageLockLost", "Abandon")
| extend FunctionName=tostring(customDimensions.FunctionName)
| extend MessageId=tostring(customDimensions.MessageId)
| where isnotempty(MessageId)
| summarize Attempts=count(), FirstSeen=min(timestamp), LastSeen=max(timestamp) by FunctionName, MessageId
| where Attempts >= 3
| order by Attempts desc

Expected output

FunctionName              MessageId                               Attempts  FirstSeen                 LastSeen
------------------------  --------------------------------------  --------  ------------------------  ------------------------
ApplyPaymentsServiceBus   xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx    19        2026-04-05T02:40:03Z      2026-04-05T03:31:17Z
ApplyPaymentsServiceBus   xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx    14        2026-04-05T02:51:50Z      2026-04-05T03:26:09Z

Disproving check

If repeated message IDs are absent and the dead-letter count remains flat, poison-message loops are unlikely to be the primary cause; evaluate the throughput and dependency latency hypotheses instead.

H3: Dependency and broker latency regression slows processing

Confirming KQL

dependencies
| where timestamp > ago(6h)
// Event Hubs namespaces are also served from servicebus.windows.net
| where target has_any ("servicebus.windows.net", "database.windows.net", "vault.azure.net")
| summarize Count=count(), Failures=countif(success == false), P95=percentile(duration, 95), P99=percentile(duration, 99) by target, name, bin(timestamp, 15m)
| order by P99 desc

Expected output

target                                  name              Count  Failures  P95     P99
--------------------------------------  ----------------  -----  --------  ------  ------
contoso-bus.servicebus.windows.net      CompleteMessage   1820   96        48000   111000
contoso-hub.servicebus.windows.net      ReceiveMessages   2410   131       39000   94000
contoso-sql.database.windows.net        ExecuteReader     920    67        61000   149000

Disproving check

If dependency latency and failure counts remain steady across the incident window, external systems are less likely to be the root cause; inspect in-process serialization cost or checkpoint logic.

H4: Event Hub partition skew creates hot partitions

Confirming KQL

AppMetrics
| where TimeGenerated > ago(6h)
| where Name has "eventhub_partition_lag"
| extend PartitionId=tostring(Properties.PartitionId)
// AppMetrics stores pre-aggregated measurements in Sum, ItemCount, Min, and Max
| summarize AvgLag=sum(Sum) / sum(ItemCount), MaxLag=max(Max) by PartitionId, bin(TimeGenerated, 15m)
| order by MaxLag desc

Expected output

PartitionId  TimeGenerated            AvgLag   MaxLag
-----------  -----------------------  -------  -------
5            2026-04-05T03:15:00Z    18422    23381
5            2026-04-05T03:30:00Z    19641    24810
2            2026-04-05T03:30:00Z    1211     1922
7            2026-04-05T03:30:00Z    944      1510

Disproving check

If lag is uniform across partitions with similar growth rates, partition skew is not the lead cause; inspect global throughput, dependency bottlenecks, or throttling.
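
One way to make "one partition dominates" concrete is to compare the largest partition lag with the median. The sketch below is illustrative only: the function names and the 5x threshold are assumptions to tune per workload, and the sample values come from the expected output above.

```python
from statistics import median


def skew_ratio(lag_by_partition: dict) -> float:
    """Largest partition lag divided by the median lag across partitions."""
    values = list(lag_by_partition.values())
    if not values:
        raise ValueError("no partitions supplied")
    mid = median(values)
    if mid == 0:
        return float("inf") if max(values) > 0 else 1.0
    return max(values) / mid


def hot_partitions(lag_by_partition: dict, threshold: float = 5.0) -> list:
    """Partition IDs whose lag exceeds threshold times the median lag."""
    mid = max(median(lag_by_partition.values()), 1)
    return sorted(p for p, lag in lag_by_partition.items() if lag > threshold * mid)


if __name__ == "__main__":
    sample = {"5": 24810, "2": 1922, "7": 1510}  # MaxLag at 03:30 in the sample
    print(round(skew_ratio(sample), 1))
    print(hot_partitions(sample))
```

A ratio close to 1 points back toward global throughput or dependency causes rather than key distribution.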

H5: Service Bus lock renewal and settlement failures cause redelivery churn

Confirming KQL

FunctionAppLogs
| where TimeGenerated > ago(6h)
| where Message has_any ("MessageLockLost", "lock expired", "CompleteAsync", "Abandon", "RenewLock")
| extend FunctionName = tostring(FunctionName)
| summarize LockErrors=count(), DistinctOps=dcount(FunctionInvocationId) by FunctionName, bin(TimeGenerated, 15m)
| order by LockErrors desc

Expected output

FunctionName              TimeGenerated            LockErrors  DistinctOps
------------------------  -----------------------  ----------  -----------
ApplyPaymentsServiceBus   2026-04-05T03:15:00Z    84          63
ApplyPaymentsServiceBus   2026-04-05T03:30:00Z    96          71

Disproving check

If lock-related errors are negligible and completion succeeds consistently, settlement churn is unlikely and focus should return to batch and dependency constraints.
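
A quick back-of-the-envelope check can show whether prefetch and concurrency settings are likely to contribute to lock loss. The sketch below assumes that a prefetched message's lock starts when it is fetched and is not renewed while it waits in the local buffer; that is a simplification, and the function names and the 60-second lock duration are hypothetical.

```python
import math


def worst_case_wait_seconds(prefetch_count: int, max_concurrent_calls: int,
                            per_message_seconds: float) -> float:
    """Approximate time until the last prefetched message finishes processing."""
    if max_concurrent_calls <= 0:
        raise ValueError("max_concurrent_calls must be positive")
    waves = math.ceil(prefetch_count / max_concurrent_calls)
    return waves * per_message_seconds


def lock_at_risk(prefetch_count: int, max_concurrent_calls: int,
                 per_message_seconds: float, lock_duration_seconds: float) -> bool:
    """True when the worst-case wait exceeds the lock duration."""
    wait = worst_case_wait_seconds(prefetch_count, max_concurrent_calls, per_message_seconds)
    return wait > lock_duration_seconds


if __name__ == "__main__":
    # 256 prefetched messages, 32 concurrent calls, 10 s per message, 60 s lock (assumed).
    print(worst_case_wait_seconds(256, 32, 10))
    print(lock_at_risk(256, 32, 10, 60))
```

When the check reports risk, lowering the prefetch count or shortening per-message work is usually cheaper to try than scaling out.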

H6: Trigger handler timeout prevents checkpoint advancement

Confirming KQL

FunctionAppLogs
| where TimeGenerated > ago(6h)
| where Message has_any ("FunctionTimeoutException", "Execution was canceled")
| summarize TimeoutCount=count(), LastSeen=max(TimeGenerated) by FunctionName
| order by TimeoutCount desc

Expected output

FunctionName               TimeoutCount  LastSeen
-------------------------  ------------  ------------------------
ProcessOrdersEventHub      51            2026-04-05T03:43:09Z
ApplyPaymentsServiceBus    28            2026-04-05T03:41:55Z

Disproving check

If no timeout or cancellation signatures appear, checkpoint blockage likely originates from message-level failures or lock handling rather than from the function timeout limit.
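
When timeouts do appear, a simple budget calculation indicates how large a batch can be before a single invocation approaches the function timeout. This sketch assumes events in a batch are processed sequentially; the function name, the 50% safety factor, and the sample numbers are illustrative assumptions.

```python
def max_safe_batch_size(timeout_seconds: float, per_event_seconds: float,
                        safety_factor: float = 0.5) -> int:
    """Largest batch that fits inside a fraction of the function timeout."""
    if per_event_seconds <= 0:
        raise ValueError("per_event_seconds must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return max(1, int(timeout_seconds * safety_factor / per_event_seconds))


if __name__ == "__main__":
    # Assumed 300 s timeout and 0.5 s of work per event.
    print(max_safe_batch_size(300, 0.5))
```

The safety factor leaves headroom for dependency tail latency, which is what typically pushes a batch over the limit during an incident.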

Failure Progression Timeline

flowchart LR
    A[Producer rate increases] --> B[Batch pulled by trigger]
    B --> C[Processing slows from dependency latency]
    C --> D["Checkpoint/settlement delayed"]
    D --> E[Lag accumulates]
    E --> F[Retries and lock renewals rise]
    F --> G[Dead-letter and SLA breach risk]

Broker-to-Function Bottleneck Map

flowchart TD
    A["Event Hub / Service Bus broker"] --> B[Trigger extension fetch]
    B --> C[Function execution]
    C --> D[Dependency calls]
    D --> E["Message settlement/checkpoint"]
    E --> F[Offset advances]
    C --> G{"Duration exceeds lock/timeout budget?"}
    G -->|Yes| H[Redelivery or timeout]
    H --> I[Lag increases]
    G -->|No| J[Steady drain]

Correlation Queries for Fast Triage

Lag slope versus processing duration

// Run in the Log Analytics workspace: AppRequests is the workspace name for requests
let lagSeries = AppMetrics
| where TimeGenerated > ago(3h)
| where Name has_any ("eventhub_partition_lag", "servicebus_queue_lag")
// AppMetrics stores pre-aggregated measurements in Sum and ItemCount
| summarize LagValue=sum(Sum) / sum(ItemCount) by bin(TimeGenerated, 5m);
let durationSeries = AppRequests
| where TimeGenerated > ago(3h)
| summarize P95Duration=percentile(DurationMs, 95) by bin(TimeGenerated, 5m);
lagSeries
| join kind=inner durationSeries on TimeGenerated
| project TimeGenerated, LagValue, P95Duration
| order by TimeGenerated asc

TimeGenerated            LagValue  P95Duration
----------------------   --------  -----------
2026-04-05T03:10:00Z     8221      111000
2026-04-05T03:15:00Z     9304      129000
2026-04-05T03:20:00Z     10980     148000
2026-04-05T03:25:00Z     12410     171000

Service Bus lock failures by instance

FunctionAppLogs
| where TimeGenerated > ago(3h)
| where Message has_any ("MessageLockLost", "lock expired", "RenewLock")
| summarize LockErrors=count() by RoleInstance, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

RoleInstance     TimeGenerated            LockErrors
---------------  -----------------------  ----------
instance-01      2026-04-05T03:30:00Z    17
instance-01      2026-04-05T03:25:00Z    14
instance-03      2026-04-05T03:25:00Z    5

Dependency failures during lag acceleration

dependencies
| where timestamp > ago(3h)
// Event Hubs namespaces are also served from servicebus.windows.net
| where target has_any ("servicebus.windows.net", "database.windows.net")
| summarize Failures=countif(success == false), P99=percentile(duration,99) by target, bin(timestamp, 5m)
| order by timestamp desc

target                                  timestamp                Failures  P99
--------------------------------------  -----------------------  --------  ------
contoso-bus.servicebus.windows.net      2026-04-05T03:30:00Z    22        118000
contoso-hub.servicebus.windows.net      2026-04-05T03:30:00Z    19        97000
contoso-sql.database.windows.net        2026-04-05T03:30:00Z    11        146000

Interpretation

When lag slope and p95 duration rise together, throughput is constrained by processing latency. If lock errors concentrate on a subset of instances, prioritize instance-level diagnostics and lock-renew tuning before global scaling.
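
To put a number on "rise together", the exported series can be checked with a Pearson correlation. The sketch below is a plain-Python illustration using the four sample rows from the lag-versus-duration output above; the function name is hypothetical, and a high coefficient on a handful of points is a hint rather than proof of causation.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    lag = [8221, 9304, 10980, 12410]             # LagValue column
    p95_ms = [111000, 129000, 148000, 171000]    # P95Duration column
    print(round(pearson(lag, p95_ms), 3))
```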

7. Likely Root Cause Patterns

| Pattern | Evidence signature | Frequency |
| --- | --- | --- |
| Overlarge batch size with insufficient concurrency | High per-run duration and low completion throughput | High |
| Poison message loop before dead-letter threshold | Repeated exception with same `MessageId` and growing delivery count | High |
| Dependency tail latency or intermittent network errors | `dependencies` p99 surge with correlated lag slope increase | Medium |
| Event Hub partition key skew / hot partition | One partition lag dominates while others remain near baseline | Medium |
| Lock lost and settlement churn in Service Bus | `MessageLockLost` spikes and high redelivery | Medium |

8. Immediate Mitigations

Reduce per-invocation processing time by right-sizing batch and concurrency in host.json and redeploy quickly.

{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "maxEventBatchSize": 256,
      "prefetchCount": 512,
      "batchCheckpointFrequency": 1
    },
    "serviceBus": {
      "maxConcurrentCalls": 32,
      "prefetchCount": 256,
      "maxAutoLockRenewalDuration": "00:10:00"
    }
  }
}
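
Before redeploying, a small pre-flight check can catch settings that contradict each other. The sketch below is illustrative: the function name, the rule that Event Hubs prefetch should not be smaller than the batch size, and the 20x Service Bus prefetch-to-concurrency ratio are assumptions to adapt, not documented limits. It accepts either spelling of the Event Hubs batch-size key.

```python
import json


def check_host_json(text: str, max_prefetch_ratio: int = 20) -> list:
    """Return warnings for extension settings that look risky."""
    extensions = json.loads(text).get("extensions", {})
    warnings = []

    event_hubs = extensions.get("eventHubs", {})
    # Accept either spelling of the batch-size key.
    batch = event_hubs.get("maxEventBatchSize", event_hubs.get("maxBatchSize"))
    prefetch = event_hubs.get("prefetchCount")
    if batch is not None and prefetch is not None and prefetch < batch:
        warnings.append(
            f"eventHubs.prefetchCount ({prefetch}) is below the batch size ({batch})"
        )

    service_bus = extensions.get("serviceBus", {})
    calls = service_bus.get("maxConcurrentCalls")
    sb_prefetch = service_bus.get("prefetchCount")
    if calls and sb_prefetch and sb_prefetch > max_prefetch_ratio * calls:
        warnings.append(
            f"serviceBus.prefetchCount ({sb_prefetch}) is more than "
            f"{max_prefetch_ratio}x maxConcurrentCalls ({calls})"
        )
    return warnings


if __name__ == "__main__":
    sample = '{"extensions": {"eventHubs": {"maxEventBatchSize": 256, "prefetchCount": 100}}}'
    for warning in check_host_json(sample):
        print(warning)
```

Running a check like this in the deployment pipeline keeps emergency tuning from introducing a second problem.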

Quarantine poison messages: for Service Bus, configure maxDeliveryCount to dead-letter repeated failures quickly. For Event Hub, add application-level exception handling to skip poison payloads and log them for offline analysis.

# Check Service Bus dead-letter count and queue health
az servicebus queue show --name <queue-name> --namespace-name <service-bus-namespace> --resource-group <resource-group> --query "{activeMessageCount:countDetails.activeMessageCount, deadLetterMessageCount:countDetails.deadLetterMessageCount}" --output table
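
For the Event Hub case, application-level poison handling might look like the sketch below. It is deliberately independent of the Azure Functions SDK so it can be tested in isolation; NonRetryableError, parse_order, process_batch, and the orderId field are all hypothetical names, and the quarantine list stands in for whatever durable store (blob, table, side queue) the team chooses.

```python
import json
import logging

logger = logging.getLogger("poison")


class NonRetryableError(Exception):
    """Raised for payloads that can never succeed, however often they are retried."""


def parse_order(body: bytes) -> dict:
    """Decode one event body; malformed payloads are classified as non-retryable."""
    try:
        order = json.loads(body)
    except ValueError as exc:  # covers JSON and UTF-8 decoding errors
        raise NonRetryableError("payload is not valid JSON") from exc
    if not isinstance(order, dict) or "orderId" not in order:
        raise NonRetryableError("payload has no orderId")
    return order


def process_batch(bodies: list, handle, quarantine: list) -> int:
    """Process each event; quarantine poison payloads instead of failing the batch."""
    processed = 0
    for body in bodies:
        try:
            handle(parse_order(body))
            processed += 1
        except NonRetryableError as exc:
            logger.warning("quarantined poison payload: %s", exc)
            quarantine.append(body)
    return processed


if __name__ == "__main__":
    quarantined = []
    done = process_batch([b'{"orderId": 1}', b"not json"], print, quarantined)
    print(done, len(quarantined))
```

Exceptions other than NonRetryableError are allowed to propagate, so transient failures still go through the normal retry path.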

Validate broker and downstream latency before scaling out consumers blindly:

az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "AppDependencies | where TimeGenerated > ago(30m) | summarize p95=percentile(DurationMs,95), failures=countif(Success == false) by Target" --output table

Temporarily throttle producer throughput or apply backpressure policy to stop uncontrolled lag growth while draining backlog.

az eventhubs eventhub authorization-rule keys list --name <rule-name> --eventhub-name <event-hub-name> --namespace-name <event-hubs-namespace> --resource-group <resource-group> --output table

Note

Producer throttling is application-specific. The auth key listed above is for connection verification, not throttling. Coordinate with the upstream producer team to reduce send rate.
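
If the producer team needs a starting point, one common client-side approach is a token bucket in front of the send call. The sketch below is a generic illustration rather than part of any Event Hubs or Service Bus SDK; the class name, parameters, and the injectable clock are assumptions made so the behavior can be tested deterministically.

```python
import time


class TokenBucket:
    """Allow at most rate_per_sec sends per second, with bursts up to capacity."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n: float = 1) -> bool:
        """Take n tokens if available; return False when the caller should back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=500, capacity=1000)
    sent = sum(1 for _ in range(1500) if bucket.try_acquire())
    print(f"sent {sent} of 1500 immediately; the rest should wait or be buffered")
```

Lowering rate_per_sec during the incident and restoring it once lag declines gives the producer a controlled dial instead of an on/off switch.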

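Increase the partition count only when partition skew and sustained producer throughput justify a repartitioning plan. Partitions can be added to an existing event hub only on the Premium and Dedicated tiers, and a single hot partition key still maps to one partition after the change.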

az eventhubs eventhub update --name <event-hub-name> --namespace-name <event-hubs-namespace> --resource-group <resource-group> --partition-count 16 --output table

Redeploy with validated extension settings and monitor lag slope every 5 minutes until stable decline is confirmed.

az functionapp deployment source config-zip --name <app-name> --resource-group <resource-group> --src <path-to-package.zip>
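
"Stable decline" can be made explicit with a simple rule over the 5-minute lag samples. The sketch below treats the decline as confirmed when each of the last few intervals shows lower lag than the one before; the function name and the default of three consecutive intervals are assumptions to adjust to the team's own exit criteria.

```python
def is_stable_decline(samples: list, min_consecutive: int = 3) -> bool:
    """True when each of the last min_consecutive intervals shows lower lag."""
    if len(samples) < min_consecutive + 1:
        return False
    recent = samples[-(min_consecutive + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Hypothetical lag readings taken every 5 minutes after redeploying.
    print(is_stable_decline([12410, 12800, 11900, 10400, 9100]))
    print(is_stable_decline([12410, 11000, 11500, 10000]))
```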

9. Prevention

Set explicit lag SLOs (partition, queue, subscription) and alert on slope, not only absolute depth.
Treat poison handling as first-class design: classify non-retryable exceptions and dead-letter quickly.
Keep processing idempotent and checkpoint-safe so retries do not block progress or duplicate effects.
Baseline dependency p95/p99 and enforce timeout budgets so batch runs stay within lock and execution envelopes.
Validate host.json extension tuning in load tests for realistic producer bursts before release.
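
Alerting on slope rather than depth can be prototyped with a least-squares fit over recent samples. The sketch below is one possible formulation: the function names and the threshold are hypothetical, and the sample points reuse the lag values from the correlation output earlier in this playbook, taken at 5-minute intervals.

```python
def lag_slope_per_minute(points: list) -> float:
    """Least-squares slope of (minute, lag) points, in messages per minute."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points)
    if denom == 0:
        raise ValueError("all points share the same timestamp")
    return sum((t - mean_t) * (v - mean_v) for t, v in points) / denom


def should_alert(points: list, slope_threshold: float) -> bool:
    """True when lag grows faster than slope_threshold messages per minute."""
    return lag_slope_per_minute(points) > slope_threshold


if __name__ == "__main__":
    samples = [(0, 8221), (5, 9304), (10, 10980), (15, 12410)]
    print(round(lag_slope_per_minute(samples), 2))
    print(should_alert(samples, slope_threshold=100))  # threshold is an assumed value
```

A slope-based rule fires while the backlog is still small, which is the window in which tuning is cheap.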
