Polling Trigger Runtime Semantics & Failure Scenarios¶
This page is the single operational reference for running db.trigger /
PollTrigger in production. It collects the runtime contract, duplicate
windows, lease and checkpoint timing, scale-out behavior, and recovery
procedures in one place.
If you only read one document before deploying the polling trigger, read this one.
The polling trigger is a pseudo trigger. It is not a native Azure Functions trigger and not a database trigger. It runs as a plain timer trigger that polls the source on every tick. Delivery is at-least-once. Handlers MUST be idempotent.
For the formal contract see Semantics. For the on-disk state format see Checkpoint / Lease Spec.
1. Delivery Guarantee¶
PollTrigger and SqlAlchemySource deliver each row change to your handler
at least once. This is the only delivery guarantee the framework promises.
Concretely:
- A row that remains visible to the source query and satisfies the source preconditions will be delivered to the handler at least once before the checkpoint advances past it. Until the checkpoint commit for the batch containing that row succeeds, the same row may be delivered more than once (process crash, lease loss, retry after a failed commit, etc.).
- After the checkpoint commit succeeds, the framework will not intentionally re-fetch rows at or below that checkpoint on subsequent ticks.
- The framework does not provide exactly-once delivery, cross-instance global deduplication, or cross-database transactional acknowledgment.
This guarantee holds only when the source preconditions in Semantics §1.2 are met: monotonic non-decreasing cursor, stable PK with total ordering, deterministic query.
Handlers must therefore be idempotent. See Idempotent Handler Pattern below.
2. Tick Lifecycle¶
Every timer firing executes a single tick. The runner performs the following
steps in order. Each step has well-defined failure behavior — see
§5 Failure Scenarios.
1. acquire_lease (CAS write to state blob)
2. load_checkpoint (read state blob)
3. for batch_idx in range(max_batches_per_tick):
3a. source.fetch(cursor, batch_size)
3b. if no rows -> break
3c. normalize raw records into RowChange events
3d. invoke handler(events[, context])
3e. commit_checkpoint (CAS write to state blob)
4. release_lease (best effort)
Key invariants:
- The handler is invoked only after fetch + normalize succeed. A fetch failure never reaches the handler.
- The checkpoint is only advanced after the handler returns successfully. Handler return ≠ acknowledgment; checkpoint commit success = acknowledgment.
- All state mutations are CAS writes against the same state blob. A stale owner cannot silently overwrite a newer owner's checkpoint.
3. Checkpoint Commit Timing¶
The checkpoint advances after handler success and only via CAS commit.
| Phase | Checkpoint state | Outcome on crash |
|---|---|---|
| Before fetch | Last committed checkpoint | No effect |
| After fetch, before handler | Last committed checkpoint | Same batch re-fetched on next tick |
| Handler running | Last committed checkpoint | Whole batch re-fetched on next tick |
| Handler returned, before commit | Last committed checkpoint | Whole batch re-delivered on next tick |
| Commit succeeded | New checkpoint | Subsequent ticks resume from new checkpoint |
| Commit ambiguous (timeout) | Unknown | Resolved by reload on next tick; worst case duplicate, never loss |
Implication for replay: any side effects performed by the handler before the commit succeeds may be replayed. Plan side effects accordingly (idempotent writes, upserts, dedup keys).
4. Duplicate Window Reference¶
The polling trigger can produce duplicates (or full re-deliveries of a batch) in the following windows. This is the authoritative list — every "why does my handler see the same row twice" question maps to one of these.
| Window | Cause | Re-delivery scope | Detectable from handler? |
|---|---|---|---|
| W1. Handler success, commit failure | Checkpoint CAS failed (network, transient blob error, lease lost) | Entire batch | No — event is identical |
| W2. Crash after fetch, before handler returns | Process killed mid-batch (instance recycle, OOM, deploy) | Entire batch | No |
| W3. Crash with partial side effects | Handler did N of M writes, then crashed | Entire batch (M writes redone) | No |
| W4. Lease lost during processing | Heartbeat / commit CAS rejected because another instance acquired the lease | Entire batch (next owner re-fetches) | Indirectly — LostLeaseError logged |
| W5. Commit response timeout (ambiguous) | Network timeout on the commit blob write; commit may or may not have persisted | Entire batch (only if commit actually failed) | No |
| W6. Redeploy / restart | New instance starts from last committed checkpoint | Whole or partial in-flight batch | No |
| W7. Cursor precision collapse | Multiple updates within one cursor tick collapse into one | Latest state only — earlier intermediate states lost | Yes (only one event arrives) |
Windows that cannot produce duplicates within this framework:
- A successful commit followed by a clean shutdown. The next tick will only see rows strictly newer than the committed cursor + tiebreaker PK.
- Two ticks racing on the same instance —
lease_ttl_secondsand the single state blob CAS prevent this.
For the matching state-machine view see Semantics §12 Failure Matrix and §13 Duplicate and Reprocessing Windows.
5. Failure Scenarios¶
This section maps each failure to the runtime behavior, the metric / log emitted, and the operator action (if any).
5.1 Lease cannot be acquired¶
Cause: another instance currently holds the lease and it has not expired.
Behavior: tick() returns 0 immediately. LeaseAcquireError is raised
internally, caught, and logged at DEBUG. No handler invocation. No
checkpoint change. The next timer firing retries normally.
Operator action: none required. Multiple instances polling the same source is the expected scale-out shape — exactly one tick runs at a time.
5.2 Checkpoint blob load fails¶
Cause: transient blob storage error, missing container, RBAC issue, network partition.
Behavior: FetchError raised. Handler is not invoked. Lease is
released. Next tick retries.
Operator action: verify storage account connectivity and the RBAC role on
the db-state container.
5.3 Source fetch fails¶
Cause: database unreachable, query error, driver-level exception.
Behavior: FetchError raised. Handler is not invoked. Lease is
released. Next tick retries (the cursor has not advanced).
Operator action: investigate the source database. Polling will resume automatically once fetches succeed.
5.4 Record normalization fails¶
Cause: missing cursor or PK column in returned rows; non-serializable cursor type.
Behavior: SerializationError raised. Handler is not invoked. The
cursor does not advance — the same fetch will fail again next tick. This
is a poison configuration, not a transient failure.
Operator action: fix the source query / table / SqlAlchemySource
configuration. The trigger cannot make progress until this is resolved.
5.5 Handler raises¶
Cause: business logic error, downstream sink unavailable, etc.
Behavior: HandlerError raised. Checkpoint does not advance. Lease is
released. The same batch will be re-fetched and re-delivered on the next tick.
Operator action: if the failure is transient (network, sink restart) the trigger self-heals. If the same batch fails repeatedly you have a poison batch (see §5.9).
5.6 Checkpoint commit fails (lease lost)¶
Cause: another instance acquired the lease while the handler was running and bumped the fencing token.
Behavior: LostLeaseError is raised on commit. Handler results are
discarded; the next owner re-fetches the same batch. This is window W4.
Operator action: none — this is the lease protocol working correctly.
Reduce duplication risk by ensuring lease_ttl_seconds exceeds the worst-case
handler duration; see §7.
5.7 Checkpoint commit fails (other)¶
Cause: transient blob error, ETag mismatch, network timeout.
Behavior: CommitError raised. Handler already ran. The same batch
will be re-delivered on the next tick. This is window W1 (pure failure)
or W5 (timeout — commit may or may not have actually persisted).
Operator action: none required. Handler must be idempotent so the replay is safe.
5.8 Heartbeat / lease loss mid-handler¶
The current implementation does not perform an in-flight heartbeat from inside
a long-running handler. If handler_duration > lease_ttl_seconds:
- Another instance may acquire the lease via the expiry + grace window in
BlobCheckpointStore. - The original handler's commit will fail with
LostLeaseError(window W4). - The new owner will re-fetch and re-deliver the same batch.
Operator action: size lease_ttl_seconds so lease_ttl_seconds > p99
handler duration + commit time + safety margin. See
§7.
5.9 Poison batch / permanent handler failure¶
Cause: a batch deterministically fails the handler (bad row, schema mismatch, downstream rejection that will never recover).
Behavior: the same batch is reprocessed indefinitely. The trigger does not auto-quarantine in MVP.
Operator action (MVP):
- Identify the poison batch from
event=handler_failedlogs (look for the samecheckpoint_afterrepeated across ticks). - Either fix the source data, fix the handler, or manually advance the checkpoint in the state blob to skip past the poison row. Treat manual advance as data loss — log it.
- Track quarantine sink work in the project roadmap.
5.10 Storage / BlobCheckpointStore unavailability¶
Cause: Azure Blob Storage outage, expired credentials, deleted container.
Behavior: lease acquisition fails, ticks no-op (logged at DEBUG/ERROR depending on the failure), no handler invocation, no checkpoint change. Polling resumes automatically when storage becomes available.
Operator action: restore storage access. There is no special recovery required — the last committed checkpoint is intact and polling resumes from it.
6. Partial-Batch Failure Behavior¶
PollTrigger treats a batch as an atomic unit:
- If any row in the batch causes the handler to raise, the entire batch is considered failed and the checkpoint does not advance.
- The handler will be re-invoked on the next tick with the same events
(same
pk,cursor,op,before,after). - There is no row-by-row commit, no quarantine of the failing row, and no automatic batch splitting in MVP.
If you need finer granularity, structure your handler to:
- Catch per-row exceptions inside the handler.
- Route failures to your own dead-letter sink (queue, table, log).
- Let the handler return successfully so the checkpoint advances past the batch.
This pushes the partial-failure decision to the handler, where you can apply business rules.
7. Tuning lease_ttl_seconds and timer interval¶
The two timing knobs you control are:
- The timer schedule on the wrapping
@app.schedule(...)decorator. lease_ttl_secondsonPollTrigger/db.trigger(...)(default120).
Recommended sizing:
lease_ttl_seconds > p99(fetch + handler + commit) + safety_margin (~30s)
timer_interval >= lease_ttl_seconds / 2
Reasoning:
lease_ttl_secondsmust outlast the worst-case tick. If the handler runs longer than the TTL, another instance can steal the lease and you fall into window W4.- The timer interval should be at least half the TTL so a single instance
comfortably renews ownership across ticks. Faster timers under contention
just produce more
LeaseAcquireErrorno-ops. - Use the lease grace window in
BlobCheckpointStore(min(ttl * 0.5, 5s)) as your buffer for clock skew.
If you cannot bound your handler duration, split the work: write events to a queue inside the handler and process them asynchronously downstream.
8. Timer Overlap and Scale-Out¶
8.1 Timer overlap on a single instance¶
Azure Functions timers can overlap if use_monitor=False and a tick takes
longer than the schedule interval. With the polling trigger this is safe:
- The first tick holds the lease.
- The second (overlapping) tick calls
acquire_lease, getsLeaseAcquireError, and immediately returns0. No handler runs. No checkpoint change.
We still recommend use_monitor=True so the Functions host serializes ticks
where possible.
8.2 Multiple instances polling the same source¶
This is the supported scale-out shape and the purpose of the lease.
- All instances point at the same state blob (same poller name, same
checkpoint store). The
source_fingerprintfield guarantees they agree on the source definition. - On every tick, each instance attempts
acquire_lease. Exactly one wins. The losers no-op. - If the winner's handler runs longer than
lease_ttl_seconds, the lease becomes stealable after the grace window. A new instance takes over and the original owner's commit is rejected via fencing token mismatch (W4).
Do not point multiple instances at different state blobs for the same source. That deliberately runs the source twice and produces duplicates by construction.
8.3 Multiple pollers on the same source¶
If you want independent consumers of the same source (e.g. a backfill
poller alongside the live one), give them distinct name values. Each
gets its own state blob and its own checkpoint. They do not coordinate, and
each delivers every row independently.
9. Idempotent Handler Pattern¶
Because every duplicate window in §4 replays
the same RowChange events, you can dedupe with a stable key derived from
the event itself. The recommended dedup key is:
event.pk is the source's primary key dict. event.cursor is the value of
the source's cursor column at the time the event was emitted. Together they
uniquely identify a single source-side state, even if the row is updated
again later.
Three idiomatic patterns:
9.1 Upsert at the sink¶
@db.trigger(arg_name="events", source=source, checkpoint_store=checkpoint_store)
@db.output(
"out",
url="%DEST_DB_URL%",
table="processed_orders",
action="upsert",
conflict_columns=["order_id", "cursor"],
)
def orders_poll(timer, events: list[RowChange], out: DbOut) -> None:
out.set([
{
"order_id": e.pk["id"],
"cursor": str(e.cursor),
"after": e.after,
}
for e in events
])
Replays collide on (order_id, cursor) and become no-ops.
9.2 Processed-events table¶
Maintain a processed_events(poller_name, pk_hash, cursor) table with a
unique constraint. In a transaction:
- Insert the dedup row. If it conflicts, skip (already processed).
- Apply the side effect.
This works for sinks that are not natively upsert-friendly (HTTP APIs, external systems).
9.3 Downstream idempotency key¶
Pass f"{poller_name}:{pk_hash}:{cursor}" to a downstream system that
supports idempotency keys (Stripe, many message brokers, custom HTTP APIs).
10. Operational Checklist¶
A short summary is below. The full pre-deployment checklist — including runbook items, observability alert thresholds, and pool-configuration checks — lives in Production Checklist — Polling Trigger.
Before promoting a db.trigger to production:
- [ ] Confirm
lease_ttl_seconds > p99 handler duration + commit time + 30s. - [ ] Confirm timer interval ≥
lease_ttl_seconds / 2. - [ ] Confirm the handler is idempotent under §4.
- [ ] Confirm
BlobCheckpointStorehas its own dedicated container and the Function App identity has only the RBAC needed for that container. - [ ] Confirm the source query has a stable
ORDER BY cursor ASC, pk ASCand the cursor column is monotonically non-decreasing. - [ ] Confirm metrics are wired (
MetricsCollector) andfailures_total,lag_seconds,last_success_timestamphave alerts. - [ ] Confirm the runbook covers manual checkpoint advance for poison batches.
11. See Also¶
- Production Checklist — Polling Trigger — full pre-deployment checklist with runbook items.
- Semantics — formal delivery contract, ordering, cursor, failure matrix.
- Checkpoint / Lease Spec — on-disk state format, CAS algorithm, fencing tokens.
- Architecture — overall component layout.
- ADR-004 At-Least-Once Default — why this is the chosen guarantee level.