Production Checklist: Polling Trigger
This page is the pre-deployment checklist for running db.trigger /
PollTrigger in production. It is the operator-runnable companion to
Polling Runtime & Failure Scenarios and
EngineProvider & Pooling Guidance. Walk
through every item below before promoting a polling trigger to production.
If you skipped the runtime semantics doc, read at least §1 Delivery Guarantee and §4 Duplicate Window Reference first. None of the items here make sense in isolation.
1. Handler correctness
- [ ] The handler is idempotent. A redelivery of any `RowChange` produces the same final state at the sink. Verify with the duplicate window reference.
- [ ] A dedup key is documented. The recommended default is `(poller_name, event.pk, event.cursor)`. If the sink does not natively support upsert, a `processed_events` table with a unique constraint on the dedup key is in place.
- [ ] No partial in-batch state survives a handler exception. If the handler raises mid-batch, any side effects already performed are either transactional (rolled back), idempotent on replay, or explicitly routed to a dead-letter sink.
- [ ] Async handlers offload blocking work correctly. If the handler is `async def`, it does not block the event loop on long sync work (the package already wraps DB calls in `asyncio.to_thread`; user code must do the same for its own blocking calls).
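As a rough illustration of the dedup-key item above, the following sketch records `(poller_name, event.pk, event.cursor)` in a `processed_events` table inside the same transaction as the side effect. It uses an in-memory SQLite database as a stand-in for the real sink; the `apply_once` helper, the `ledger` table, and the column names `event_pk` / `event_cursor` are hypothetical and not part of the package.

```python
import sqlite3


def apply_once(conn, poller_name, event_pk, event_cursor, side_effect):
    """Run side_effect(conn) only if this dedup key has not been recorded yet.

    Hypothetical helper: the dedup marker and the side effect share one
    transaction, so they commit together or roll back together.
    """
    dedup_key = (poller_name, str(event_pk), str(event_cursor))
    with conn:  # commits on success, rolls back if side_effect raises
        inserted = conn.execute(
            "INSERT OR IGNORE INTO processed_events "
            "(poller_name, event_pk, event_cursor) VALUES (?, ?, ?)",
            dedup_key,
        ).rowcount
        if inserted == 0:
            return False  # redelivery: the key already exists, skip the side effect
        side_effect(conn)
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_events ("
    "poller_name TEXT, event_pk TEXT, event_cursor TEXT, "
    "UNIQUE (poller_name, event_pk, event_cursor))"
)
# Append-only table: without dedup, a redelivery would add a second row.
conn.execute("CREATE TABLE ledger (order_id TEXT, amount INTEGER)")


def credit(c):
    c.execute("INSERT INTO ledger (order_id, amount) VALUES ('42', 100)")


first = apply_once(conn, "orders", 42, "2024-05-01T00:00:00Z", credit)
replay = apply_once(conn, "orders", 42, "2024-05-01T00:00:00Z", credit)
ledger_rows = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
```

Because the marker insert and the side effect share a transaction, a handler exception leaves neither behind, which is one way to satisfy the "no partial in-batch state" item for sinks that live in the same database.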
2. Source design
- [ ] The cursor column is monotonically non-decreasing on every mutation you care about. `created_at` alone is not sufficient if rows are mutated in place; use `updated_at` maintained by a `BEFORE INSERT OR UPDATE` trigger, a `version` column, or an outbox pattern. See Semantics §1.2.
- [ ] The cursor column is indexed. A composite index on `(cursor_column, pk_columns...)` is present so the source query `WHERE (cursor, pk) > (last_cursor, last_pk) ORDER BY cursor, pk LIMIT batch_size` runs as an index scan, not a sort over the whole table.
- [ ] The PK columns are stable and totally orderable. Tuples of stable surrogate keys (BIGINT, UUID v7) are fine; mutable natural keys are not.
- [ ] Hard deletes are accounted for. If the source allows hard deletes, you have either a soft-delete column, a tombstone table, or accept that hard deletes are not detected by the polling trigger. See Semantics §4.
- [ ] Backfill uses a separate `name` and a separate state blob. Do not point a backfill poller at the live poller's checkpoint. See Semantics §11.
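The keyset query described in the indexing item can be sketched as follows. This is a minimal illustration against in-memory SQLite, not the package's actual source query: the `orders` table, its columns, the index name, and the `fetch_batch` helper are all assumptions. Two rows share a cursor value to show why the PK is needed as a tie-breaker across batch boundaries.

```python
import sqlite3


def fetch_batch(conn, last_cursor, last_pk, batch_size):
    """Fetch the next batch strictly after the (cursor, pk) checkpoint."""
    return conn.execute(
        "SELECT updated_at, id FROM orders "
        "WHERE (updated_at, id) > (?, ?) "
        "ORDER BY updated_at, id LIMIT ?",
        (last_cursor, last_pk, batch_size),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at INTEGER)")
# Composite index matching the ORDER BY of the source query.
conn.execute("CREATE INDEX ix_orders_cursor ON orders (updated_at, id)")
conn.executemany(
    "INSERT INTO orders (id, updated_at) VALUES (?, ?)",
    [(3, 10), (7, 10), (1, 20), (2, 30), (9, 30)],
)

# Drain the table in batches of 2, advancing the checkpoint after each batch.
checkpoint = (0, 0)
delivered_ids = []
while True:
    batch = fetch_batch(conn, checkpoint[0], checkpoint[1], batch_size=2)
    if not batch:
        break
    delivered_ids.extend(row[1] for row in batch)
    checkpoint = batch[-1]  # (updated_at, id) of the last row in the batch
```

Note that the second batch ends at `(30, 2)` and the third still picks up `(30, 9)`; comparing on the cursor alone would either skip or re-deliver that row.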
3. Lease and timer sizing
- [ ] `lease_ttl_seconds > p99(fetch_ms + handler_ms + commit_ms) + 30s`.
- [ ] `timer_interval >= lease_ttl_seconds / 2`.
- [ ] `batch_size` chosen so that one batch's worst-case handler duration stays well below `lease_ttl_seconds`. The default `100` is a starting point; lower it before you raise `lease_ttl_seconds`.
- [ ] `max_batches_per_tick` matches your throughput needs. Increasing it raises tick duration linearly, so recompute the lease budget if you change it.
- [ ] You have measured `p99` handler duration in a load test or in production with a low-traffic poller, not just guessed it. The runtime emits `azfdb_handler_duration_ms` as a metric (see §6).
See the formula and reasoning in Polling Runtime §7.
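The two inequalities above can be checked mechanically once you have measured tick durations. The sketch below is illustrative only: `p99` and `check_lease_budget` are hypothetical helpers, the samples are assumed to be total per-tick durations (fetch + handler + commit) already converted to seconds, and the percentile uses the simple nearest-rank method.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceiling of 0.99 * n
    return ordered[rank - 1]


def check_lease_budget(tick_durations_s, lease_ttl_seconds,
                       timer_interval_seconds, margin_s=30.0):
    """Evaluate both checklist inequalities for measured tick durations."""
    worst = p99(tick_durations_s)
    return {
        "p99_tick_s": worst,
        # lease_ttl_seconds > p99(fetch + handler + commit) + 30s
        "lease_ok": lease_ttl_seconds > worst + margin_s,
        # timer_interval >= lease_ttl_seconds / 2
        "timer_ok": timer_interval_seconds >= lease_ttl_seconds / 2,
    }


# 100 measured ticks: mostly 20 s, with two slow outliers.
samples = [20.0] * 98 + [45.0, 80.0]
ok = check_lease_budget(samples, lease_ttl_seconds=90, timer_interval_seconds=60)
too_tight = check_lease_budget(samples, lease_ttl_seconds=60, timer_interval_seconds=60)
```

With these samples the p99 is 45 s, so a 90 s lease clears the 30 s margin while a 60 s lease does not.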
4. Engine and pool configuration
- [ ] A module-level `EngineProvider` is shared across the source and every binding that targets the same database.
- [ ] `engine_kwargs` is identical across bindings that should share a pool; otherwise the cache key splits and you build extra engines (see EngineProvider §3.2).
- [ ] `pool_pre_ping=True` is set for every managed-database binding.
- [ ] `pool_recycle` is set below the database's server-side idle timeout (defaults: PG Flexible 5 min → 240s; MySQL 8h → `1800`; Azure SQL ~30 min → `1500`).
- [ ] `(pool_size + max_overflow) × max_function_app_instances × workers_per_instance` stays well below the database's `max_connections` ceiling.
- [ ] `pool_timeout` is set explicitly (default 30s); a queue-bound function should fail fast rather than hang on a saturated pool.
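The connection-ceiling item is plain arithmetic, sketched below. The helper names, the example numbers, and the 50% headroom used to interpret "well below" are assumptions for illustration; the `ENGINE_KWARGS` keys are the pool settings named in the checklist, kept in one module-level dict so every binding can pass the identical value.

```python
# One shared dict so every binding passes identical engine_kwargs
# (values here are examples, not recommendations).
ENGINE_KWARGS = {
    "pool_pre_ping": True,
    "pool_recycle": 240,   # below a 5-minute server-side idle timeout
    "pool_size": 5,
    "max_overflow": 5,
    "pool_timeout": 10,    # fail fast instead of waiting the 30 s default
}


def max_connections_needed(pool_size, max_overflow, max_instances, workers_per_instance):
    """Worst-case connections if every worker on every instance fills its pool."""
    return (pool_size + max_overflow) * max_instances * workers_per_instance


def fits_budget(needed, db_max_connections, headroom=0.5):
    """Assumed reading of 'well below': stay within a fraction of max_connections."""
    return needed <= db_max_connections * headroom


needed_at_10 = max_connections_needed(
    ENGINE_KWARGS["pool_size"], ENGINE_KWARGS["max_overflow"],
    max_instances=10, workers_per_instance=2,
)
needed_at_20 = max_connections_needed(
    ENGINE_KWARGS["pool_size"], ENGINE_KWARGS["max_overflow"],
    max_instances=20, workers_per_instance=2,
)
```

Here 10 instances need up to 200 connections and 20 instances need up to 400, so against a hypothetical `max_connections` of 500 only the first configuration stays inside the assumed 50% headroom.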
5. Checkpoint blob and identity
- [ ] Dedicated container (default `db-state`), not shared with `azure-webjobs-hosts` or other system containers.
- [ ] Container is pre-created in production with versioning / soft-delete enabled per your storage account's data-protection policy. (Azurite auto-creates; production does not.)
- [ ] Function App identity has scoped RBAC: `Storage Blob Data Contributor` on the `db-state` container only. Avoid account-wide roles.
- [ ] One state blob per production poller. No instance points at another poller's blob. Confirm with `state/{app_name}/{poller_name}.json`.
- [ ] `source_fingerprint` is unchanged from last deploy. If you changed `table`, `cursor_column`, `pk_columns`, or filters, the fingerprint mismatch will reject ticks until you reset deliberately. See Checkpoint / Lease Spec §9.
- [ ] Storage retry policy on `ContainerClient` matches the timer schedule (the default Azure SDK retry is fine for ≥1-minute timer intervals; tighten for sub-minute schedules).
6. Observability
A MetricsCollector is wired to your metrics backend, and the following
signals have alerts. All metrics are emitted with the azfdb_ prefix
(see src/azure_functions_db/observability.py
for the canonical names) and are labeled with poller_name.
- [ ] `azfdb_failures_total`: non-zero rate over a 5–10 min window pages on-call.
- [ ] `azfdb_lag_seconds`: gauge exceeding `2 × timer_interval` for more than 2 ticks indicates the trigger is falling behind.
- [ ] `azfdb_last_success_timestamp`: `now - last_success > 3 × timer_interval` indicates the trigger is stuck (no successful tick).
- [ ] `azfdb_batches_total{result="failure"}`: repeating failures on the same `checkpoint_after` indicate a poison batch (see §7).
- [ ] Structured logs (`event=tick_complete`, `event=handler_failed`, `event=commit_failed`, `event=lease_acquire_failed`) flow into your log store with `poller_name` and `invocation_id` searchable.
- [ ] A dashboard shows `azfdb_handler_duration_ms`, `azfdb_commit_duration_ms`, and `azfdb_batch_size` percentiles per `poller_name` so you can detect drift before it breaches `lease_ttl_seconds`.
For the metric inventory see
src/azure_functions_db/observability.py
and the README Observability section.
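The lag and staleness thresholds above translate into two small predicates. This is a sketch of the alert logic only, assuming one `azfdb_lag_seconds` sample per tick and Unix timestamps for `azfdb_last_success_timestamp`; the function names are hypothetical, and in practice the same conditions would normally be written in your metrics backend's alert language.

```python
def is_falling_behind(lag_samples_s, timer_interval_s, ticks=2):
    """True if lag exceeded 2 x timer_interval for more than `ticks` consecutive ticks.

    lag_samples_s holds one lag sample per tick, oldest first.
    """
    needed = ticks + 1  # "more than 2 ticks" means at least 3 in a row
    recent = lag_samples_s[-needed:]
    if len(recent) < needed:
        return False
    return all(lag > 2 * timer_interval_s for lag in recent)


def is_stuck(now_ts, last_success_ts, timer_interval_s):
    """True if no successful tick was recorded for more than 3 x timer_interval."""
    return now_ts - last_success_ts > 3 * timer_interval_s


timer = 60
behind = is_falling_behind([10, 130, 140, 150], timer)   # last 3 samples all above 120
recovering = is_falling_behind([130, 140, 90], timer)    # latest sample back under 120
stuck = is_stuck(now_ts=1000, last_success_ts=800, timer_interval_s=timer)
healthy = is_stuck(now_ts=1000, last_success_ts=850, timer_interval_s=timer)
```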
7. Runbook items
The on-call runbook covers each of the following recovery paths.
7.1 Poison batch (same batch fails repeatedly)
- Identify the failing batch: search for `event=handler_failed` with the same `checkpoint_after.cursor` repeated across ticks.
- Decide the resolution:
  - Fix forward: patch the handler or the source row, redeploy. The next tick re-delivers the batch and succeeds.
  - Skip forward (data loss): update the state blob to advance `checkpoint.cursor` past the poison row. Document this as a data incident.
- There is no automatic quarantine sink in MVP. See Polling Runtime §5.9.
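The identification step can be sketched as a count of `handler_failed` events per `checkpoint_after.cursor`. The record shape below (dicts with `event`, `poller_name`, and a nested `checkpoint_after`) is an assumption about how your log store returns structured logs, and the `min_repeats=3` threshold and `find_poison_cursors` name are illustrative.

```python
from collections import Counter


def find_poison_cursors(log_records, min_repeats=3):
    """Return (poller_name, cursor) pairs whose batch failed at least min_repeats times."""
    counts = Counter(
        (record["poller_name"], record["checkpoint_after"]["cursor"])
        for record in log_records
        if record.get("event") == "handler_failed"
    )
    return sorted(key for key, n in counts.items() if n >= min_repeats)


# Hypothetical log export: the same cursor fails on three ticks in a row.
logs = [
    {"event": "handler_failed", "poller_name": "orders", "checkpoint_after": {"cursor": "c-17"}},
    {"event": "tick_complete", "poller_name": "invoices"},
    {"event": "handler_failed", "poller_name": "orders", "checkpoint_after": {"cursor": "c-17"}},
    {"event": "handler_failed", "poller_name": "invoices", "checkpoint_after": {"cursor": "c-03"}},
    {"event": "handler_failed", "poller_name": "orders", "checkpoint_after": {"cursor": "c-17"}},
]
poison = find_poison_cursors(logs)
```

A single failure on `invoices` is not flagged; only the cursor that repeats across ticks is reported.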
7.2 Lost lease / fencing rejection
- Symptom: `LostLeaseError` in logs, `azfdb_failures_total{error_type="LostLeaseError"}` spiking.
- Most common cause: handler duration exceeded `lease_ttl_seconds`. Check `azfdb_handler_duration_ms` p99 against `lease_ttl_seconds`.
- Resolution: raise `lease_ttl_seconds`, lower `batch_size`, or split long-running side effects into a queue + worker pattern.
7.3 Storage outage
- Symptom: `event=lease_acquire_failed` for every tick, no checkpoint movement.
- The trigger self-heals once storage recovers. The last committed checkpoint is intact.
- Confirm the storage account is reachable and the Function App identity still has the scoped RBAC role.
7.4 Source fingerprint mismatch after migration
- Symptom: `FingerprintMismatchError` on every tick after a schema migration that changed `table`, `cursor_column`, `pk_columns`, or filters.
- Decide whether to resume from the existing checkpoint (only safe if the cursor semantics did not change) or reset and replay (use a new `name` for the poller, point at a new state blob, decide whether to backfill).
- There is no implicit reset in MVP. See Checkpoint / Lease Spec §10.
7.5 Manual checkpoint advance
- Last resort. Treat as a documented data incident.
- Acquire the lease (or wait for it to expire).
- Read the state blob, edit `checkpoint.cursor` and `checkpoint.last_successful_batch_id`, write back with the matching ETag.
- Capture the before/after blob in the incident ticket.
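The read, edit, and conditional write-back steps follow a compare-and-swap pattern, sketched below against an in-memory stand-in rather than real blob storage. `InMemoryBlob`, `PreconditionFailed`, and `advance_checkpoint` are hypothetical, and the JSON layout (a top-level `checkpoint` object holding `cursor` and `last_successful_batch_id`) is inferred from the field names above; check the Checkpoint / Lease Spec for the real format and use your storage SDK's conditional-write support in practice.

```python
import json
import uuid


class PreconditionFailed(Exception):
    """Raised when the blob changed since it was read (ETag mismatch)."""


class InMemoryBlob:
    """Stand-in for a state blob that supports ETag-conditional writes."""

    def __init__(self, body):
        self._body = body
        self.etag = uuid.uuid4().hex

    def read(self):
        return self._body, self.etag

    def write_if_match(self, body, etag):
        if etag != self.etag:
            raise PreconditionFailed("state blob was modified concurrently")
        self._body = body
        self.etag = uuid.uuid4().hex  # every successful write yields a new ETag


def advance_checkpoint(blob, new_cursor, new_batch_id):
    """Edit the checkpoint and write it back only if the ETag still matches."""
    raw, etag = blob.read()
    before = json.loads(raw)
    after = json.loads(raw)
    after["checkpoint"]["cursor"] = new_cursor
    after["checkpoint"]["last_successful_batch_id"] = new_batch_id
    blob.write_if_match(json.dumps(after), etag)
    return before, after  # capture both in the incident ticket


blob = InMemoryBlob(json.dumps(
    {"checkpoint": {"cursor": "c-17", "last_successful_batch_id": "b-100"}}
))
before, after = advance_checkpoint(blob, new_cursor="c-18", new_batch_id="b-101")
```

The ETag condition is what stops a manual edit from silently overwriting a checkpoint that a running poller committed between your read and your write.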
8. Pre-deploy smoke
The following smoke test runs against the production environment before traffic is enabled:
- [ ] Deploy with the timer disabled for the smoke test. The recommended mechanism on the v2 model is the per-function disable app setting (`AzureWebJobs.<FUNCTION_NAME>.Disabled=true`); a separate slot or a dedicated smoke environment also works. Avoid commenting out the `@app.schedule` decorator; that's a code change, not an operational toggle. Verify the Function App boots and the `EngineProvider` resolves the URL from app settings.
- [ ] Manually invoke the function once with a fixed timer payload. Verify a single successful tick: `event=tick_complete`, `result=success`, `total_processed=0` (no rows yet) or the expected backfill count.
- [ ] Verify the state blob exists and contains the expected `source_fingerprint` and an initial `checkpoint`.
- [ ] Re-enable the timer (`AzureWebJobs.<FUNCTION_NAME>.Disabled=false` or remove the setting).
See Also
- Polling Runtime & Failure Scenarios — operational reference.
- EngineProvider & Pooling Guidance — pool sizing detail.
- Semantics — formal contract.
- Checkpoint / Lease Spec — state blob format and CAS algorithm.