Skip to content

Checkpoint / Lease Spec

1. Purpose

In a pseudo trigger, the most critical state consists of checkpoint and lease.

  • checkpoint: how far processing has successfully progressed
  • lease: who currently holds the processing authority

Core design decision: checkpoint and lease are stored in a single state blob, and all state changes are performed via ETag-based CAS (conditional write). This eliminates TOCTOU race conditions between separate blobs entirely.

2. Storage Selection

MVP

  • Azure Blob Storage

Rationale: - Easy to use alongside Azure Functions - Already available in most operational environments - Good cost/simplicity balance - Supports ETag-based conditional writes (If-Match)

Future Candidates

  • Azure Table Storage
  • Cosmos DB
  • SQL table

3. Blob Layout

Container: db-state

Blob path example:

state/{app_name}/{poller_name}.json

Instead of the legacy checkpoints/ + leases/ split, a single state blob is used.

4. State Document Format (Unified)

{
  "version": 1,
  "poller_name": "orders",
  "source_fingerprint": "sha256:...",

  "checkpoint": {
    "cursor": {
      "kind": "timestamp+pk",
      "value": "2026-04-07T01:23:45.123456Z",
      "tiebreaker": {
        "id": 12093
      }
    },
    "last_successful_batch_id": "batch_20260407_012346_0001",
    "updated_at": "2026-04-07T01:23:46.020000Z",
    "metadata": {
      "row_count": 100
    }
  },

  "lease": {
    "owner_id": "funcapp/instance-abc123",
    "fencing_token": 42,
    "acquired_at": "2026-04-07T01:23:00Z",
    "heartbeat_at": "2026-04-07T01:23:20Z",
    "expires_at": "2026-04-07T01:25:00Z"
  }
}

5. CAS-Based State Transitions

All state changes follow this pattern:

1. Read state blob → obtain (content, etag)
2. Apply changes (acquire lease, advance checkpoint, etc.)
3. Conditional write (If-Match: etag)
4. Success → complete
5. Failure (412 Precondition Failed) → retry or abort

This pattern applies equally to lease acquisition, heartbeat, and checkpoint commit.

6. Lease Acquisition Algorithm

1. Read state blob → (state, etag)
2. If state doesn't exist → create new state (fencing_token=1)
3. If lease is expired → increment fencing_token and set owner
4. Conditional write (If-Match: etag)
5. Success → lease acquisition confirmed
6. Failure → another instance acquired first, skip

Rules

  • Stealing a lease before expiry from another instance is forbidden
  • Use a safety margin to account for local clock skew
  • All lease changes are performed atomically via CAS

7. Heartbeat

A heartbeat is required because handlers may take a long time to complete.

Default rules: - heartbeat_interval < lease_ttl / 2 - Consider halting execution after n consecutive heartbeat failures - Heartbeat = CAS write to state blob (updating heartbeat_at and expires_at) - CAS failure = lease lost; handler must be stopped

8. Commit Algorithm

1. Read state blob → (state, etag)
2. Verify owner_id / fencing_token match
3. Update checkpoint (cursor, batch_id, updated_at)
4. Conditional write (If-Match: etag)
5. Success → checkpoint advance complete
6. Failure → CommitError (batch is unconfirmed; can be reprocessed)

Key Principles

  • Checkpoint and lease validation occur in the same CAS write
  • A stale owner's commit is automatically rejected by etag mismatch
  • No separate lease blob check required → no TOCTOU race

9. Source Fingerprint

The state records a hash of the source definition.

Includes: - DB URL (excluding password) - table/query - cursor column - PK columns - filters

If the source fingerprint changes, the default policy is: - Reject execution, or - Require an explicit reset/backfill

10. Reset Policy

Supported commands: - reset_to_beginning - reset_to_checkpoint(file) - reset_to_cursor(value, pk) - clone_checkpoint(new_poller_name)

Resets in production are dangerous; guardrails are required in the CLI.

11. Failure Scenarios

CAS Write Succeeds, Then Function Exits

  • Already committed
  • Next tick starts from the new checkpoint

Handler Succeeds, CAS Write Fails

  • Same batch can be reprocessed on the next tick
  • Duplicates may occur

CAS Write Response Timeout (Ambiguous State)

  • Commit success is uncertain
  • Resolved by reloading state on the next tick
  • Worst case: duplicate, no loss

Lease Expires and Another Instance Takes Over

  • Stale owner's commit attempt is automatically rejected by etag mismatch
  • Duplicates possible, loss prevented

Heartbeat CAS Failure

  • Treated as lease loss
  • Handler stops; no commit attempt is made

12. Operational Guidelines

  • One state blob per production poller
  • Use a separate poller_name for backfill (separate state blob)
  • Do not reuse state blobs after source definition changes
  • Minimize storage RBAC or connection string permissions