Retryable Idempotent Operations

Written 2020-11-16 — Updated 2022-09-30

Author: Brandur Leach
Source: https://brandur.org/idempotency-keys

If you can pull it off, it's best to use a single database transaction (with Serializable Transactions if you need them) and design the operation to be idempotent. That can't always be done, though, especially when you must interact with external services.
The Idea of Idempotency Keys
- Operations that touch external services can't be made inherently idempotent, but the client can send an accompanying unique ID (the idempotency key) to track the operation. If a record for that ID already exists, the server can check whether the operation succeeded, failed, or is still in progress, and act appropriately.
- When a request comes in, the server creates a new record for the idempotency key, and uses this to track the progress of the request.
- The response is saved with the idempotency key record, so if the client sends a request for the same key again, the server just returns the same response.
- The record has the following data:
  - The key
  - When it was last run
  - When it was last locked (if currently locked). This keeps multiple requests from trying to rerun it at the same time.
  - Parameters of the request, needed to rerun it and as a sanity check that the client isn't sending different requests with the same idempotency key.
  - Parameters of the response, if present, to return on repeated requests after final success or failure.
  - Current recovery point.
  - User ID of requester (optional but helpful)
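A minimal sketch of the record and the lookup described above, using an in-memory dict in place of a database table. The field names, the `begin_request` helper, the exception classes, and the 90-second `LOCK_TIMEOUT` are all illustrative assumptions, not something these notes prescribe.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: a lock older than this is treated as abandoned.
LOCK_TIMEOUT = timedelta(seconds=90)


@dataclass
class IdempotencyKey:
    key: str
    request_params: dict                    # needed to rerun, and to sanity-check retries
    user_id: Optional[int] = None           # optional but helpful
    last_run_at: Optional[datetime] = None
    locked_at: Optional[datetime] = None    # None when not locked
    response: Optional[dict] = None         # set once the request has a final outcome
    recovery_point: str = "started"


class ParamsMismatch(Exception):
    """The same key was reused with different request parameters."""


class AlreadyInProgress(Exception):
    """Another request currently holds the lock for this key."""


def begin_request(store, key, params, user_id=None, now=None):
    """Return (record, saved_response); saved_response is None if work should run."""
    now = now or datetime.now(timezone.utc)
    record = store.get((user_id, key))
    if record is None:
        # First time we see this key: create the tracking record.
        record = IdempotencyKey(key=key, request_params=params, user_id=user_id)
        store[(user_id, key)] = record
    elif record.request_params != params:
        raise ParamsMismatch(key)
    elif record.response is not None:
        # Already finished: replay the stored response.
        return record, record.response
    elif record.locked_at is not None and now - record.locked_at < LOCK_TIMEOUT:
        raise AlreadyInProgress(key)
    # Take (or retake) the lock and let the caller run the operation.
    record.locked_at = now
    record.last_run_at = now
    return record, None
```

In a real service the lookup, insert, and lock would happen inside one database transaction so that two concurrent requests can't both take the lock.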
Operation Phases
- Operations that talk to foreign services can also be split into phases. The end of each phase is recorded as a recovery point, so we know exactly how far we got and where to start trying again. A group of local operations between two foreign operations that can all go in one database transaction can be seen as a single phase.
- Putting any work that can be deferred into a background worker helps simplify this a lot too since there are fewer phases to track. We add the operation using Transactionally Staged Job Drain Queues as part of a phase, and let it execute independently.
- There are essentially two types of phases:
  - Atomic phases are those that talk to a database and comprise a single transaction.
    - A phase can end with a recovery point, which sets that point as the current one in the database. This should happen as part of the same transaction.
    - A phase can indicate that it's time to return a response to the user. This could be a success or an error that can't be retried (e.g. bad user input that can't be processed, an invalid credit card, etc.).
    - A phase can also indicate a NoOp, which just means to continue running and not set a recovery point.
  - Foreign state mutations are those that talk to any other service, internal or external.
- The rules for determining phases:
  - Upserting the idempotency key is its own phase.
  - Every foreign state mutation is a phase.
  - Everything that happens between two foreign state mutations is a single phase, and each such group should run inside a single transaction.
- After each phase, the idempotency key record's recovery point is updated so that it knows where to start from again.
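The phase rules above can be sketched as a small runner. This is only an illustration under stated assumptions: the record is a plain dict, `run_phases` and the `"finished"` recovery point are hypothetical names, and the three result types mirror the recovery point, response, and NoOp outcomes listed earlier.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class RecoveryPoint:
    name: str       # recovery point to record once the phase completes


@dataclass
class Response:
    status: int     # final outcome to return to the client
    body: dict


@dataclass
class NoOp:
    pass            # keep going without recording a recovery point


PhaseResult = Union[RecoveryPoint, Response, NoOp]
Phase = Tuple[str, Callable[[dict], PhaseResult]]


def run_phases(record: dict, phases: List[Phase]) -> Response:
    """Run phases in order, resuming at the record's current recovery point."""
    if record.get("response") is not None:
        return record["response"]       # already finished: replay the response
    names = [name for name, _ in phases]
    start = names.index(record["recovery_point"])
    for _, phase in phases[start:]:
        result = phase(record)
        if isinstance(result, RecoveryPoint):
            # For an atomic phase, this update belongs in the phase's own transaction.
            record["recovery_point"] = result.name
        elif isinstance(result, Response):
            record["recovery_point"] = "finished"
            record["response"] = result
            return result
        # NoOp: fall through to the next phase; nothing is recorded.
    raise RuntimeError("phases ended without producing a response")
```

If a phase raises, the record keeps the last recovery point that was set, so a later call to `run_phases` with the same record restarts at the failed phase rather than at the beginning.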
Problems with non-idempotent foreign services
- When talking to foreign APIs that don't support idempotency on their own, some types of failures cause trouble. For example, on a timeout error we may not know whether our request actually went through. There isn't a good general way to handle this, and you may just have to mark the operation as failed.
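One way to picture the ambiguity, as a hedged sketch: a refused connection means the request never reached the service, so a retry is safe, while a timeout leaves the outcome unknown. The `run_foreign_mutation` helper, the dict-based record, the `needs_review` flag, and the stored error response are hypothetical choices made for illustration.

```python
def run_foreign_mutation(record: dict, call) -> bool:
    """Run a non-idempotent foreign call; return True only on known success."""
    try:
        call()
    except ConnectionRefusedError:
        # The connection was never established, so the request was not sent.
        # Leave the recovery point alone so a retry can run this phase again.
        return False
    except TimeoutError:
        # The request may or may not have gone through. Retrying could apply it
        # twice, so mark the operation failed and flag it for a human to check.
        record["recovery_point"] = "finished"
        record["response"] = {"status": 500, "error": "foreign call outcome unknown"}
        record["needs_review"] = True
        return False
    return True
```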
Additional useful processes
- A completer looks for unfinished operations that the client never retried and tries to finish them.
- A reaper deletes old idempotency key records. If an old operation is still unfinished, the reaper should log it somewhere visible.
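A rough sketch of both processes over a list of dict records. The thresholds (`COMPLETER_GRACE`, `REAP_AFTER`), the `"finished"` recovery point, and the function names are assumptions chosen for illustration; a real version would query the database on a schedule.

```python
import logging
from datetime import timedelta

log = logging.getLogger("idempotency")

COMPLETER_GRACE = timedelta(minutes=5)  # assumed: how long to wait for the client to retry
REAP_AFTER = timedelta(hours=72)        # assumed: how long records are kept


def complete_stale(records, retry, now):
    """Call retry(record) for unfinished records the client seems to have abandoned."""
    for rec in records:
        unfinished = rec["recovery_point"] != "finished"
        if unfinished and now - rec["last_run_at"] > COMPLETER_GRACE:
            retry(rec)


def reap(records, now):
    """Return the records to keep, logging any unfinished ones that get dropped."""
    kept = []
    for rec in records:
        if now - rec["last_run_at"] > REAP_AFTER:
            if rec["recovery_point"] != "finished":
                log.error(
                    "reaping unfinished idempotency key %s at recovery point %s",
                    rec["key"],
                    rec["recovery_point"],
                )
            continue  # too old: drop the record
        kept.append(rec)
    return kept
```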