The core rule
A retry engine is only useful if it does not create duplicate work. That rule shaped the implementation more than anything else.
ReqRun starts with durable storage, project-scoped idempotency, worker lock tokens, and an attempt record for every processing attempt.
The state machine stays small
The request lifecycle intentionally has only five states: queued, processing, retrying, completed, and failed.
Small state machines are easier to reason about during incidents. If a request is not terminal, the worker should eventually be able to claim it or schedule it for another attempt.
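The example paths that follow can be checked mechanically against a small transition table. This is a hypothetical sketch of that idea in Python, not ReqRun's actual code:

```python
# Hypothetical: the five lifecycle states and the moves allowed between them.
ALLOWED = {
    "queued": {"processing"},
    "processing": {"completed", "retrying", "failed"},
    "retrying": {"processing"},
    "completed": set(),  # terminal
    "failed": set(),     # terminal
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

A table like this makes "is this request terminal?" a one-line check: terminal states are exactly the ones with no outgoing transitions.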
queued -> processing -> completed
queued -> processing -> retrying -> processing -> completed
queued -> processing -> failed
Claiming work safely
The worker claims one request at a time. A claim sets status=processing, locked_at, and a lock_token. The worker only proceeds if it owns that token.
That check matters because stale workers, restarts, and repeated polling can otherwise create double execution. Stale lock recovery uses a lock timeout so abandoned processing rows can re-enter the claimable set.
Retryable is not the same as failed
ReqRun retries network errors, timeouts, HTTP 429, and HTTP 5xx responses. Most 4xx responses are terminal because the input usually needs to change.
This distinction keeps the system from hammering OpenAI with invalid requests while still recovering from temporary upstream pressure.
Attempts are product data
Attempts are not just logs. They are part of the product surface because developers need to see what happened without digging through process output.
A good attempt record can answer: when did the attempt start, did it finish, what status did it produce, did OpenAI return a status code, and what safe error code can the dashboard show?
- attempt_number
- started_at and finished_at
- status
- upstream_status_code
- duration_ms
- safe error code and short message
Backoff and jitter
Backoff gives the upstream time to recover. Jitter prevents many queued requests from retrying at exactly the same moment.
The policy does not need to be fancy for v1. The important part is that it is predictable, bounded by max retries, and visible through request status.
The practical lesson
Keep retry state boring. The useful fields are request status, attempts count, next_retry_at, last safe error code, and a timeline of attempts.
If those fields are correct, the dashboard and API can explain almost every request outcome without logging raw prompts in normal app logs.
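As a closing sketch, those boring fields are enough to assemble the dashboard view directly. The function and key names here are hypothetical, assuming request rows and attempt records stored as plain dicts:

```python
def request_summary(row: dict, attempts: list[dict]) -> dict:
    """Assemble a dashboard-facing summary from stored retry state."""
    return {
        "status": row["status"],
        "attempts": len(attempts),
        "next_retry_at": row.get("next_retry_at"),
        "last_error_code": row.get("last_error_code"),
        # Timeline of safe, already-sanitized attempt data; no raw prompts.
        "timeline": [
            {"attempt": a["attempt_number"], "status": a["status"]}
            for a in attempts
        ],
    }
```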