Why Direct LLM API Calls Break In Production

LLM request reliability is the practice of making a model request durable, retryable, deduplicated, and observable after it leaves the application process.

A category-defining guide to the boring failure modes behind LLM calls: timeouts, rate limits, worker restarts, duplicate retries, and unknown outcomes.

The demo version is simple

In a demo, calling an LLM directly is exactly the right move. The request starts, the response comes back, and the code stays easy to read.

That direct shape is why LLM APIs became so easy to adopt. The problem is not the request shape. The problem is what happens when that request becomes important enough that losing it creates real work for someone.

TypeScript
const response = await openai.chat.completions.create({
  model: "gpt-5-nano",
  messages: [{ role: "user", content: "Classify this ticket." }],
});

The production version is less kind

Production LLM calls fail in ordinary ways. The upstream API can return 429. The network can time out. A worker can restart halfway through the call. A user can submit the same action twice. A client can retry after the server has already accepted the work.

None of those failures are exotic. They are the normal conditions of software running outside a happy path.

  • A timeout leaves the app unsure whether the model call actually finished.
  • A client retry can create duplicate model work.
  • A worker restart can lose in-flight state.
  • A 429 or 5xx needs retry behavior, but not all failures should be retried.
  • A background task needs a request id so the app can inspect the outcome later.
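The retry distinction in the last bullets can be sketched in a few lines. This is an illustration, not ReqRun's implementation; the `isRetryable` and `retryWithBackoff` names are assumptions.

```typescript
// Retryable: rate limits (429) and transient server errors (5xx).
// Not retryable: client errors such as 400 or 401, where a retry repeats the mistake.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status < 600);
}

// Retry a call with exponential backoff, but only for retryable failures.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (attempt >= maxAttempts || !isRetryable(err?.status)) throw err;
      // Backoff: 500ms, 1000ms, 2000ms, ...
      await new Promise((r) => setTimeout(r, 500 * 2 ** (attempt - 1)));
    }
  }
}
```

Note that this loop alone does nothing about the "unknown outcome" case: a timeout can fire after the upstream call succeeded, which is exactly where deduplication becomes necessary.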

The missing object is the request record

The key difference between a direct LLM call and a reliable LLM request is the existence of a durable request record.

A request record gives the operation a stable id, a status, attempts, safe error metadata, and a final result when one exists. That record lets the application recover without guessing.
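A minimal sketch of such a record, with hypothetical field names (ReqRun's exact shape is not shown here):

```typescript
type RequestStatus = "queued" | "running" | "succeeded" | "failed";

interface RequestRecord {
  id: string;        // stable id the app can store and look up later
  status: RequestStatus;
  attempts: number;  // how many times the call has been tried
  lastError?: { code: string; message: string }; // safe metadata, no payloads
  result?: unknown;  // final result, present only when one exists
}

// A freshly accepted request starts queued with zero attempts.
function newRequestRecord(id: string): RequestRecord {
  return { id, status: "queued", attempts: 0 };
}
```

Because the record lives outside the worker process, a restart loses the in-flight call but not the knowledge that the call existed.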

How ReqRun approaches it

ReqRun keeps the OpenAI-compatible request shape but moves the reliability concerns around it into a small execution layer.

The application sends the request to ReqRun. ReqRun stores it, queues it, retries retryable failures, deduplicates repeated submissions with an idempotency key, and exposes status through GET /v1/requests/{id}.

TypeScript
const response = await reqrun.chat.completions.create({
  model: "gpt-5-nano",
  messages: [{ role: "user", content: "Classify this ticket." }],
  wait: true, // return the result inline when the call finishes quickly
  idempotency_key: "ticket-842-classification", // dedupes repeated submissions
});

// If the call did not finish inline, an async handle with a request id comes back.
if (response.object === "chat.completion.async") {
  await saveRequestId(response.id);
}
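With a saved request id, the app can check the outcome later through the status endpoint. A rough polling sketch, assuming the returned record has a status field and that succeeded and failed are the terminal values (both assumptions):

```typescript
const TERMINAL = new Set(["succeeded", "failed"]);

function isTerminal(status: string): boolean {
  return TERMINAL.has(status);
}

// Poll GET /v1/requests/{id} until the request reaches a terminal status.
async function waitForResult(baseUrl: string, id: string): Promise<unknown> {
  while (true) {
    const res = await fetch(`${baseUrl}/v1/requests/${id}`);
    const record = await res.json();
    if (isTerminal(record.status)) return record;
    await new Promise((r) => setTimeout(r, 1000)); // simple fixed interval
  }
}
```

A production caller would add a deadline and backoff to the loop; the point is that the lookup keys off the durable id, not off any in-process state.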

What this is not

This is not workflow orchestration. It is not model routing. It is not an analytics platform. Those products can be useful, but they solve different problems.

ReqRun focuses on the narrow reliability layer around one OpenAI-compatible LLM request. That narrowness is the point: when the failure mode is one model request, the fix should not require adopting a whole platform.

Common mistakes

The most common mistake is adding retry loops without idempotency. That can turn a timeout into duplicate upstream work.
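One client-side illustration of the fix, using a hypothetical dedupe helper: sharing one in-flight promise per idempotency key keeps a retry from starting the same work twice.

```typescript
const inFlight = new Map<string, Promise<unknown>>();

// Returns the existing in-flight promise for a key instead of starting new work.
function dedupe<T>(key: string, start: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;
  const p = start().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}
```

This only dedupes within one process; surviving restarts and duplicate submissions from other clients requires a durable idempotency key stored alongside the request record, which is what ReqRun does server-side.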

The second mistake is treating a background LLM task as if it can always finish inside one HTTP request. Some calls finish quickly. Some need a durable request id and status lookup.