Why LLM Calls Are Not Reproducible (and How to Fix It)

Why LLM Calls Are Not Reproducible

LLM outputs are inherently non-deterministic. Even with the same prompt, results can vary across runs. While parameters like temperature introduce randomness, the problem goes deeper than sampling alone.

Many engineers assume that setting temperature = 0 guarantees deterministic output. In practice, this is not always true.


Why Temperature = 0 Still Isn’t Deterministic

Even with greedy decoding, identical inputs may still produce different outputs due to:

  • Numerical instability: floating-point arithmetic is non-associative, so GPU/TPU results can differ across runs and hardware depending on reduction order
  • Parallelism & kernel scheduling: small differences in execution order can flip token selection when probabilities are close
  • Model serving infrastructure: load balancing across replicas may introduce slight differences (e.g., different hardware, quantization, or kernel optimizations)
  • Token tie-breaking: when multiple tokens have near-identical probabilities, the winner may not be stable

As a result, temperature = 0 reduces randomness but does not guarantee reproducibility.
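The numerical-instability point can be demonstrated directly: floating-point addition is not associative, so the same values summed in a different order can produce slightly different results, which is enough to flip an argmax between near-tied tokens. A minimal TypeScript sketch:

```typescript
// Floating-point addition is not associative, so reduction order matters.
// When two logits are nearly tied, a tiny numerical difference like this
// can change which token greedy decoding (argmax) selects.
const xs = [1e16, 1.0, -1e16, 1e-8];

// Same numbers, two summation orders:
const leftToRight = xs.reduce((a, b) => a + b, 0);
const reordered = [xs[0], xs[2], xs[1], xs[3]].reduce((a, b) => a + b, 0);

console.log(leftToRight === reordered); // false on IEEE-754 doubles
```

On GPUs, parallel reductions do not fix a summation order, so this effect shows up between runs, not just between hand-picked orderings.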


What Makes Reproducibility Hard

Beyond decoding, modern LLM systems introduce additional variability:

  • Model version drift: providers may silently update the weights behind the same model name
  • Context changes: system prompts or conversation history may differ between runs
  • External dependencies: RAG retrieval results, tool outputs, or API responses may change over time

Replaying a request is therefore not just about re-sending a prompt.


How to Design for Replay

To enable reproducibility, you must capture the full execution context, including:

  • Prompt and system instructions
  • Model name and version
  • Inference parameters (temperature, top_p, etc.)
  • Retrieved documents (for RAG)
  • Tool inputs and outputs
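One way to make this concrete is a single record type that bundles all of these fields. The shape below is illustrative, not from any particular library:

```typescript
// A hypothetical record of everything needed to replay one LLM call.
// Field names are illustrative, not tied to any specific provider SDK.
type CapturedCall = {
  prompt: string;
  systemPrompt?: string;
  model: string;            // exact model name/version, ideally a pinned snapshot id
  params: {
    temperature: number;
    topP?: number;
    maxTokens?: number;
  };
  retrievedDocs?: string[]; // snapshot of RAG context at call time
  toolCalls?: {
    name: string;
    input: unknown;
    output: unknown;
  }[];
};

const call: CapturedCall = {
  prompt: "Summarize the report",
  model: "example-model-2024-01-01",
  params: { temperature: 0 },
};
```

Persisting a record like this alongside every request is what makes replay possible later.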

A practical approach is to model each request as a trace tree, where each node records:

type Node = {
  id: string;                // unique node id
  parentId?: string;         // omitted for the root node
  input: unknown;            // input to this step (prompt, query, tool args)
  output: unknown;           // recorded output of this step
  metadata: {
    model?: string;          // exact model name/version used
    tokens?: number;         // token usage
    latency?: number;        // latency in milliseconds
  };
};

This allows step-by-step replay and debugging.
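As a sketch of how replay can work, the flat list of nodes can be walked depth-first from the root, visiting each recorded input/output in the original execution order (the Node type is repeated so the example is self-contained):

```typescript
type Node = {
  id: string;
  parentId?: string;
  input: unknown;
  output: unknown;
  metadata: { model?: string; tokens?: number; latency?: number };
};

// Walk the trace tree depth-first from the root, visiting each node
// in the order the original steps executed.
function replay(nodes: Node[], visit: (n: Node) => void, parentId?: string): void {
  for (const n of nodes.filter((n) => n.parentId === parentId)) {
    visit(n);
    replay(nodes, visit, n.id);
  }
}

const trace: Node[] = [
  { id: "root", input: "user question", output: "final answer", metadata: {} },
  { id: "retrieve", parentId: "root", input: "query", output: ["doc1"], metadata: {} },
  { id: "generate", parentId: "root", input: "prompt+docs", output: "answer", metadata: { model: "m1" } },
];

const order: string[] = [];
replay(trace, (n) => order.push(n.id));
console.log(order); // ["root", "retrieve", "generate"]
```

In a real system the `visit` callback would re-execute or diff each step rather than just collect ids.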


Snapshot vs Mock

For external dependencies, two strategies are common:

  • Snapshot: store exact data used at runtime (e.g., retrieved docs)
  • Mock: simulate tool/API responses during replay

Both aim to eliminate time-based variability.
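A minimal sketch of both strategies for a hypothetical search tool (names and signatures are illustrative):

```typescript
// A tool is just a function from input to output here.
type Tool = (input: string) => string;

// Snapshot replay: look up the exact output recorded at runtime.
function snapshotTool(recorded: Map<string, string>): Tool {
  return (input) => {
    const out = recorded.get(input);
    if (out === undefined) throw new Error(`no snapshot for input: ${input}`);
    return out;
  };
}

// Mock replay: return a fixed, simulated response regardless of input.
const mockSearch: Tool = () => "stubbed results";

// Data captured during the original run:
const recorded = new Map([["weather", "sunny, 20°C (recorded 2024-06-01)"]]);
const replaySearch = snapshotTool(recorded);

console.log(replaySearch("weather")); // the exact data captured at runtime
console.log(mockSearch("weather"));   // "stubbed results"
```

Snapshots reproduce what actually happened; mocks trade fidelity for simplicity and are often enough for testing control flow.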

Key Takeaway

Reproducibility in LLM systems is not about rerunning the model; it is about reconstructing the original execution environment. Even with temperature = 0, true determinism is not guaranteed, so systems must be designed for observability and replay from the start.