Suiri applies pre-selected default quantization settings to each model, balancing inference efficiency, cost, and output quality.

What is quantization?

Quantization refers to the numerical precision used to represent model weights. Lower-precision formats reduce memory and compute requirements at the cost of some numerical fidelity. While different quantization techniques exist, Suiri selects and manages these settings internally for each model; developers do not need to configure or tune quantization parameters.
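To make the precision trade-off concrete, here is a minimal sketch of symmetric integer quantization: weights are mapped onto a signed integer grid of a given bit width and then mapped back. This is illustrative only and is not Suiri's internal scheme; the function names and the toy weight values are invented for the example.

```python
import numpy as np

def quantize(weights, bits):
    """Symmetric integer quantization: map float weights onto a signed
    integer grid with `bits` bits of precision. Illustrative sketch only."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(weights).max() / qmax    # per-tensor scale factor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer grid."""
    return q.astype(np.float32) * scale

# Toy weights: the round-trip error shrinks as precision increases.
w = np.array([0.42, -1.37, 0.05, 0.91], dtype=np.float32)
for bits in (4, 8):
    q, s = quantize(w, bits)
    err = np.abs(w - dequantize(q, s)).max()
    print(f"{bits}-bit max round-trip error: {err:.4f}")
```

Running the loop shows the 8-bit round-trip error is smaller than the 4-bit error, which is the fidelity difference the Q4 and Q8 labels below refer to.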

Common quantization labels

Q4 (e.g., q4_0)

  • 4-bit weight representation
  • Lower memory footprint
  • Optimized for cost-efficient, low-latency inference

Q8

  • 8-bit weight representation
  • Higher numerical precision than Q4
  • Used for models requiring greater fidelity

FP8

  • 8-bit floating-point representation
  • Wider dynamic range than integer formats of the same bit width
  • Typically used for larger or more capable models

What this means for developers

  • Quantization is handled automatically by Suiri.
  • There are no user-configurable quantization parameters.
  • Model behavior is defined by the selected model ID.
  • Changing quantization always results in a different model ID.

Suiri does not guarantee identical outputs across different models or quantization schemes. Developers should evaluate model behavior using their own workloads.
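Because quantization surfaces only through the model ID, model selection in application code reduces to choosing an ID rather than setting a precision knob. The sketch below illustrates that pattern; the model IDs and helper function are entirely hypothetical and do not reflect Suiri's actual ID scheme:

```python
# Hypothetical model IDs -- Suiri's real naming scheme may differ.
MODEL_FAST = "example-model-13b-q4_0"    # 4-bit variant: lower cost/latency
MODEL_PRECISE = "example-model-13b-q8"   # 8-bit variant: a *different* model ID

def pick_model(latency_sensitive: bool) -> str:
    """Choose between quantization variants by model ID only.
    There is no separate quantization parameter to set."""
    return MODEL_FAST if latency_sensitive else MODEL_PRECISE
```

Evaluating each candidate ID against your own workload, as recommended above, is the only reliable way to compare variants.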