Suiri provides an OpenAI-compatible inference API optimized for low-latency, cost-efficient AI inference at global scale.

Request Flow

Applications send stateless inference requests to Suiri, which routes each request to the optimal edge datacenter hosting the requested model. Responses are returned synchronously over HTTPS.

OpenAI-Compatible API

Suiri implements the OpenAI Chat Completions API schema, allowing existing applications to integrate with minimal or no code changes.
  • Endpoint: POST /v1/chat/completions
  • Request and response formats follow OpenAI conventions
  • Supports system, user, and assistant roles
  • Token usage is returned with every response
This compatibility enables you to:
  • Reuse existing OpenAI SDKs and clients
  • Swap API keys without refactoring application logic
  • Gradually migrate workloads to Suiri
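Because the request schema matches the Chat Completions API, an existing OpenAI SDK can usually be pointed at Suiri by changing only the API key and base URL. The sketch below uses the official OpenAI Python SDK; the base URL (https://api.suiri.example/v1) and model name (suiri-model) are illustrative placeholders, not documented values.

  # Hypothetical sketch: pointing the official OpenAI Python SDK at Suiri.
  # The base URL and model name are placeholders -- substitute the values
  # provided for your account.
  from openai import OpenAI

  client = OpenAI(
      api_key="YOUR_SUIRI_API_KEY",            # Suiri API key instead of an OpenAI key
      base_url="https://api.suiri.example/v1",  # placeholder base URL
  )

  response = client.chat.completions.create(
      model="suiri-model",  # placeholder model identifier
      messages=[
          {"role": "system", "content": "You are a concise assistant."},
          {"role": "user", "content": "Summarize what an edge datacenter is."},
      ],
  )

  print(response.choices[0].message.content)
  print(response.usage)  # token usage is returned with every response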

Stateless Inference Model

Each inference request to Suiri is fully stateless:
  • Conversation history must be provided by the client
  • No server-side session memory is maintained
  • No prompts or responses are persisted after inference completes
This design ensures:
  • Predictable behavior
  • Strong privacy guarantees
  • Horizontal scalability across regions
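Because no session memory is kept on the server, multi-turn conversations are assembled on the client by resending the full message history with each request. A minimal sketch, reusing the hypothetical client and model name from the example above:

  # Client-side conversation state: Suiri keeps no session memory, so the
  # accumulated message list is sent in full on every request.
  messages = [{"role": "system", "content": "You are a helpful assistant."}]

  def ask(client, user_text):
      messages.append({"role": "user", "content": user_text})
      response = client.chat.completions.create(
          model="suiri-model",  # placeholder model identifier
          messages=messages,    # full history supplied by the client
      )
      reply = response.choices[0].message.content
      # Record the assistant turn so the next request carries the full history.
      messages.append({"role": "assistant", "content": reply})
      return reply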

Request Lifecycle

A typical inference request follows this path:
  1. Your application sends a request to the Suiri API
  2. Suiri authenticates the request using your API key
  3. The global control plane selects the optimal region based on:
    a. Model availability
    b. Proximity and latency
    c. Capacity and health
  4. The request is executed on the selected edge inference cluster
  5. The model response is returned to your application
  6. Token usage is recorded for billing and observability
All routing decisions are handled automatically by Suiri.
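For reference, the same lifecycle can be exercised over raw HTTPS. The sketch below is illustrative only: the base URL is a placeholder, and bearer-token authentication in the Authorization header follows the common OpenAI convention rather than a documented Suiri requirement.

  # Raw HTTPS sketch of the lifecycle: send the request (step 1), authenticate
  # with the API key (step 2), receive the response (step 5), read usage (step 6).
  import requests

  resp = requests.post(
      "https://api.suiri.example/v1/chat/completions",  # placeholder base URL
      headers={"Authorization": "Bearer YOUR_SUIRI_API_KEY"},  # assumed auth convention
      json={
          "model": "suiri-model",  # placeholder model identifier
          "messages": [{"role": "user", "content": "Hello"}],
      },
      timeout=30,
  )
  resp.raise_for_status()
  body = resp.json()
  print(body["choices"][0]["message"]["content"])  # model response
  print(body["usage"])                             # token usage for billing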

Token Accounting

Suiri uses token-based billing, similar to OpenAI and other inference platforms:
  • Input tokens: tokens in your prompt and messages
  • Output tokens: tokens generated by the model
  • Pricing is specified per 1,000 tokens
  • Usage details are included in every API response
Token counts are deterministic and model-specific.
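A short sketch of reading the usage block and estimating cost from it; the per-1,000-token prices below are invented placeholders, and the response object is assumed to come from the SDK example above.

  # Estimate request cost from the usage block. Prices are placeholders --
  # use the rates published for your model and plan.
  INPUT_PRICE_PER_1K = 0.0005   # placeholder USD per 1,000 input tokens
  OUTPUT_PRICE_PER_1K = 0.0015  # placeholder USD per 1,000 output tokens

  usage = response.usage  # included in every API response
  input_cost = (usage.prompt_tokens / 1000) * INPUT_PRICE_PER_1K
  output_cost = (usage.completion_tokens / 1000) * OUTPUT_PRICE_PER_1K
  print(f"{usage.prompt_tokens} in / {usage.completion_tokens} out "
        f"-> ${input_cost + output_cost:.6f}")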

When to Use Suiri

Suiri is optimized for:
  • Latency-sensitive inference workloads
  • Stateless AI services and APIs
  • Mobile, web, and edge applications
  • Cost-conscious production deployments
Suiri is not designed for:
  • Long-lived conversational state
  • Model training or fine-tuning
  • Data storage or prompt retention