Suiri provides an OpenAI-compatible inference API optimized for low-latency, cost-efficient AI inference at global scale.

Request Flow

Applications send stateless inference requests to Suiri, which routes each request to the optimal edge datacenter hosting the requested model. Responses are returned synchronously over HTTPS.

OpenAI-Compatible API

Suiri implements the OpenAI Chat Completions API schema, allowing existing applications to integrate with minimal or no code changes.
  • Endpoint: POST /v1/chat/completions
  • Request and response formats follow OpenAI conventions
  • Supports system, user, and assistant roles
  • Token usage is returned with every response
This compatibility enables you to:
  • Reuse existing OpenAI SDKs and clients
  • Swap API keys without refactoring application logic
  • Gradually migrate workloads to Suiri
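Because the request schema matches the Chat Completions API, an existing OpenAI SDK can usually be pointed at Suiri by changing only the API key and base URL. The sketch below uses the official OpenAI Python SDK; the base URL (https://api.suiri.example/v1) and model name (suiri-model) are illustrative placeholders, not documented values.

  # Hypothetical sketch: pointing the official OpenAI Python SDK at Suiri.
  # The base URL and model name are placeholders -- substitute the values
  # provided for your account.
  from openai import OpenAI

  client = OpenAI(
      api_key="YOUR_SUIRI_API_KEY",            # Suiri API key instead of an OpenAI key
      base_url="https://api.suiri.example/v1",  # placeholder base URL
  )

  response = client.chat.completions.create(
      model="suiri-model",  # placeholder model identifier
      messages=[
          {"role": "system", "content": "You are a concise assistant."},
          {"role": "user", "content": "Summarize what an edge datacenter is."},
      ],
  )

  print(response.choices[0].message.content)
  print(response.usage)  # token usage is returned with every response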

Stateless Inference Model

Each inference request to Suiri is fully stateless:
  • Conversation history must be provided by the client
  • No server-side session memory is maintained
  • No prompts or responses are persisted after inference completes
This design ensures:
  • Predictable behavior
  • Strong privacy guarantees
  • Horizontal scalability across regions
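Because no session memory is kept on the server, multi-turn conversations are assembled on the client by resending the full message history with each request. A minimal sketch, reusing the hypothetical client and model name from the example above:

  # Client-side conversation state: Suiri keeps no session memory, so the
  # accumulated message list is sent in full on every request.
  messages = [{"role": "system", "content": "You are a helpful assistant."}]

  def ask(client, user_text):
      messages.append({"role": "user", "content": user_text})
      response = client.chat.completions.create(
          model="suiri-model",  # placeholder model identifier
          messages=messages,    # full history supplied by the client
      )
      reply = response.choices[0].message.content
      # Record the assistant turn so the next request carries the full history.
      messages.append({"role": "assistant", "content": reply})
      return reply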

Request Lifecycle

A typical inference request follows this path:
  1. Your application sends a request to the Suiri API
  2. Suiri authenticates the request using your API key
  3. The global control plane selects the optimal region based on:
    a. Model availability
    b. Proximity and latency
    c. Capacity and health
  4. The request is executed on the selected edge inference cluster
  5. The model response is returned to your application
  6. Token usage is recorded for billing and observability
All routing decisions are handled automatically by Suiri.
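For reference, the same lifecycle can be exercised over raw HTTPS. The sketch below is illustrative only: the base URL is a placeholder, and bearer-token authentication in the Authorization header follows the common OpenAI convention rather than a documented Suiri requirement.

  # Raw HTTPS sketch of the lifecycle: send the request (step 1), authenticate
  # with the API key (step 2), receive the response (step 5), read usage (step 6).
  import requests

  resp = requests.post(
      "https://api.suiri.example/v1/chat/completions",  # placeholder base URL
      headers={"Authorization": "Bearer YOUR_SUIRI_API_KEY"},  # assumed auth convention
      json={
          "model": "suiri-model",  # placeholder model identifier
          "messages": [{"role": "user", "content": "Hello"}],
      },
      timeout=30,
  )
  resp.raise_for_status()
  body = resp.json()
  print(body["choices"][0]["message"]["content"])  # model response
  print(body["usage"])                             # token usage for billing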

Token Accounting

Suiri uses token-based billing, similar to OpenAI and other inference platforms:
  • Input tokens: tokens in your prompt and messages
  • Output tokens: tokens generated by the model
  • Pricing is specified per 1,000 tokens
  • Usage details are included in every API response
Token counts are deterministic and model-specific.
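A short sketch of reading the usage block and estimating cost from it; the per-1,000-token prices below are invented placeholders, and the response object is assumed to come from the SDK example above.

  # Estimate request cost from the usage block. Prices are placeholders --
  # use the rates published for your model and plan.
  INPUT_PRICE_PER_1K = 0.0005   # placeholder USD per 1,000 input tokens
  OUTPUT_PRICE_PER_1K = 0.0015  # placeholder USD per 1,000 output tokens

  usage = response.usage  # included in every API response
  input_cost = (usage.prompt_tokens / 1000) * INPUT_PRICE_PER_1K
  output_cost = (usage.completion_tokens / 1000) * OUTPUT_PRICE_PER_1K
  print(f"{usage.prompt_tokens} in / {usage.completion_tokens} out "
        f"-> ${input_cost + output_cost:.6f}")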

When to Use Suiri

Suiri is optimized for:
  • Latency-sensitive inference workloads
  • Stateless AI services and APIs
  • Mobile, web, and edge applications
  • Cost-conscious production deployments
Suiri is not designed for:
  • Long-lived conversational state
  • Model training or fine-tuning
  • Data storage or prompt retention