Deployments & Routing
Learn how to group whitelisted models into virtual endpoints and configure load-balancing policies, failovers, and caching.
Provider
↓
Global LLM Collection
↓
Workspace LLM Collection
↓
Direct Model | Deployment ← You are here
↓
Virtual API Key
↓
Gateway Request
A Deployment is the final, client-facing tier of Infralo's gateway. It serves as a stable, virtual API endpoint that client applications call.
Instead of hardcoding a specific provider model (like gpt-4o) or managing raw API credentials inside your code, your application calls your virtual Infralo deployment (e.g., production-chat-gateway). Infralo then resolves this request and routes it to one or more of your workspace's whitelisted models based on your configured load-balancing, failover, and caching policies.
Gateway Access Paths
When sending requests to Infralo's OpenAI-compatible gateway, you can target models in two ways depending on your needs:
1. Direct Access (via Model Alias)
Use the whitelisted model's registered Alias Name as the model parameter in your API request.
- Behavior: Requests bypass deployment-level load-balancing rules, circuit breakers, and runtime modules. They are routed directly to the single specified LLM.
- Use case: One-off scripts, model-specific testing, or tasks requiring a dedicated target.
- Example request:
    response = client.chat.completions.create(
        model="gpt-4o-prod",  # The Model Alias Name
        messages=[{"role": "user", "content": "Direct connection test"}],
    )
2. Virtual Access (via Deployment Name)
Use the virtual Deployment Name as the model parameter.
- Behavior: The request passes through Infralo's virtualization layer, where load balancing, circuit breaking, failover rules, caching, and Runtime Modules (PRE and POST stages) are automatically applied.
- Use case: Production environments requiring high availability, cost management, data compliance, or prompt styling.
- Example request:
    response = client.chat.completions.create(
        model="production-chat-gateway",  # The Virtual Deployment Name
        messages=[{"role": "user", "content": "Production connection"}],
    )
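Because the two access paths differ only in the value of the model parameter, one request builder can serve both. The sketch below builds (but does not send) an OpenAI-style chat completion request using only the Python standard library. The base URL, the /chat/completions path, and the bearer-token header are assumptions that follow from the gateway being OpenAI-compatible; they are not values confirmed by Infralo, so substitute your own.

```python
import json
import urllib.request

# Hypothetical gateway URL; replace with the base URL of your own Infralo gateway.
GATEWAY_BASE_URL = "https://gateway.example.com/v1"


def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,  # either a Model Alias Name or a Deployment Name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{GATEWAY_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Direct access: the model field carries the alias of one whitelisted model.
direct = build_chat_request("gpt-4o-prod", "Direct connection test", "YOUR_VIRTUAL_API_KEY")
# Virtual access: the model field carries the deployment name.
virtual = build_chat_request("production-chat-gateway", "Production connection", "YOUR_VIRTUAL_API_KEY")
```

Both requests target the same URL; only the model field decides whether deployment-level policies apply.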
Load Balancer (LB) Presets
Each deployment is configured with a load-balancing mode that dictates how incoming API requests are distributed. Infralo provides three built-in presets and a fully custom option:
| Mode | Routing Strategy | Target Goal | Typical Use Case |
|---|---|---|---|
| Balanced | Least-Busy / Round-Robin | Even distribution across healthy endpoints. | Standard production workloads that spread load to stay under per-model rate limits. |
| Performance | Latency-Optimized | Routes requests to the lowest-latency model. | Real-time chat interfaces, autocomplete, and time-sensitive tasks. |
| Cost Saving | Price-Optimized | Prefers the cheapest model with larger buffers. | Background batch processing, summarization, and cost-constrained tasks. |
| Custom | Fully Manual | Full control over retry, rate limit, and circuit breaker parameters. | Complex enterprise deployments requiring custom failover tolerances. |
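To make the three preset strategies concrete, here is a minimal sketch of how a router might choose a target under each mode. The Target fields, the mode strings, and the pick_target function are hypothetical names introduced for illustration; Infralo's actual selection logic is not documented here and may weigh several signals together.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    active_requests: int       # in-flight requests right now
    ema_latency_ms: float      # smoothed latency estimate
    cost_per_1k_tokens: float  # hypothetical price field
    healthy: bool = True


def pick_target(targets: list[Target], mode: str) -> Target:
    """Pick one healthy target according to a simplified preset strategy."""
    healthy = [t for t in targets if t.healthy]
    if not healthy:
        raise RuntimeError("no healthy targets available")
    if mode == "balanced":
        # Least-busy: fewest in-flight requests.
        return min(healthy, key=lambda t: t.active_requests)
    if mode == "performance":
        # Latency-optimized: lowest smoothed latency.
        return min(healthy, key=lambda t: t.ema_latency_ms)
    if mode == "cost_saving":
        # Price-optimized: cheapest model.
        return min(healthy, key=lambda t: t.cost_per_1k_tokens)
    raise ValueError(f"unknown mode: {mode}")


fleet = [
    Target("model-a", active_requests=4, ema_latency_ms=220.0, cost_per_1k_tokens=0.50),
    Target("model-b", active_requests=1, ema_latency_ms=480.0, cost_per_1k_tokens=0.20),
    Target("model-c", active_requests=9, ema_latency_ms=150.0, cost_per_1k_tokens=0.90),
]
```

With this sample fleet, each mode selects a different target, which is the practical difference between the presets.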
Advanced Configurations
For custom deployments (or to understand the underlying configurations of the presets), Infralo exposes several engine-level parameters:
1. Retry Logic
Configures automated retries when downstream model providers return transient errors.
- Max Retries: The number of times the gateway will attempt to resend a failed request before giving up.
- Failover Backoff: The delay interval between retries to prevent overwhelming a recovering provider.
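The interplay of the two settings can be sketched as a small retry loop. This is an illustration under stated assumptions, not Infralo's implementation: TransientError and call_with_retries are hypothetical names, and the delay is treated as a fixed interval because the text above describes a single backoff value.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable provider failure (timeout, 5xx, rate limit)."""


def call_with_retries(
    send: Callable[[], T],
    max_retries: int,
    backoff_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(); on a transient error, wait and retry up to max_retries times."""
    retries_used = 0
    while True:
        try:
            return send()
        except TransientError:
            if retries_used >= max_retries:
                raise  # retries exhausted: surface the last error
            retries_used += 1
            sleep(backoff_seconds)  # give the provider time to recover
```

With max_retries set to 2, a request is attempted at most three times in total: the original attempt plus two retries.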
2. Circuit Breaker
Prevents degraded or overloaded model endpoints from choking the gateway.
- Failure Threshold: The number of consecutive failed requests (such as timeouts, 5xx errors, or rate limits) before Infralo "opens" the circuit breaker.
- Cooldown Duration: The amount of time (in seconds) the circuit remains open. While open, Infralo halts all traffic to the failing model and routes it to backup targets, letting the provider recover.
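The threshold and cooldown behavior described above can be sketched as a small state machine. The class and method names are hypothetical, and the sketch simply closes the circuit once the cooldown has elapsed; it does not model a half-open probing state, which Infralo may or may not use.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures and stays open for a cooldown period."""

    def __init__(
        self,
        failure_threshold: int,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allows_traffic(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the circuit and start counting afresh.
            self._opened_at = None
            self._consecutive_failures = 0
            return True
        return False  # still open: the caller should route to a backup target

    def record_success(self) -> None:
        self._consecutive_failures = 0  # a success breaks the failure streak

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

The clock is injectable so the cooldown can be exercised in tests without waiting in real time.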
3. Rate Limit Buffers
Configures a safety margin below the provider's official rate limits to avoid triggering HTTP 429 (Too Many Requests) errors.
- RPM Buffer: The share of the provider's requests-per-minute limit to hold back, expressed as a fraction (e.g., 0.10 for 10%).
- TPM Buffer: The share of the provider's tokens-per-minute limit to hold back, expressed the same way.
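As a worked example of the arithmetic, the effective ceiling is the provider limit scaled by one minus the buffer. The function name and the provider limits used here (500 RPM, 200,000 TPM) are made-up illustration values, and rounding down is an assumption chosen to stay on the safe side of the limit.

```python
import math


def effective_limit(provider_limit: int, buffer: float) -> int:
    """Return the usable limit after holding back a fraction as a safety margin."""
    if not 0.0 <= buffer < 1.0:
        raise ValueError("buffer must be in the range [0, 1)")
    # Round down so the gateway never plans for more than the buffered limit.
    return math.floor(provider_limit * (1.0 - buffer))


rpm_cap = effective_limit(500, 0.10)      # 500 RPM limit with a 10% buffer -> 450
tpm_cap = effective_limit(200_000, 0.25)  # 200,000 TPM limit with a 25% buffer -> 150,000
```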
4. Latency Tracking
Configures how Infralo calculates provider performance to route around slow instances.
- EMA Alpha (α): The Exponential Moving Average coefficient (between 0 and 1). A higher alpha places more weight on recent request times, allowing the routing engine to react quickly to sudden provider slowdowns.
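The standard EMA update rule shows why a higher alpha reacts faster: each new estimate is alpha times the latest sample plus (1 - alpha) times the previous estimate. The function below is a generic sketch of that formula; seeding the average with the first sample is an assumption, not documented Infralo behavior.

```python
from typing import Optional


def update_ema(previous_ms: Optional[float], sample_ms: float, alpha: float) -> float:
    """Blend a new latency sample into the running exponential moving average."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if previous_ms is None:
        return sample_ms  # assumption: the first sample seeds the average
    return alpha * sample_ms + (1.0 - alpha) * previous_ms


# A provider that averaged 100 ms suddenly answers in 200 ms.
slow_to_react = update_ema(100.0, 200.0, alpha=0.25)  # 125.0
quick_to_react = update_ema(100.0, 200.0, alpha=0.75)  # 175.0
```

After a single slow response, the high-alpha estimate has already moved three quarters of the way to the new latency, while the low-alpha estimate has moved only one quarter.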
5. Response Cache
A built-in caching layer to store and reuse response payloads for identical prompt inputs.
- Enable Cache: Toggles key-value caching.
- Time-to-Live (TTL): The duration (in seconds) that cached responses remain valid. A cache hit incurs no token cost because no provider call is made, and it can return with sub-millisecond latency.
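A TTL-based key-value cache of this kind can be sketched in a few lines. The class below is illustrative only: the name ResponseCache, the in-memory dictionary, and the choice to derive the key from a hash of the model name and messages are assumptions about what "identical prompt inputs" means, not a description of Infralo's cache.

```python
import hashlib
import json
import time
from typing import Any, Callable, Optional


class ResponseCache:
    """In-memory key-value cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # cache key -> (stored_at, response)

    @staticmethod
    def make_key(model: str, messages: list) -> str:
        # Assumption: inputs are identical when model and messages match exactly.
        canonical = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None  # miss: the caller forwards the request to a provider
        stored_at, response = entry
        if self._clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired entries are treated as misses
            return None
        return response

    def put(self, key: str, response: Any) -> None:
        self._entries[key] = (self._clock(), response)
```

Serializing with sorted keys keeps the cache key stable even when the same message is sent with its fields in a different order.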
Deployment Overview & Live Monitoring
When you navigate to a specific deployment in your workspace, you are presented with the Deployment Overview dashboard. This page serves as a live control center, displaying real-time operational telemetry and an interactive lineage graph of your traffic distribution.
1. Live Runtime Status
When a deployment is active, the dashboard automatically polls the gateway for current performance telemetry:
- Auto-Refresh Interval: By default, the dashboard polls every 30 seconds. You can adjust this rate via the refresh dropdown to 10s, 15s, 30s, 1m, or 2m, or disable auto-refresh entirely. Click the manual Refresh button at any time to pull the latest state instantly.
- Disabled Banner: If a deployment is disabled, a status banner will alert you that traffic routing is suspended. Live polling and the lineage graph are deactivated until the deployment is re-enabled in settings.
2. Real-Time Telemetry Cards
The dashboard displays four aggregate live metric cards representing the health and request status of the deployment fleet:
- Deployment Health: Indicates overall status (Healthy, Degraded, or Unhealthy) based on the status of downstream target models.
- Active Requests: The total number of concurrent, in-flight API requests currently being processed by the target models.
- Avg Latency: The rolling Exponential Moving Average (EMA) latency across all functional target models in the deployment.
- Shared Consumers: The number of other virtual deployments that share this same fleet of target models.
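To illustrate how such aggregate cards could be derived from per-target state, the sketch below rolls up a fleet into a health label, a total of in-flight requests, and an average latency over functional targets. The exact rule Infralo uses for Deployment Health is not stated here, so the mapping (all circuits closed is Healthy, none closed is Unhealthy, anything else is Degraded) is purely an assumption, as are the TargetState and summarize names.

```python
from dataclasses import dataclass


@dataclass
class TargetState:
    name: str
    circuit_open: bool      # True when the target's circuit breaker has tripped
    active_requests: int    # in-flight requests on this target
    ema_latency_ms: float   # this target's smoothed latency


def summarize(targets: list[TargetState]) -> dict:
    """Aggregate per-target state into dashboard-style summary values."""
    functional = [t for t in targets if not t.circuit_open]
    if not functional:
        health = "Unhealthy"  # assumed rule: no target can take traffic
    elif len(functional) == len(targets):
        health = "Healthy"    # assumed rule: every target is routing normally
    else:
        health = "Degraded"   # assumed rule: some, but not all, targets are down
    avg_latency = (
        sum(t.ema_latency_ms for t in functional) / len(functional)
        if functional
        else None
    )
    return {
        "health": health,
        "active_requests": sum(t.active_requests for t in targets),
        "avg_latency_ms": avg_latency,
    }


summary = summarize([
    TargetState("model-a", circuit_open=False, active_requests=3, ema_latency_ms=200.0),
    TargetState("model-b", circuit_open=True, active_requests=0, ema_latency_ms=900.0),
])
```

Note that the tripped target's latency is excluded from the average, matching the description of Avg Latency as covering functional targets only.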
3. Interactive LLM Lineage Graph
The dashboard displays an interactive, node-based flowchart mapping out how incoming requests flow through the virtualization layer to downstream model providers:
- Lineage Nodes: Represent the connections from the virtual deployment to its individual target models. Each target model node displays its provider logo (e.g., OpenAI, Google Gemini, Anthropic), name, and live performance metrics.
- Active Routing Weights: Connection lines display the load-balancing percentage currently routed to each model.
- Circuit Breaker Status: Badges on target model nodes indicate if the model is routing normally (status Closed) or if its circuit breaker has tripped (status Open) due to failures.
- Active Request Tracking: Individual model nodes display their local count of concurrent requests and average latency.