Global Observability

Platform-wide visibility, cross-workspace cost tracking, and provider analytics for Infralo administrators.

Global Observability provides centralized monitoring and system-wide telemetry across the entire Infralo platform. It is designed for platform administrators, DevOps engineers, and operations teams who need to oversee resource utilization, verify provider health, and manage multi-tenant budgets.


Access & Permissions

Global observability tools require system-wide permissions (see Roles & Permissions) to prevent workspace members from viewing sensitive cross-workspace data.

  • Required Permissions:
    • View Global Logs: Grants access to the global logs console (/observability/logs).
    • View Global Metrics: Grants access to the global metrics dashboard (/observability/metrics).

Global Logs

The Global Logs console aggregates requests from all workspaces and deployments. It provides an operational log stream for system-level monitoring.

  • Platform Auditing: Search and filter requests across the entire gateway by timestamp, workspace, status, or model provider.
  • Troubleshooting Outages: Track error spikes across multiple workspaces to distinguish provider outages (e.g., Anthropic returning HTTP 503) from workspace-specific configuration issues.
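
The outage-versus-configuration distinction above can be sketched as a simple heuristic: errors from one provider that span several workspaces suggest a provider-side problem, while errors confined to a single workspace suggest a local configuration issue. The record fields (workspace, provider, status) and the min_workspaces threshold are assumptions made for illustration, not Infralo's actual log schema or detection logic.

```python
from collections import defaultdict

def classify_error_spike(records, min_workspaces=2):
    """Label each provider's errors by how many workspaces they affect.

    `records` is a list of dicts with hypothetical keys:
    "workspace", "provider", and "status" (HTTP status code).
    """
    affected = defaultdict(set)
    for rec in records:
        if rec["status"] >= 400:  # count 4xx and 5xx responses as errors
            affected[rec["provider"]].add(rec["workspace"])
    return {
        provider: "provider-wide" if len(workspaces) >= min_workspaces
        else "workspace-specific"
        for provider, workspaces in affected.items()
    }

# Hypothetical sample: Anthropic fails in two workspaces, OpenAI in one.
sample_logs = [
    {"workspace": "prod", "provider": "anthropic", "status": 503},
    {"workspace": "dev", "provider": "anthropic", "status": 503},
    {"workspace": "dev", "provider": "openai", "status": 401},
    {"workspace": "prod", "provider": "openai", "status": 200},
]
spike_labels = classify_error_spike(sample_logs)
print(spike_labels)
```

A real investigation would also weigh error volume and time window; the threshold here only illustrates the cross-workspace comparison.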

Global Metrics & Dashboards

The global metrics dashboard compiles platform-wide analytics into four functional tabs:

1. Overview

Aggregates high-level system indicators:

  • KPI Cards: Track total requests, total cost in USD, average latency, total tokens, and cache hit rate across all workspaces.
  • Trend Charts: Visualize Request Volume, Cost, Latency, and Token Usage patterns over time.
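
To make the KPI definitions concrete, the following minimal sketch computes the five card values from a list of request records. The field names (cost_usd, latency_ms, tokens, cache_hit) are hypothetical, and the dashboard's own calculations may differ, for example in how cached requests count toward latency.

```python
def summarize_kpis(records):
    """Compute overview KPIs from request records (hypothetical fields)."""
    total = len(records)
    if total == 0:
        return {
            "total_requests": 0,
            "total_cost_usd": 0.0,
            "avg_latency_ms": 0.0,
            "total_tokens": 0,
            "cache_hit_rate": 0.0,
        }
    return {
        "total_requests": total,
        "total_cost_usd": sum(r["cost_usd"] for r in records),
        "avg_latency_ms": sum(r["latency_ms"] for r in records) / total,
        "total_tokens": sum(r["tokens"] for r in records),
        # Fraction of requests served from cache.
        "cache_hit_rate": sum(1 for r in records if r["cache_hit"]) / total,
    }

# Hypothetical sample records.
sample_requests = [
    {"cost_usd": 0.02, "latency_ms": 100, "tokens": 500, "cache_hit": False},
    {"cost_usd": 0.01, "latency_ms": 200, "tokens": 300, "cache_hit": False},
    {"cost_usd": 0.00, "latency_ms": 20, "tokens": 300, "cache_hit": True},
    {"cost_usd": 0.03, "latency_ms": 280, "tokens": 900, "cache_hit": False},
]
kpis = summarize_kpis(sample_requests)
print(kpis)
```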

2. Workspaces

Provides tenant-specific usage comparisons:

  • Compare total request volumes, cumulative costs, and cache hit ratios/savings across all workspaces (e.g., comparing development environments against production deployments).
  • Identify high-usage tenants or sudden budget spikes.
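
One way to reason about "sudden budget spikes" is to compare each workspace's latest daily cost with the average of its preceding days. This is only a sketch: the input shape (a list of daily costs per workspace) and the 2x threshold are assumptions, not a documented Infralo alerting rule.

```python
def find_cost_spikes(daily_costs, factor=2.0):
    """Return workspaces whose latest daily cost exceeds `factor` times
    the average of their earlier days, mapped to the observed ratio.

    `daily_costs` maps a workspace name to a list of daily costs in USD,
    oldest first (a hypothetical export format).
    """
    spikes = {}
    for workspace, costs in daily_costs.items():
        if len(costs) < 2:
            continue  # no history to compare against
        *history, latest = costs
        baseline = sum(history) / len(history)
        if baseline > 0 and latest > factor * baseline:
            spikes[workspace] = latest / baseline
    return spikes

# Hypothetical sample: production jumps from about 11 USD/day to 30 USD.
sample_costs = {
    "production": [10.0, 12.0, 11.0, 30.0],
    "development": [2.0, 2.0, 2.0, 2.5],
}
spikes = find_cost_spikes(sample_costs)
print(spikes)
```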

3. Models

Tracks vendor-level load and cost distributions:

  • Provider & Model Share: Visualizes request volume and cost distribution across OpenAI, Anthropic, Google, Azure, or Custom endpoints.
  • Model Efficiency: Lists request count, token volume, average latency, and costs grouped by model name.
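
Provider share can be expressed either by request count or by cost. The sketch below computes both from the same records; the provider and cost_usd field names are illustrative assumptions rather than a documented schema.

```python
from collections import defaultdict

def share_by(records, key, weight=None):
    """Return each group's fraction of the total.

    Groups records by `key`. If `weight` is given, sums that field;
    otherwise counts requests.
    """
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[weight] if weight else 1.0
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: value / grand_total for name, value in totals.items()}

# Hypothetical sample records.
sample_usage = [
    {"provider": "openai", "cost_usd": 0.01},
    {"provider": "openai", "cost_usd": 0.01},
    {"provider": "anthropic", "cost_usd": 0.06},
    {"provider": "google", "cost_usd": 0.02},
]
request_share = share_by(sample_usage, "provider")
cost_share = share_by(sample_usage, "provider", weight="cost_usd")
print(request_share)
print(cost_share)
```

In this sample one provider handles half of the requests but only about a fifth of the cost, which is the kind of gap the two views are meant to expose.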

4. Reliability

Monitors the overall health and error rates of the gateway:

  • Tracks the overall error rate and the distribution of status codes (2xx, 4xx, 5xx).
  • Monitors error types and platform-wide retry activity to verify how failover rules perform under load.
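
As an illustration of the status code breakdown, the sketch below buckets HTTP status codes into classes and derives an error rate. Treating every 4xx and 5xx response as an error is an assumption; the dashboard may define the error rate differently.

```python
from collections import Counter

def status_distribution(status_codes):
    """Bucket status codes into classes such as "2xx" and compute the
    fraction of responses in the 4xx or 5xx classes."""
    classes = Counter(f"{code // 100}xx" for code in status_codes)
    total = len(status_codes)
    errors = classes["4xx"] + classes["5xx"]  # Counter returns 0 if missing
    error_rate = errors / total if total else 0.0
    return dict(classes), error_rate

# Hypothetical sample of gateway responses.
sample_statuses = [200, 200, 200, 429, 503, 200, 404, 200]
distribution, error_rate = status_distribution(sample_statuses)
print(distribution, error_rate)
```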

Typical Use Cases

  • Cost Allocation & Showback: Retrieve cost-per-workspace metrics to allocate AI spend back to specific teams or projects.
  • Canary & Routing Optimization: Audit provider latency and error trends to refine deployment routing weights or switch default providers.
  • SLA Auditing: Verify that Azure OpenAI or private custom endpoints meet their performance agreements, and compare their results against public endpoints.
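
An SLA audit of this kind often comes down to comparing a latency percentile per endpoint against an agreed target. The sketch below uses a nearest-rank p95; the choice of p95, the 500 ms target, and the endpoint names are assumptions for illustration, since actual agreements vary.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]

def audit_sla(latencies_by_endpoint, target_ms, pct=95):
    """Report each endpoint's percentile latency and whether it is
    within the target."""
    report = {}
    for endpoint, latencies in latencies_by_endpoint.items():
        observed = percentile(latencies, pct)
        report[endpoint] = {
            "p95_ms": observed,
            "meets_sla": observed <= target_ms,
        }
    return report

# Hypothetical latency samples in milliseconds.
sample_latencies = {
    "azure-openai": [120, 130, 110, 150, 900],
    "custom": [200, 210, 190, 220, 230],
}
sla_report = audit_sla(sample_latencies, target_ms=500)
print(sla_report)
```

With only a handful of samples a single slow request dominates the p95, so a real audit would use a much larger window.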
