Abhinav Yadav

Connecting Custom AI Agents to OpenWebUI: Auth, Latency, and API Design

January 24, 2026
AIOpenWebUIAPI DesignAuthenticationFastAPIIntegration
Connecting Custom AI Agents to OpenWebUI: Auth, Latency, and API Design

Why OpenWebUI as the Interface Layer

OpenWebUI is an open-source, self-hostable chat interface that supports the OpenAI API spec. That compatibility means any system that implements
— including your custom agent — can be plugged into OpenWebUI as a backend model.
For VirtuAI, this was the integration path: deploy the agent as a FastAPI service exposing an OpenAI-compatible API, register it in OpenWebUI as a custom model, and users get a polished chat interface without building one from scratch.

The OpenAI-Compatible API Contract

Your agent needs to implement two endpoints:
The chat completions request shape:

Implementing the Endpoint with FastAPI

Streaming: The UX Non-Negotiable

Non-streaming responses for agentic tasks create terrible UX — the user sees nothing for 10–30 seconds while the agent runs tool calls.
OpenAI's streaming format uses Server-Sent Events with
lines:
For agentic workflows where tool calls happen before generation, streamstatus updates during the tool execution phase: "Querying VM fleet...", "Analysing utilisation metrics...". This keeps the user informed and dramatically improves perceived performance.

Authentication Architecture

OpenWebUI → custom agent authentication has two layers:
Layer 1: OpenWebUI to Agent API
OpenWebUI passes a bearer token in the
header. Your agent verifies this:
Layer 2: Agent to Azure Resources
The agent itself needs to authenticate to Azure APIs. Use Managed Identity when deployed on Azure, never hardcoded credentials:

API Latency Management

Agentic requests are inherently slow — multiple LLM calls, multiple Azure API calls. The strategies we use:
1. Parallel tool execution — when tool calls have no dependencies, run them concurrently:
2. Response caching — Azure resource queries are cached for 5 minutes:
3. Timeout budgets — each tool call has a hard timeout:
The result: median end-to-end latency of 4.2 seconds for simple queries, 12–18 seconds for complex multi-tool workflows — all streamed so the user sees progress the entire time.