Getting started with Bifrost: why it's worth knowing about
An AI gateway can look a lot like a proxy with a nicer name. And for simple setups, that is more or less what it is. But in anything beyond a single-model prototype, the gateway becomes the layer where routing logic, access control and reliability actually live.
I replaced a previous LiteLLM setup with Bifrost after running into configuration friction and routing limitations that became hard to work around at scale. This post covers what Bifrost is, how to get it running locally with a config file and the features that make it worth understanding: routing, reliability and multi-tenant governance.
What AI gateways are good for #
If you only call one model from one provider, a gateway adds a layer you probably do not need. An SDK client is fine and there is little logic to abstract away.
That perspective changes with scale. Switching between providers without rewriting application code, handling automatic failover during provider outages, enforcing different rate limits per customer, tracking usage across teams. None of this is addressed well at the application level and it tends to accumulate as scattered logic across services if you add it to your system later on.
An AI gateway provides a centralized control plane between your application code and the underlying model providers. You give up some simplicity in exchange for policy enforcement, routing control and observability in one location.
What Bifrost is #
Bifrost is an open-source AI gateway built by Maxim. It exposes a single OpenAI-compatible API in front of 20+ model providers and handles routing, governance and reliability between them. The official docs walk through the full feature surface.
The part that made it compelling to evaluate is that the open-source tier includes features that other gateways, including LiteLLM, move to paid tiers.
Running Bifrost locally #
The easiest local setup is via Docker or npx. With Docker, it works as follows:
docker run -p 8080:8080 maximhq/bifrost
Once it is running, the gateway is available at http://localhost:8080, along with a web UI for configuration and request logs.
Bifrost starts without any config and accepts requests immediately, but a provider still needs to be registered before those requests will succeed. The quickest path without a config file is to add a provider key through the web UI, then send your first request using the provider/model format:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-5-mini",
"messages": [{"role": "user", "content": "Hello from Bifrost"}]
}'
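The same request can be made from Python with nothing but the standard library. A minimal sketch, assuming the gateway is running locally with a provider key already registered:

```python
import json
import urllib.request

BIFROST_URL = "http://localhost:8080/v1/chat/completions"

def build_payload(model: str, prompt: str) -> dict:
    # Bifrost resolves the provider from the "provider/model" prefix.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str) -> str:
    """POST a chat completion to the local gateway and return the reply text."""
    req = urllib.request.Request(
        BIFROST_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the endpoint is OpenAI-compatible, any OpenAI SDK client should also work by pointing its base URL at http://localhost:8080/v1 instead of the provider's API.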
A practical config.json #
Bifrost supports two configuration modes: through the UI, or file-based via a config.json. The file-based mode is read at startup and is a clean way to version and share a reproducible setup. Both the config and the logs can be stored in SQLite or Postgres.
To use it, mount the directory containing config.json when starting the container:
# Docker with config file
docker run -p 8080:8080 \
-v $(pwd)/data:/app/data \
maximhq/bifrost
Bifrost looks for config.json in its app directory (/app/data in the container, mounted from ./data above). Here is a minimal configuration for two providers:
config.json
{
"client": {
"drop_excess_requests": false
},
"providers": {
"openai": {
"keys": [
{
"name": "openai-primary",
"value": "env.OPENAI_API_KEY",
"models": ["gpt-5.4", "gpt-5.1-codex-mini"],
"weight": 0.8
}
]
},
"anthropic": {
"keys": [
{
"name": "anthropic-primary",
"value": "env.ANTHROPIC_API_KEY",
"models": ["claude-4-6-sonnet", "claude-4-5-haiku"],
"weight": 0.2
}
]
}
},
"config_store": {
"enabled": true,
"type": "sqlite",
"config": {
"path": "./config.db"
}
}
}
The config_store block enables a persistent configuration store backed by a SQLite database at ./config.db. When enabled, Bifrost saves provider configurations (keys, settings, etc.) to this database, so changes made via the web UI or API persist across restarts. Without it, configuration only lives in memory or in config.json.
With this in place and providers registered, the gateway is ready for routing and governance configuration through either the API, UI or additional entries in the config file.
Routing #
The problem: You want a single entry point for multiple providers and you do not want to push routing decisions into your application code.
Bifrost decouples the model choice from your business logic in two ways.
The first is at the virtual key level. Each virtual key carries a provider_configs list that defines which providers and models it can route to, along with weights. A key configured with 80% OpenAI and 20% Anthropic will split traffic accordingly and if a provider becomes unavailable, requests fall back to the others automatically.
The second is CEL (Common Expression Language) routing rules. These evaluate conditions at request time (based on request headers, budget usage, rate limit percentages or other runtime state) and can override the provider or model dynamically. A practical example: if your application passes an x-region: EU header, a routing rule evaluating request.headers["x-region"] == "EU" can redirect that request to a model endpoint like Mistral’s EU deployment. Data residency is enforced at the gateway, not scattered across code.
This pattern is particularly useful when running agents with frameworks like pydantic-ai, where the agent’s model selection can be left entirely outside the application code. The agent passes a header indicating what kind of model it needs, a reasoning model, a fast model, a regionally constrained model or a combination of these and the gateway resolves the actual provider and endpoint. Swapping providers or upgrading models in response to changing costs, reliability or availability becomes a configuration change, not a code change.
Reliability #
The problem: Providers (more often than they should) experience downtime and rate limits and rewriting retry logic per microservice is unsustainable.
Bifrost handles failover natively. If a primary provider returns an error or exhausts its rate limit, the gateway intercepts the failure and routes the request to the next configured provider. This is part of the weighted routing system — fallbacks are implicit when you configure multiple providers on a virtual key.
For anyone who has written provider-specific try/except blocks to handle 429 Too Many Requests across different SDK interfaces, moving this to the gateway is a meaningful simplification. The error handling exists once, in one place and applies uniformly. And the visibility in the UI and logs is a huge bonus.
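Conceptually, weighted selection with implicit fallback behaves like the following sketch (illustrative only, not Bifrost's actual implementation):

```python
import random

def pick_provider(configs, unavailable=frozenset()):
    """Pick a provider by weight, skipping any that are currently failing.

    configs: list of (provider_name, weight) pairs, e.g. from a virtual
    key's provider_configs. Weights are renormalized over the remaining
    live providers, which is what makes fallback implicit."""
    live = [(name, weight) for name, weight in configs if name not in unavailable]
    if not live:
        raise RuntimeError("no providers available")
    names, weights = zip(*live)
    return random.choices(names, weights=weights, k=1)[0]
```

With an 80/20 OpenAI/Anthropic split, marking OpenAI unavailable sends all traffic to Anthropic without any application code changing.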
Customers and virtual keys #
The problem: You need to enforce rate limits and budgets separately per team or external customer, without sharing a single API key across everything.
This is where Bifrost moves from useful to foundational. The governance model is hierarchical, with budget tracked independently at each level:
Customer
└── Team (optional)
└── Virtual Key
└── Provider Config
Each level has its own budget. When a request arrives, Bifrost checks every applicable level in order (Provider Config, Virtual Key, Team, Customer) and the request is blocked if any single level is exceeded. The same cost is then deducted from all of them. This means budgets are not a ceiling at one layer; they compound through the chain.
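The compounding behavior is easy to model. A sketch of the check-then-deduct logic described above (illustrative only, not Bifrost's code):

```python
class Budget:
    def __init__(self, name: str, max_limit: float):
        self.name, self.max_limit, self.spent = name, max_limit, 0.0

def charge(chain: list, cost: float) -> bool:
    """chain is ordered: provider config -> virtual key -> team -> customer.

    Returns True and deducts cost at every level, or False without
    deducting anything if any single level would be exceeded."""
    # Phase 1: every level must have headroom, or the request is blocked.
    if any(level.spent + cost > level.max_limit for level in chain):
        return False
    # Phase 2: the same cost is deducted at every level.
    for level in chain:
        level.spent += cost
    return True
```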
Rate limits work differently: they only exist at the Virtual Key and Provider Config levels. Customers and teams have budgets only — no rate limits.
Virtual keys (prefixed sk-bf-*) are the central access credential. Your application passes one via the x-bf-vk header rather than a provider API key directly. Each virtual key has its own budget and optional rate limits, and can be attached to either a customer directly or to a team within a customer.
Provider configs are the routing rules embedded inside a virtual key. Each entry in a VK’s provider_configs array specifies a provider name, a weight for load balancing, the models that key is allowed to call, and optionally its own per-provider budget and rate limits. As we’ve established above, a single VK can have multiple provider configs (for example, 80% OpenAI and 20% Anthropic) and Bifrost uses the weights to distribute traffic and fall back automatically if one provider is unavailable.
Teams are an optional intermediate layer between customers and virtual keys, useful when one customer has multiple internal groups with separate spend limits.
Customers are the top-level organizational entity. Think of a company, a tenant, an external client. They hold a budget that accumulates costs from all their virtual keys. A customer can have multiple virtual keys attached, which is useful when you need to rotate or revoke access without resetting the customer’s billing history.
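From the tenant's side, the only credential in play is the virtual key, sent via the x-bf-vk header. A minimal sketch:

```python
def tenant_headers(virtual_key: str) -> dict:
    """Headers for a tenant request: the x-bf-vk virtual key stands in
    for any provider API key, which never leaves the gateway."""
    if not virtual_key.startswith("sk-bf-"):
        raise ValueError("expected a Bifrost virtual key (sk-bf-*)")
    return {"Content-Type": "application/json", "x-bf-vk": virtual_key}
```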
Here is how to create a customer via the API:
curl -X POST \
http://localhost:8080/api/governance/customers \
-H "Content-Type: application/json" \
-d '{
"name": "Customer Name",
"budget": {
"max_limit": 100.00,
"reset_duration": "1M"
}
}'
And a virtual key attached to that customer:
curl -X POST \
http://localhost:8080/api/governance/virtual-keys \
-H "Content-Type: application/json" \
-d '{
"name": "Customer Name - Production",
"customer_id": "customer-name",
"provider_configs": [
{
"provider": "openai",
"weight": 1.0,
"allowed_models": ["gpt-5.4"]
}
],
"budget": {
"max_limit": 50.00,
"reset_duration": "1M"
},
"rate_limit": {
"request_max_limit": 60,
"request_reset_duration": "1m"
},
"is_active": true
}'
Any request made with this virtual key counts against both the key's own $50 budget and the customer's $100 ceiling. Rate limiting at 60 requests per minute is enforced at the key level. If you need to rotate the key, you create a new one and deactivate the old one, leaving the customer's budget tracking unaffected.
The distinction between the two matters specifically in the rotation case: a customer is a permanent entity in your system, while a virtual key is a revocable access credential.
A useful mental model for what belongs where: provider API keys and initial gateway configuration go in config.json and are loaded at startup; this is the static infrastructure layer. Customers, virtual keys and teams are runtime concerns managed through the API. When a new tenant is onboarded, you create a customer and issue a virtual key via HTTP; the virtual key references the providers already registered in your config. Tenants receive only the sk-bf-* key. LLM provider credentials never leave the gateway.
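That onboarding flow can be sketched as two payload builders mirroring the curl examples above. Field names are taken from those examples; deriving customer_id as a slug of the name is an assumption for illustration.

```python
def onboarding_payloads(customer_name: str, customer_budget: float,
                        vk_budget: float, rpm: int) -> tuple[dict, dict]:
    """Build the customer and virtual-key bodies for the governance API."""
    customer = {
        "name": customer_name,
        "budget": {"max_limit": customer_budget, "reset_duration": "1M"},
    }
    virtual_key = {
        "name": f"{customer_name} - Production",
        # Assumption: customer_id is a slug of the name, as in the example above.
        "customer_id": customer_name.lower().replace(" ", "-"),
        "provider_configs": [
            {"provider": "openai", "weight": 1.0, "allowed_models": ["gpt-5.4"]}
        ],
        "budget": {"max_limit": vk_budget, "reset_duration": "1M"},
        "rate_limit": {"request_max_limit": rpm, "request_reset_duration": "1m"},
        "is_active": True,
    }
    return customer, virtual_key
```

POSTing these two bodies to /api/governance/customers and /api/governance/virtual-keys is all a new tenant requires; the provider credentials they route to were registered once in config.json.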
Why I switched from LiteLLM #
LiteLLM is a solid choice for getting an OpenAI-compatible proxy running quickly and I used it for exactly that. The friction shows up at the edges.
Routing configuration: LiteLLM supports fallbacks and basic load balancing, but more complex strategies such as percentage-based traffic splits, per-key provider restrictions, CEL-based runtime conditions and more are difficult or impossible to implement. Bifrost’s routing model handled these cases without additional code.
Governance and enterprise features: LiteLLM has virtual keys and basic spend limits, but true hierarchical budget management and SSO are restricted to its paid enterprise tier. Production deployments also often require external dependencies like Redis for caching. Bifrost includes full hierarchical governance, semantic caching, real-time request logs and metrics natively in the open-source version. For self-hosted teams, that difference is practical.
Performance at scale: Bifrost is written in Go, which sidesteps the Python GIL and makes a measurable difference under concurrent load. Bifrost claims significantly higher throughput and lower P99 latency than LiteLLM at 5,000 RPS. Gateway overhead might seem negligible when a single LLM request already takes seconds to complete, but under production load and traffic spikes it is good to be on the safe side, especially since there is no real downside.
On the other hand, Bifrost does come with a few trade-offs. Its Go codebase makes it harder for Python-native AI teams to hack on or extend with custom middleware compared to LiteLLM. It also has a smaller community and supports fewer providers out of the box (~20 compared to LiteLLM's 100+). If you rely heavily on niche inference hosts, or depend on specific features of certain models or obscure model APIs, LiteLLM's broader coverage might be the path of least resistance.
Conclusion #
When your AI integration scales beyond a single model or tenant, the gateway becomes your actual infrastructure. The alternative is scattering failover logic, model routing rules and budget tracking across your application code, where it inevitably becomes fragile and unmanageable.
Bifrost is worth evaluating because it moves these concerns to a dedicated control plane. The routing layer handles the complexity: automatic failovers, dynamic rules based on request context and hierarchical governance that isolates your billing metrics from your access credentials.
The open-source feature set, including real-time observability, semantic caching and full governance, is broad enough that most self-hosted use cases do not require an enterprise license. By cleanly separating static infrastructure in a config.json from dynamic runtime configuration via the API, your setup remains reproducible and strictly under your control.