
From standard to fallback: the completions API in 2026 and what's next

11 mins

Until recently you could swap LLM providers with a config change. The completions API shape was so widely adopted that it felt like a standard. You wrote one client, set a base URL, picked a model name and routed traffic wherever pricing or latency made sense. Gateways like Bifrost and LiteLLM made the abstraction even cleaner. Same function call, same prompt, different provider, no rewrites.

That convenience is fading. OpenAI moved structured outputs from response_format to text.format inside its Responses API and marked the old JSON mode as legacy. Anthropic ships native tool use, extended thinking and computer use on its own API, with the Bedrock and Vertex deployments often lagging in feature parity by weeks or longer. Google added native grounding, file search and a media resolution setting to Gemini 3, a per-image/per-frame token budget for vision processing with no equivalent on the other providers. None of these are extensions of the completions schema. They are proprietary API surfaces that require you to build against that provider specifically.

Compatibility with the de facto standard was never an architectural choice. OpenAI shipped first, the tooling ecosystem grew around its API, and any provider that wanted to reach developers had to match it. For as long as base models were close substitutes, this pattern held. Now each provider’s key features are exactly the parts the completions schema cannot represent, so the latest developments ship on a different API surface. The completions API, dominant until now, is not going away, but it will quickly become the fallback you reach for when you have no other choice.

What changed #

OpenAI’s Responses API: structured outputs no longer live in response_format. They moved to a text.format block with strict: true, which makes the schema enforced rather than suggested. JSON mode still exists but is now legacy. Web search, file search, computer use and code interpreter are exposed as native server-side tools that the model orchestrates without round-tripping through your application code. This is a different programming model from chat completions. You hand the model a goal and a set of tools and it returns a final answer plus a chain of intermediate calls. You do not see, or need to handle, the per-turn loop.
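For concreteness, a structured-output request against the Responses API looks roughly like the sketch below. The model name follows the one used later in this post and the invoice schema is purely illustrative:

import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Structured output lives under text.format rather than response_format,
# and strict=True makes the schema enforced rather than suggested.
response = client.responses.create(
    model="gpt-5-4",  # placeholder, as used later in this post
    input="Extract the invoice number and total from the following text: ...",
    text={
        "format": {
            "type": "json_schema",
            "name": "invoice",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "total": {"type": "number"},
                },
                "required": ["invoice_number", "total"],
                "additionalProperties": False,
            },
        }
    },
)
print(response.output_text)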

Anthropic’s native primitives: tool use is no longer a JSON convention layered on top of chat completions. It is a first-class block type that the model emits inline, with strict input schemas and parallel calls in a single turn. Extended thinking gives you a separate token stream you can route to a different storage budget. Computer use lets the model take screenshots, click and type against a sandboxed desktop. Each of these features depends on the Messages API shape and does not survive translation to the OpenAI completions schema, which is why a gateway that flattens to chat completions silently drops them.
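A minimal sketch of those primitives on the Messages API, with an illustrative tool schema and a small thinking budget; the model name again follows this post’s later example:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder, as used later in this post
    max_tokens=2048,
    # Extended thinking: a separate, budgeted reasoning token stream.
    thinking={"type": "enabled", "budget_tokens": 1024},
    # Tool use as a first-class block type with an input schema.
    tools=[
        {
            "name": "run_tests",
            "description": "Run the project test suite and return failing tests.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        }
    ],
    messages=[{"role": "user", "content": "Fix the failing tests in ./src"}],
)

# The reply is a list of typed blocks: thinking, text and tool_use.
for block in response.content:
    print(block.type)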

Google’s Gemini 3: grounding ships as a native tool. You enable it on the request and the model consults Google Search, Maps, File Search or URL Context as part of its reasoning, then returns citations alongside the answer. Multimodal handling exposes a media_resolution knob that has no analog elsewhere, which trades token spend against fine-detail recognition for images and video frames. Deep Research and Deep Research Max, released in April 2026, fuse public web data with proprietary enterprise data through MCP, again as a native API surface rather than something a client library composes on top.
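A rough sketch with the google-genai SDK; the prompt is illustrative and the media_resolution value is just one of the enum options, which mainly matters once image or video parts are attached:

from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY / GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-1-pro",  # placeholder, as used later in this post
    contents="What changed in the EU AI Act implementing acts this quarter?",
    config=types.GenerateContentConfig(
        # Native grounding: the model consults Google Search and returns citations.
        tools=[types.Tool(google_search=types.GoogleSearch())],
        # Per-request token budget for each image / video frame.
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_LOW,
    ),
)

print(response.text)
# Citations and search queries behind grounded statements:
print(response.candidates[0].grounding_metadata)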

None of these are extensions of the chat completions schema. They are different products that happen to also accept text in and emit text out. The shared interface that used to make providers interchangeable now covers a fraction of what the providers actually offer.

Why it’s rational #

Until about a year ago the question for a model provider was "do we have the best base model?" Frontier capability gaps were wide enough that a provider could win developers on raw quality, and every release was a big leap forward. Now the gap has narrowed. On reasoning benchmarks, coding leaderboards, tool use and multilingual evaluations, the top providers are within a few points of each other, and the lead changes hands seemingly based on who fits the benchmark better.

Now that base quality has converged, providers compete on capabilities instead. OpenAI is betting on agents that orchestrate server-side tools without your application managing the loop. Anthropic is betting on long-running coding and computer-use agents with strict tool schemas and explicit reasoning budgets. Google is betting on grounded research and native multimodal pipelines. None of those workloads fits into a shared API: server-orchestrated tool loops, long-running coding agents and grounded multimodal pipelines each carry different state, require different inputs and produce different outputs. So each provider is building its API around the shape of its own product.

And of course there is the economic angle. Training base models is capital-intensive, and margins on API access have been under pressure, in some cases negative. Most of the value the foundation models unlock is realized in the application layer: coding agents like Cursor, research agents like Perplexity, vertical specialists like Harvey. Native APIs let providers reach some of that value directly, with products like server-side tool loops and computer-use sandboxes competing for the work the model is doing anyway. Some of those same third-party products are now visibly under pressure. Figma’s stock fell roughly 7% on the day Anthropic launched Claude Design in April 2026 and is down 50% YTD and 80% since the day it went public less than a year ago. RAG startups built around third-party vector stores increasingly compete with native File Search and URL Context, and browser automation and RPA platforms face the same threat from computer use. Every native primitive is one less reason to integrate a separate SaaS, and the market has started pricing in the structural risk.

The new developer reality #

For developers, this means code now has to know which provider it is talking to, at least at the capability boundary. That might feel like a regression from the days of swapping providers with a config change, but it lets you express and achieve things that the unified API never could.

In practice that means expressing preferences rather than fixing a model. A coding agent declares a first choice based on whichever provider handles its workload best, with one or two backups behind it. A research agent does the same with a different ranking. The completions API remains an option for workloads that do not need a native feature, for deployments on self-hosted open-weights models or for moments when preferred providers are unreachable.

Picking a model means picking it for a workload, a deployment context and a moment in time. Once the question changes, the architecture has to follow. And the same goes for new model releases with updated capabilities.

The shifted role of gateways #

If providers are diverging on capability, then a gateway whose contract is “hide the differences” stops being useful. The very thing it was built to do, smooth the API surface into a single shape, now silently strips the features you went to that provider for in the first place. Anthropic’s extended thinking and computer use, OpenAI’s server-side tools, Gemini’s grounding and citations: none of these survive the flattening to a string for chat completions.

That does not make gateways obsolete because they do more than just basic routing:

  • Within-tier routing and native-shape translation. When the application picks a provider, the gateway resolves the request to its native shape and handles the in-tier details: which credential pool, which region, retries against transient errors, swaps to feature-equivalent alternatives like Azure OpenAI when OpenAI direct is unhealthy.
  • Observability across providers. Latency, token cost, error class, retry counts, all unified into one telemetry stream even when the underlying APIs are structurally different.
  • Governance. Virtual keys, per-tenant budgets, rate limits and audit trails that work consistently regardless of which provider a request lands on.
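To make those three responsibilities concrete, a gateway configuration might cover roughly the following ground. The schema below is invented for illustration and is not Bifrost’s, LiteLLM’s or any other gateway’s actual format:

# Illustrative only: key names and structure are made up to show what the
# gateway owns (in-tier routing, telemetry, governance), not cross-provider choice.
GATEWAY_CONFIG = {
    "providers": {
        "openai": {
            "credential_pool": ["key_a", "key_b"],
            "regions": ["us-east", "eu-west"],
            "retry": {"max_attempts": 3, "on": ["timeout", "rate_limited"]},
            # Feature-equivalent swap within the same capability tier.
            "in_tier_fallback": ["azure-openai"],
        },
        "anthropic": {
            "credential_pool": ["key_c"],
            # Bedrock-hosted Sonnet is not feature-equivalent, so no silent swap.
            "in_tier_fallback": [],
        },
    },
    "telemetry": {
        "fields": ["latency_ms", "tokens", "cost_usd", "error_class", "retries"],
    },
    "governance": {
        "virtual_keys": {
            "team-research": {"monthly_budget_usd": 500, "rate_limit_rpm": 60},
        },
        "audit_log": True,
    },
}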

The cross-provider preference list itself sits one layer above. The agent framework, Pydantic AI’s FallbackModel for example, owns the decision of which provider to try first and which to fall back to next, because that is where the capability tradeoff is actually understood. The gateway sees one resolved request at a time and routes it within whatever capability tier the application asked for. The framework expresses the cross-provider preference; the gateway handles everything else.

One constraint matters here: even reliability fallback is now capability-bounded. A gateway can fail over OpenAI to Azure OpenAI without telling you, because the API shape and feature set are identical. It cannot silently fail over native Anthropic Sonnet to Bedrock-hosted Sonnet, because extended thinking, native tool use and computer use are not yet available on AWS Bedrock. Fallbacks across capability tiers have to be explicit, with the application aware that a degraded provider might be answering.

The work of running an agent against multiple providers ends up split across three places. The application declares a preference list and a prompt. The framework builds the provider-specific payload based on the preference list. The gateway routes the request to the resolved provider and tracks it.

graph TB
    classDef primary fill:#e0f2fe,stroke:#0ea5e9,color:#0c4a6e
    classDef secondary fill:#ffedd5,stroke:#f97316,color:#7c2d12
    subgraph Old["Old: completions api as the standard, gateway as adapter"]
        AppO["Application"]:::primary --> GwO["Gateway<br/>(flatten / route)"]:::secondary
        GwO --> CA["Completions API"]:::secondary
        CA --> P1["OpenAI"]:::primary
        CA --> P2["Anthropic"]:::primary
        CA --> P3["Google"]:::primary
        CA --> P4["Mistral / others"]:::primary
    end
    style Old fill:#eaf6ff,stroke:#0ea5e9,stroke-dasharray:5 5,color:#0c4a6e

graph TB
    classDef primary fill:#e0f2fe,stroke:#0ea5e9,color:#0c4a6e
    classDef secondary fill:#ffedd5,stroke:#f97316,color:#7c2d12
    classDef fallback fill:#d1fae5,stroke:#10b981,color:#064e3b
    subgraph New["New: native APIs preserved, gateway as control plane"]
        AppN["Application<br/>+ preferences"]:::primary --> GwN["Gateway<br/>routing · telemetry · governance"]:::secondary
        GwN --> A1["OpenAI Responses API"]:::primary
        GwN --> A2["Anthropic Messages API"]:::primary
        GwN --> A3["Gemini API"]:::primary
        GwN --> A4["Completions API as<br/>fallback"]:::fallback
    end
    style New fill:#fff4e6,stroke:#f97316,stroke-dasharray:5 5,color:#7c2d12

A pragmatic pattern #

To achieve this, define an ordered provider preference list for each use case and handle model selection at runtime, for example with CEL-based routing rules. Fall back to a generic completions client when nothing else is available. Pydantic AI ships FallbackModel as a first-class concept. Bifrost has a native Pydantic AI integration that wires the framework’s preference list to the gateway directly.

What works well:

  1. Environment dictates availability. Which providers are configured comes from environment variables read at startup. If ANTHROPIC_API_KEY is missing, Anthropic is not in the candidate list. The same applies to Google and OpenAI. A generic chat completions endpoint, configured separately, is the universal fallback.

  2. Startup checks loudly. On boot the application pings each configured provider and emits warning logs for any that are declared but unreachable. Misconfiguration becomes a deploy-time error rather than a runtime surprise.

  3. Each call site picks its preference order. A code-generation agent might prefer Claude first because it handles big codebases well. A research agent might prefer Gemini first for grounding. A strict-extraction agent might prefer GPT-5.4 with the Responses API.

In the best case, an optimized native call lands on the model best suited for the workload. In the worst case, every preferred provider is missing or unreachable at request time and the agent still answers through the chat completions endpoint, just without the native features. Multiple candidates also serve as runtime insurance: a provider reachable at boot can be down thirty minutes later because of a regional outage, a quota reset or a transient connectivity issue.

A simplified but working Pydantic AI implementation (without headers or extra options for routing) could look like this:

import os
from pydantic_ai import Agent
from pydantic_ai.models.anthropic import AnthropicModel
from pydantic_ai.models.fallback import FallbackModel
from pydantic_ai.models.google import GoogleModel
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider


def build_code_agent() -> Agent:
    candidates = []

    if "ANTHROPIC_API_KEY" in os.environ:
        candidates.append(AnthropicModel("claude-sonnet-4-6", ...))
    if "GOOGLE_API_KEY" in os.environ:
        candidates.append(GoogleModel("gemini-3-1-pro", ...))
    if "OPENAI_API_KEY" in os.environ:
        candidates.append(OpenAIModel("gpt-5-4", ...))

    # Universal OpenAI-compatible fallback for any
    # chat completions endpoint.
    candidates.append(
        OpenAIModel(
            "default",
            provider=OpenAIProvider(
                base_url=os.environ["CHAT_COMPLETIONS_URL"],
            ),
        )
    )

    return Agent(FallbackModel(*candidates))

The shape is FallbackModel(optimized_1, optimized_2, ..., chat_fallback). Optimized branches are provider-native classes that expose each provider’s specific features. The completions standard is the lowest common denominator, present in every deployment.
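The startup check from point 2 can reuse the same candidate construction. A minimal sketch, assuming a one-word probe per model is an acceptable boot-time cost:

import logging

from pydantic_ai import Agent
from pydantic_ai.models import Model

logger = logging.getLogger(__name__)


def check_providers(candidates: list[Model]) -> None:
    """Probe every configured model once at boot and warn loudly on failure."""
    for model in candidates:
        try:
            # A tiny throwaway request: enough to surface bad keys,
            # wrong base URLs or unreachable endpoints at deploy time.
            Agent(model).run_sync("ping")
        except Exception:
            logger.warning(
                "Provider check failed for %s", model.model_name, exc_info=True
            )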

What comes next #

The trend likely accelerates. Providers will keep pushing proprietary APIs as the primary surface, because that is where their differentiation lives, and the chat completions API will continue to decline in relevance. Compatibility was a phase produced by ecosystem incentives, and those incentives have flipped. Every new feature now ships as a native primitive that a unified API cannot represent.

Standardization, if it happens at all, will likely come from framework authors and aggregators rather than from providers, since unlike the providers they have a direct stake in cross-provider portability. And the case for a single shared schema, which worked with three or four major players, gets harder once you count open-source labs, regional clouds, Mistral and others.

Designing for fragmentation #

The shared API was the cheap interface. The native APIs are the real product surfaces. Treating providers as interchangeable was a convenience that worked while their products were effectively the same.

The architectural takeaway is small but consequential. Pick models for workloads. Express that choice deliberately, in code, with provider-specific clients where it matters and a chat completions fallback everywhere it does not. Let environment variables decide which providers are reachable at deploy time, and let preference lists decide which one wins at request time.

That is what designing for further fragmentation looks like in practice. The abstraction lives in the agent and its preference list, which is where the divergence between providers and their strengths actually has to be reasoned about. The shared API has a place, but it is no longer the foundation.