Why Prompting from Code Isn't Like Using ChatGPT

Part of the Prompting from Code series, this article looks at how prompts that feel natural in chat can become fragile once embedded in code.

The Shift from Chat to Code #

Prompting an LLM from a terminal or a web UI is simple. You write something, submit it and move along in the conversation. You, the human, are in charge, interpreting, giving feedback and deciding the direction based on prior responses.

But once a prompt exists in code, everything changes. The human in the loop disappears and every response now needs to stand on its own, without retries or quick clarifications.

What feels intuitive in a conversation becomes fragile when automated. Now the prompt serves a user or another service that cannot interact with the underlying model or control its context. But the issue isn’t the model itself: programmatic prompting turns natural language into part of your system’s logic, and that requires a different approach.


The Hidden Fragility of Prompts #

Language models are stochastic in nature. Small shifts in temperature, context or model version can send them down an entirely different path in their latent space. In a chat, that’s an acceptable or even desired part of the experience. In production software, it’s a liability.

When a prompt sits behind an API or UI, its outputs feed into other systems. Parsers, validators and renderers will depend on it. A different or missing quote character, a missing colon or an object unexpectedly wrapped in an array can break the entire flow. Testing and handling those responses is difficult because you’re not evaluating a single output but a distribution of possible ones.
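
To make that concrete, here is a minimal sketch of the kind of defensive parsing a downstream consumer ends up writing. The expected "answer" field and the array-unwrapping rule are illustrative assumptions, not part of any specific API.

```python
import json


def parse_model_reply(raw: str) -> dict:
    """Defensively parse a model reply that is supposed to be a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # A stray quote or missing colon lands here instead of crashing a renderer downstream.
        raise ValueError(f"Model reply is not valid JSON: {raw[:80]!r}")

    # Models sometimes wrap the object in a single-element array.
    if isinstance(data, list) and len(data) == 1 and isinstance(data[0], dict):
        data = data[0]

    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError("Model reply is missing the expected 'answer' field")

    return data
```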

And yet, more tests are not the solution. You can’t unit-test model creativity. You need to design your prompts and code around its volatility. Prompts deserve the same attention as code: versioning, and thoughtful, incremental changes. When models receive updates, prompts implicitly age. What worked a few months back might now produce redundant or confusing results. Because of the sheer number of possibilities, these regressions often go unnoticed in QA and only surface in error logs or support tickets, degrading the user experience.
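
One low-effort way to give prompts that attention is to keep them as versioned, model-pinned artifacts instead of inline strings. The sketch below is illustrative; the names and fields are assumptions, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    id: str        # e.g. "summarize-v3"
    model: str     # the model version this prompt was tuned against
    template: str  # the prompt text, with placeholders filled in by code


SUMMARIZE_V3 = PromptVersion(
    id="summarize-v3",
    model="gpt-4o-2024-08-06",  # assumed model identifier
    template="Summarize the following text in at most three sentences:\n\n{text}",
)
```

Pinning the model version alongside the prompt turns a regression after a model update into something you can diff and review, rather than something you first learn about from support tickets.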


Why the Kitchen-Sink Prompt Fails #

The first instinct when prompts start misbehaving is to expand the instructions: more constraints, more “Good:” and “Bad:” examples and more capitalized “IMPORTANT:” lines. But the more you add, the more you dilute the meaning of each individual instruction. And after just a few iterations, the system prompt can resemble an unnecessarily repetitive policy document, often tweaked to address the latest QA issue even if the fix only works most of the time.

This approach is neither sustainable nor maintainable long-term, especially because every model interprets verbosity differently. Some reward explicit markers (“DO NOT”, “CRITICAL”, XML tags or elaborate disclaimers). Others overfit, repeating instructions verbatim or distorting structure. The same prompt can shine in one model and suffocate in another.

Ultimately, a model doesn’t care what you meant, only what you said. Without the right structure for the model you are using, your intent dissolves.


Designing for Stability Without Losing Flexibility #

Working with LLMs in production is a constant trade-off between stability and flexibility. You want outputs you can rely on, but the system also needs to adapt as models evolve. Lock everything down and the system becomes fragile. Leave it too generic and outputs become flaky.

Structure without overfitting: The key is building boundaries that contain flakiness while allowing the model to operate effectively. For example, instead of adding ten instructions about JSON formatting, enforce structured output through schema validation. Instead of prompting “be concise” repeatedly, set max_tokens in the API call. The prompt handles intent; the code handles constraints.
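
A minimal sketch of that split, assuming the OpenAI Python SDK and Pydantic for validation. The schema, model name and field names are illustrative assumptions.

```python
from openai import OpenAI
from pydantic import BaseModel, ValidationError


class Summary(BaseModel):
    title: str
    bullet_points: list[str]


client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[
        {
            "role": "system",
            "content": "Summarize the user's text as JSON with 'title' and 'bullet_points'.",
        },
        {"role": "user", "content": "Text to summarize goes here."},
    ],
    max_tokens=300,  # length is enforced by the API, not by repeating "be concise"
    response_format={"type": "json_object"},  # request JSON instead of pleading for it in prose
)

try:
    summary = Summary.model_validate_json(response.choices[0].message.content)
except ValidationError:
    # The schema, not another IMPORTANT: line, decides whether the output is usable.
    raise
```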

Guardrails give prompts context, reduce unexpected behavior and make outputs predictable enough to integrate with downstream systems and handle them effectively.

Practical approaches: API abstraction layers like LiteLLM standardize behavior across providers and model versions. Prompt adaptation frameworks like DSPy programmatically optimize prompts as models change. These strategies treat prompts as modular, testable components rather than ad-hoc instructions.
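
As a rough illustration of the abstraction-layer idea, the sketch below routes the same call through LiteLLM to two different providers. The model names are assumptions and error handling is omitted for brevity.

```python
from litellm import completion

messages = [{"role": "user", "content": "Classify this ticket as 'bug' or 'feature'."}]

# The call shape stays the same across providers; swapping models becomes a config change.
for model in ("gpt-4o-mini", "claude-3-5-haiku-20241022"):
    response = completion(model=model, messages=messages, max_tokens=20)
    print(model, response.choices[0].message.content)
```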

When managed well, this balance produces a system that is reliable, resilient and easier to maintain over time, letting your prompts do their job without being over-tuned to a single model version.


Conclusion: Prompts as Code, Not Conversation #

When you prompt from code, you stop chatting with the model and start defining behavior through it. The difference may seem subtle at first. Then a model update breaks your flow or a user triggers an edge case you never anticipated.

Structure isn’t over-engineering. It’s the deliberate design that keeps you in control of your prompts and system integration: consistent formatting, clear instructions, modular and versioned components and guardrails that keep outputs predictable. This kind of structure lets your system remain maintainable and resilient even as models change or behave unpredictably.

In a chat, a good prompt gets the desired results.
In a system, structure prevents failures even as models change.

Programmatic prompting isn’t about finding the right words. It’s about designing prompts that work as part of a system, not just a conversation.

The next part of this series covers the first line of defense against this fragility: intent classification. Understanding what users mean before the prompt even runs is how you prevent the model from taking your system down paths you never intended.