
DSPy for Engineers: Separate Intent from Implementation


Part of the DSPy series, this article introduces the mental model and core concepts you need to ship measurable DSPy components with confidence.

The model-specificity problem #

When you prompt from code, you write prompts and tune them until the model does what you want. Over time you add more guidance to handle edge cases: more examples of what to do, more constraints on what not to do. The outputs mostly improve, but there’s constant uncertainty about whether switching models, or even upgrading to the latest model version, will break your carefully crafted prompts.

This happens because string-based prompts are tuned to a specific model, often even a specific model version. What works well with GPT-5 might produce different results with Claude Sonnet 4.5 or even GPT-5.1. A prompt optimized for gpt-5-high can behave unpredictably on gpt-5-xhigh. Because prompts are just strings, there’s no abstraction layer protecting you from these changes.

The non-deterministic nature of language models makes this worse. Small variations in wording can send the model down different paths in its latent space, meaning you are not building your logic on stable ground.

DSPy addresses this by separating the logical intent (what you want) from the prompt engineering (how to get it).


Part 1: The runtime #

With DSPy, you no longer ship “prompts.” You ship Signatures and Modules. Here is what a simple DSPy classifier looks like compared to a traditional string prompt.

String-based approach:

import litellm

user_input = "I can't log in"  # the customer message to classify

prompt = f"""
Classify the customer message into one of these categories:
- billing
- technical  
- sales
- other

Customer message: {user_input}

Respond with only the category name.
"""

response = litellm.completion(
    model="anthropic/claude-4-5-sonnet",
    messages=[{
        "role": "user",
        "content": prompt
    }]
)
category = response.choices[0].message.content.strip()
# Hope it matches exactly and passes validation,
# else retry, including the error message for context
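In practice, that comment tends to turn into a validation-and-retry loop. A minimal sketch of what that usually looks like (the retry budget, the feedback message and the valid-category set are assumptions, not code from a real project):

VALID_CATEGORIES = {"billing", "technical", "sales", "other"}

category = None
for _ in range(3):  # assumed retry budget
    response = litellm.completion(
        model="anthropic/claude-sonnet-4-5",
        messages=[{"role": "user", "content": prompt}]
    )
    candidate = response.choices[0].message.content.strip().lower()
    if candidate in VALID_CATEGORIES:
        category = candidate
        break
    # Feed the failure back so the next attempt can self-correct
    prompt += (
        f"\n\nYour previous answer '{candidate}' is not a valid category. "
        f"Respond with exactly one of: {', '.join(sorted(VALID_CATEGORIES))}."
    )

if category is None:
    raise ValueError("Model never returned a valid category")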

DSPy approach:

import dspy
from dspy.adapters.json_adapter import JSONAdapter
from enum import StrEnum

# 1. Define your domain
class SupportCategory(StrEnum):
    BILLING = "billing"
    TECHNICAL = "technical"
    SALES = "sales"
    OTHER = "other"

# 2. Define the contract (Signature)
class Classify(dspy.Signature):
    """Classify customer message into a category."""
    message: str = dspy.InputField()
    category: SupportCategory = dspy.OutputField()

# 3. Configure the model
lm = dspy.LM(
    "anthropic/claude-4-5-sonnet",
    api_key="your-api-key"
)
dspy.configure(lm=lm, adapter=JSONAdapter())

# 4. Use it (Module)
classifier = dspy.Predict(Classify)
result = classifier(message="I can't log in")
print(result.category)  # SupportCategory.TECHNICAL

How this works behind the scenes #

When you call classifier(message="I can't log in"), DSPy handles the translation from Python objects to LLM prompts and back.

graph TD
    A[Your Code] --> B[DSPy constructs prompt from signature]
    B --> C["Prompt includes: docstring as task description, SupportCategory enum values, JSON format instructions"]
    C --> D["LiteLLM sends to model: anthropic/claude-sonnet-4-5"]
    D --> E[Model returns JSON response]
    E --> F[JSONAdapter parses response]
    F --> G[Returns typed Prediction object]

Key concepts:

  • Signature (Classify): Defines the input/output contract. The docstring becomes the task description.
  • Adapters (JSONAdapter): DSPy automatically instructs the model to output JSON and parses it back into your SupportCategory enum.
  • Type Safety: You work with Python enums, not loose strings.
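To see exactly what DSPy constructed, you can print the most recent LM calls. dspy.inspect_history shows the prompt (task description, enum values, format instructions) and the raw response:

# Run the module once, then dump the last prompt/response pair
classifier(message="I can't log in")
dspy.inspect_history(n=1)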

Part 2: The optimization loop (development) #

The runtime code above gives you structure, but it doesn’t guarantee accuracy. This is where DSPy’s Optimizer comes in. Instead of manually editing the signature docstring or adding examples to the prompt, you define success and let DSPy find the best prompt.

1. The data #

You need examples of what “good” looks like. The fields in dspy.Example must match your signature fields (message and category).

# Small dataset (10-50 examples is usually enough to start)
# Ideally, these come from a CSV/JSON file maintained by QA
trainset = [
    dspy.Example(
        message="Where is my bill?",
        category=SupportCategory.BILLING
    ),
    dspy.Example(
        message="It's broken",
        category=SupportCategory.TECHNICAL
    ),
    # ...
]
# Mark the input fields on each example; the rest are treated as labels
trainset = [example.with_inputs("message") for example in trainset]

Key detail: calling .with_inputs("message") on each example tells DSPy which fields are inputs. The other fields (here category) are treated as expected labels.
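As the comment in the snippet suggests, these examples usually live in a file rather than in code. A small loader along these lines works, assuming a hypothetical support_examples.csv with message and category columns maintained by QA:

import csv

def load_examples(path="support_examples.csv"):  # hypothetical file name
    # Expects "message" and "category" columns; category values must match the enum
    with open(path, newline="") as f:
        return [
            dspy.Example(
                message=row["message"],
                category=SupportCategory(row["category"])
            ).with_inputs("message")
            for row in csv.DictReader(f)
        ]

trainset = load_examples()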

2. The metric #

You tell DSPy how to grade the model’s work.

def validate_support_category(example, pred, trace=None):
    # Exact match on the enum (the optional trace argument is part of DSPy's metric contract)
    return example.category == pred.category
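A quick way to sanity-check the metric before handing it to an optimizer is to run it by hand on a single prediction:

# Manual check: does the metric grade a real prediction as expected?
pred = classifier(message="Where is my bill?")
print(validate_support_category(trainset[0], pred))  # True if the model predicted BILLING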

3. The optimizer #

Pick an optimization strategy. BootstrapFewShot is a great starter: it runs your program on the training examples, grades each attempt with your metric and keeps the attempts that pass as few-shot demonstrations in the prompt.

from dspy.teleprompt import BootstrapFewShot

# Compile the predictor: bootstrap few-shot demos that pass the metric
optimizer = BootstrapFewShot(
    metric=validate_support_category,
    max_bootstrapped_demos=4
)
compiled_classifier = optimizer.compile(
    dspy.Predict(Classify),
    trainset=trainset
)

compiled_classifier.save("support_category_classifier.json")

It’s crucial to understand that optimization is a build-time step, like training a model or compiling code. You run this script offline or in CI. The output is support_category_classifier.json, which contains the optimized prompts and examples.

In Production: Your app doesn’t import the dataset or optimizer. It just loads the optimized behavior:

# Production code
classifier = dspy.Predict(Classify)
classifier.load("support_category_classifier.json")

# Now it has the optimized few-shot examples baked in
result = classifier(message="...")

The feedback loop #

graph TD
    subgraph Runtime
        A["Signature"]
        H["Compiled Module (support_category_classifier.json)"]
    end
    subgraph BuildTime
        B["Optimizer"]
        D["Dataset"]
        M["Metric"]
        E{"Run and Measure"}
        F["Discard example"]
        G["Add to Prompt"]
    end
    A --> B
    D --> B
    M --> B
    B --> E
    E -->|Success| G
    E -->|Fail| F
    G --> H

DSPy distinguishes between Build-time optimization and Runtime execution. During Build-time, the optimizer uses your labeled dataset and metric to experiment with different prompts. It keeps the variations that produce the best results and saves them into a compiled module (e.g., support_category_classifier.json).

At Runtime, your application simply loads this JSON file. It doesn’t need the dataset, the metric or the optimizer, just the optimized instructions and examples.


When to run evaluations #

Evaluations in DSPy aren’t just a one-off thing. Because you have a codified metric, you can use it in different phases:

During Development (Optimization): The metric is the “loss function” for your prompts. When you run a DSPy optimizer, it will automatically run your metric dozens or even hundreds of times, testing different instruction phrasings and few-shot examples to see which one maximizes the score.

During Maintenance (Regression Testing): This is where you solve the Model Drift problem. You run a full evaluation in three specific scenarios:

  1. Model Updates: When a model provider releases a new version, you rerun your eval (using dspy.Evaluate; see the sketch after this list). Set a score threshold: if the new version falls below it, re-optimize and replace the JSON file.
  2. Model Swapping: Before moving to a different, ideally cheaper or faster model (e.g., Claude Sonnet to Haiku), you run the eval to check that it still reaches your quality threshold.
  3. Drift Detection: Model APIs can change silently even within the same “version” tag. Running your evaluations on a separate, high-quality validation dataset (distinct from your training examples) before a release alerts you if the model’s performance degrades or shifts unexpectedly, allowing you to catch issues before your users do. Just make sure to never optimize based on the holdout data.
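A regression run with dspy.Evaluate could look like the sketch below; the held-out examples and the thread count are assumptions you would adapt to your own setup.

from dspy.evaluate import Evaluate

# Held-out examples, built the same way as the trainset
# but never used during optimization
devset = [
    dspy.Example(
        message="My card was charged twice",
        category=SupportCategory.BILLING
    ).with_inputs("message"),
    # ...
]

evaluate = Evaluate(
    devset=devset,
    metric=validate_support_category,
    num_threads=8,
    display_progress=True
)

score = evaluate(classifier)
print(score)  # compare against your quality threshold; re-optimize if it drops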

Conclusion #

DSPy is a great choice when you need reliability and measurement. If you are building a production feature that must survive model updates or meet strict business requirements, DSPy’s optimizer acts as your safety net. It lets you iterate based on data and have the system fix the prompt, rather than guessing which wording or prompt formatting might make the model behave better.

In the next tutorials in this series I will build small, measurable components:

  • Classification: Route user messages to categories based on intent. Build a classifier with a strict label set, measure exact match and see how eval loops work in practice.
  • Structured output with adapters: Generate code without markdown wrappers or unwanted explanations. Use adapters to enforce format constraints (no ```sql blocks, just code) and validate outputs parse correctly.
  • Ranking and selection: Identify the most meaningful columns from 100+ options using context (missingness, domain, relationships). Measure agreement with human-labeled important columns.
  • Optimizers: Compare BootstrapFewShot to more elaborate optimizers. Understand what they search over, how to control costs and how to avoid overfitting to small dev sets.
  • Production patterns: Versioning, regression testing and model swaps. Includes a case study on tool-using modules (address validation with external APIs) to show how production concerns change when you add external dependencies.