DSPy Tutorial: Building a Classification Module
Part of the DSPy series. This tutorial walks through building a small classification module with a dataset, metric and optimization loop. If you are new to DSPy, start with the overview article first.
Classification without the prompt guessing game #
When you build a classifier with string prompts, you typically start with a system message that lists the categories and maybe a few examples. Then you iterate: add more examples for edge cases, tweak the wording, test on a few inputs and hope it generalizes. When the model gets something wrong, you adjust the prompt and test again. This cycle repeats until you run out of time or patience.
The problem is that every change is a guess. You never know if your new phrasing helps overall or just fixes one case while breaking three others. You have no systematic way to measure progress.
DSPy replaces guessing with measurement. You define what success looks like with a metric, provide examples and let the optimizer find the best prompt automatically. When you need to improve accuracy, you add more training examples or try a different optimizer, not rewrite prompt text.
In this tutorial, I build a support ticket classifier from scratch. I’ll walk through defining the task contract (Signature), writing a simple accuracy metric and using BootstrapFewShot to optimize few-shot examples. By the end, you’ll have a working example and understand how to measure and improve classifiers systematically.
The classification task #
Problem: Route incoming customer support messages into four categories: billing, technical, sales or other.
Support ticket routing is a common task with clear success criteria. You either get the category right or you don’t. No ambiguity, which makes it perfect for this first example.
The code is organized as follows:
dspy-01-classification/
├── data/
│   ├── train.json             # 20 labeled examples
│   └── test.json              # 10 holdout examples
├── support_classifier.py      # Signature + module definition
├── optimize.py                # Runs BootstrapFewShot optimizer
└── eval.py                    # Measures accuracy on test set
Step 1: Define the signature #
A Signature is DSPy’s contract for an LLM task. It specifies the input and output fields along with their types and descriptions.
from enum import StrEnum

import dspy


class SupportCategory(StrEnum):
    BILLING = "billing"
    TECHNICAL = "technical"
    SALES = "sales"
    OTHER = "other"


class ClassifyMessage(dspy.Signature):
    """Classify customer support message into one category."""

    message: str = dspy.InputField(
        desc="The customer's support message"
    )
    category: SupportCategory = dspy.OutputField(
        desc="The category that best matches the message"
    )
Key design choices:
- Enum for categories: Using StrEnum gives type safety and makes the valid values explicit. As we see next, DSPy’s JSONAdapter automatically includes these enum values in the prompt.
- Docstring as task description: The signature docstring becomes the instruction to the model. For this simple classification task, it can be as short as a single sentence.
- Field descriptions: These are optional, but I recommend including them, especially when the field purpose isn’t obvious from the name alone.
The signature defines the logical task without any mention of how to prompt the model. That separation is what makes DSPy different from string-based prompting.
Step 2: Configure DSPy with the LLM #
DSPy uses LiteLLM under the hood, so you can use any supported provider (OpenAI, Anthropic, Google, AWS Bedrock, etc.). Model names follow the format provider/model-name.
For this tutorial, I use Anthropic’s Claude Sonnet 4.5:
import dspy
from dspy.adapters.json_adapter import JSONAdapter
from dotenv import load_dotenv
load_dotenv()
def configure_dspy():
    """Configure DSPy with language model and JSON adapter."""
    lm = dspy.LM(
        model="anthropic/claude-sonnet-4-5-20250929",
        max_retries=2,
        timeout=60,
    )
    dspy.configure(lm=lm, adapter=JSONAdapter())
What this does:
- dspy.LM(...): Creates a language model client. The model parameter uses LiteLLM’s format: anthropic/claude-sonnet-4-5-20250929, but you could swap in openai/gpt-5 or gemini/gemini-3.0-flash with a one-line change.
- JSONAdapter(): Tells DSPy to request JSON output from the model and parse it back into Python types. Critical for structured outputs like enums.
- dspy.configure(...): Sets both of these as the global defaults for all DSPy modules in the script.
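For example, switching to a different provider is just a different model string (a sketch; use whatever LiteLLM identifier fits your setup):

# Hypothetical swap to another LiteLLM-supported provider; everything else stays the same.
lm = dspy.LM(
    model="openai/gpt-5",  # or "gemini/gemini-3.0-flash"
    max_retries=2,
    timeout=60,
)
dspy.configure(lm=lm, adapter=JSONAdapter())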
Environment setup: LiteLLM automatically picks up credentials from environment variables. Create a .env file:
ANTHROPIC_API_KEY=your_api_key
Step 3: Create the baseline classifier #
With the signature defined and DSPy configured, I create a classifier in one line:
classifier = dspy.Predict(ClassifyMessage)
dspy.Predict is the simplest DSPy module. It takes a signature and makes a single LLM call to produce the output. No chain-of-thought reasoning, no tool use, just input to output.
Using the classifier:
result = classifier(message="I can't log in to my account")
print(result.category) # SupportCategory.TECHNICAL
The result is a dspy.Prediction object with fields that match the signature’s output fields. Access result.category to get the enum value.
What DSPy does behind the scenes:
Construct the prompt from the signature (docstring, fields, enum values, JSON format) → send it to the LLM via LiteLLM → parse and validate the JSON response → return a Prediction object.
DSPy constructs the prompt from the signature, field descriptions and the adapter’s format requirements, not the typical developer-provided prompt text.
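If you want to see the prompt DSPy actually built, you can print the most recent LLM call with DSPy’s built-in history inspector:

# Run one prediction, then print the last LLM call (prompt and raw response).
classifier(message="I can't log in to my account")
dspy.inspect_history(n=1)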
Step 4: Prepare the dataset #
The optimizer needs labeled examples to learn from. These go in JSON files with one example per object. For simplicity, I’ll use small datasets here and no validation set.
data/train.json (20 examples, used to train the model):
[
  {
    "message": "I haven't received my invoice yet",
    "category": "billing"
  },
  {
    "message": "The app crashes when I export data",
    "category": "technical"
  },
  {
    "message": "Can I upgrade to the enterprise plan?",
    "category": "sales"
  }
]
data/test.json (10 examples) has the same format and is used to evaluate the model.
Note: For most optimizers, DSPy recommends reversing the usual 80-20 split between training and validation to emphasize stable validation and avoid overfitting.
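If you start from a single labeled pool, that reversed split is easy to produce up front (a sketch with a hypothetical all_examples list; this tutorial itself just uses fixed train and test files):

# Reversed split: 20% for training demos, 80% held out for validation.
import random

random.seed(0)
random.shuffle(all_examples)  # all_examples: your full list of dspy.Example objects (hypothetical)
cut = int(0.2 * len(all_examples))
trainset, valset = all_examples[:cut], all_examples[cut:]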
Loading the data:
import json
from pathlib import Path
from typing import Literal

import dspy

from support_classifier import SupportCategory


def load_dataset(split: Literal["train", "test"]) -> list[dspy.Example]:
    """Load training or test data from JSON files."""
    data_dir = Path(__file__).parent / "data"
    file_path = data_dir / f"{split}.json"

    with open(file_path) as f:
        data = json.load(f)

    examples = [
        dspy.Example(
            message=item["message"],
            category=SupportCategory(item["category"]),
        ).with_inputs("message")
        for item in data
    ]
    return examples
Note: .with_inputs("message") tells DSPy that message is the input field. The remaining fields (here just category) are treated as labels for optimization and evaluation.
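To double-check that split between inputs and labels, you can ask an Example for each part; both inputs() and labels() are methods on dspy.Example:

example = load_dataset("train")[0]
print(example.inputs())  # Example containing only the 'message' field
print(example.labels())  # Example containing only the 'category' field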
Step 5: Define the metric #
The metric defines how to measure success. For classification with a closed label set, exact match is the standard choice.
def accuracy_metric(example: dspy.Example, pred, trace=None):
    """
    Simple metric which returns 1.0 if predicted category matches
    expected, 0.0 otherwise.
    """
    return float(example.category == pred.category)
How DSPy uses this:
- During optimization, the metric scores each candidate prompt. DSPy keeps the prompt variations that maximize the average score.
- During evaluation, the metric measures your module’s performance on the test set.
DSPy optimizers expect metrics to return a score between 0.0 and 1.0. More elaborate metrics work for more complex tasks, but for classification, binary scoring (0.0 or 1.0) is sufficient.
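You can also call the metric directly to sanity check it before handing it to the optimizer (the message below is a made-up example):

# Manual check of the metric on a single hypothetical example.
example = dspy.Example(
    message="Why was I charged twice this month?",
    category=SupportCategory.BILLING,
).with_inputs("message")
pred = classifier(message=example.message)
print(accuracy_metric(example, pred))  # 1.0 if the model predicts billing, else 0.0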
Step 6: Run the baseline evaluation #
Before optimizing, I measure the baseline accuracy. This shows how well the classifier performs with just the signature and no few-shot examples.
eval.py (baseline only):
import dspy
from dspy.evaluate import Evaluate

from optimize import accuracy_metric, load_dataset
from support_classifier import ClassifyMessage, configure_dspy

configure_dspy()

testset = load_dataset("test")
evaluator = Evaluate(
    devset=testset,
    metric=accuracy_metric,
    display_progress=True,
)

baseline_classifier = dspy.Predict(ClassifyMessage)
baseline_score = evaluator(baseline_classifier)
print(f"Baseline Accuracy: {baseline_score:.1%}")
Output:
Average Metric: 80.0% (8/10): 100%|██████████| 10/10
Baseline Accuracy: 80.0%
The baseline classifier gets 80% accuracy on this simple task without any few-shot examples. That’s already reasonable, but optimization can push it a lot higher.
Step 7: Optimize with BootstrapFewShot #
BootstrapFewShot is a simple optimizer. It selects the most suitable few-shot examples from the training set by testing which ones lead to the highest metric scores.
optimize.py:
import dspy
from dspy.teleprompt import BootstrapFewShot

from support_classifier import ClassifyMessage, configure_dspy

# accuracy_metric and load_dataset are defined earlier in this file (Steps 4 and 5)

configure_dspy()

trainset = load_dataset("train")
classifier = dspy.Predict(ClassifyMessage)

optimizer = BootstrapFewShot(
    metric=accuracy_metric,
    max_labeled_demos=4,
    max_bootstrapped_demos=4,
)
compiled_classifier = optimizer.compile(
    classifier,
    trainset=trainset,
)
compiled_classifier.save("support_classifier_compiled.json")
Understanding the parameters:
- max_labeled_demos=4: Maximum number of demonstrations randomly selected from the training set.
- max_bootstrapped_demos=4: Maximum number of additional demonstrations generated by the classifier itself that pass the metric validation during optimization.
The optimizer tests all training examples and selects the best combination to maximize accuracy. Common configurations:
- Labeled-only (e.g., max_bootstrapped_demos=0, max_labeled_demos=4): When the baseline fails too often.
- Bootstrapped-only (e.g., max_bootstrapped_demos=4, max_labeled_demos=0): When the baseline already works well.
- Mixed (both non-zero): Gives the optimizer flexibility to find the best combination.
Increase these values for higher accuracy at the cost of more tokens per request.
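For example, a labeled-only run keeps everything else the same and only changes the two demo limits (a sketch based on the optimizer call above):

# Labeled-only configuration: rely on hand-labeled demos, skip bootstrapping.
optimizer = BootstrapFewShot(
    metric=accuracy_metric,
    max_labeled_demos=4,
    max_bootstrapped_demos=0,
)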
What the optimizer outputs:
A JSON file (support_classifier_compiled.json) containing the optimized prompt with selected few-shot examples.
Step 8: Evaluate the optimized classifier #
Finally, I measure the optimized classifier’s accuracy on the same test set.
# ...
optimized_classifier = dspy.Predict(ClassifyMessage) # not yet optimized
optimized_classifier.load("support_classifier_compiled.json") # now it is
optimized_score = evaluator(optimized_classifier)
print(f"Baseline: {baseline_score:.1%}")
print(f"Optimized: {optimized_score:.1%}")
print(f"Change: {optimized_score - baseline_score:+.1%}")
Output:
Baseline: 80.0%
Optimized: 100.0%
Change: +20.0%
The optimized classifier improves from 80% to 100% accuracy!
Understanding the compiled output #
The JSON file contains everything the classifier needs at runtime:
{
  "traces": [],
  "train": [],
  "demos": [
    {
      "augmented": true,
      "message": "I haven't received my invoice for this month yet",
      "category": "billing"
    },
    // ...
  ],
  "signature": {
    "instructions": "Classify customer support message into one category.",
    "fields": [
      {
        "prefix": "Message:",
        "description": "The customer's support message"
      },
      {
        "prefix": "Category:",
        "description": "The category that best matches the message"
      }
    ]
  },
  "lm": null,
  "metadata": {
    "dependency_versions": {
      "python": "3.11",
      "dspy": "3.0.4",
      "cloudpickle": "3.1"
    }
  }
}
What’s included:
- Demonstrations: The few-shot examples selected by the optimizer.
- Instructions: Any optimized prompt text (not shown here because BootstrapFewShot only tunes demos).
What’s NOT included:
- The unused training data (the full 20 examples).
- The metric function.
- The optimizer code.
- The language model config. At runtime, the classifier uses whatever model you set via dspy.configure, which keeps the JSON file model-agnostic.
In production, only the compiled JSON and the signature definition are required. No other files need to be deployed.
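A minimal production entry point could therefore look like this (a sketch, assuming the compiled JSON ships alongside the code):

# Runtime setup: configure the LM, create the module, then load the compiled prompt.
configure_dspy()
classifier = dspy.Predict(ClassifyMessage)
classifier.load("support_classifier_compiled.json")

result = classifier(message="My card was charged twice")  # hypothetical input
print(result.category)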
A more advanced approach: ChainOfThought instead of Predict #
dspy.Predict makes a single LLM call from input to output. dspy.ChainOfThought adds a reasoning step before producing the answer.
Predict:
classifier = dspy.Predict(ClassifyMessage)
result = classifier(message="I can't log in to my account")
# Output:
# 'category': <SupportCategory.TECHNICAL: 'technical'>
ChainOfThought:
classifier = dspy.ChainOfThought(ClassifyMessage)
result = classifier(message="I can't log in to my account")
# Output (additional reasoning included):
# 'reasoning': 'The customer is experiencing an issue with logging into
# their account. This is a technical problem related to account access
# and authentication, which falls under technical support rather than
# billing, sales, or other general inquiries.'
# 'category': <SupportCategory.TECHNICAL: 'technical'>
Trade-offs:
| Aspect | Predict | ChainOfThought |
|---|---|---|
| Latency | Lower (1 generation) | Higher (2 fields generated) |
| Cost | Lower (fewer tokens) | Higher (rationale tokens) |
| Accuracy | Good for simple tasks | Better for ambiguous tasks |
| Debuggability | Limited (just the answer) | High (includes reasoning) |
When to use ChainOfThought:
- Ambiguous inputs where category isn’t obvious.
- When you need to understand why the model chose a category.
- When accuracy matters more than latency.
When to stick with Predict:
- Clear-cut classification tasks (like this one).
- Latency-sensitive applications.
- When cost per call matters.
For this support ticket classifier, Predict is sufficient. The categories are well-defined and most messages map cleanly to one category. When classifying more ambiguous content, ChainOfThought could be a viable alternative.
Swapping is trivial: Change one line and re-run optimization. The signature stays the same.
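Concretely, the swap looks like this; the signature and the optimization call stay exactly as before:

# Swap the module type; signature, metric and optimizer are unchanged.
classifier = dspy.ChainOfThought(ClassifyMessage)
compiled_classifier = optimizer.compile(classifier, trainset=trainset)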
Key takeaways #
Signatures separate intent from implementation. Define what you want (input and output types) without specifying how to prompt for it. This makes the code model-agnostic.
Metrics replace guessing. Instead of manually tweaking prompts and hoping they improve overall performance, write a metric once and let the optimizer find the best prompt automatically.
Optimization is a build-time step. Don’t run the optimizer in production. Run it offline, save the compiled module and load it at runtime. This keeps production code simple and fast.
Few-shot examples matter. BootstrapFewShot improved accuracy by 20 percentage points just by selecting the right demonstrations. No prompt engineering, no trial and error.
Evaluation is the safety net. When switching models or updating the optimizer, rerun eval.py on the test set. If the score drops, you know immediately. Run it in the release pipeline as a deployment gate. Without such evals and appropriate thresholds, you have no way of knowing how your system will perform.
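One way to wire that gate into a pipeline is a simple threshold check at the end of eval.py (a sketch; the 0.9 threshold is a made-up value to tune for your own task):

# Fail the release pipeline if accuracy drops below the agreed threshold.
THRESHOLD = 0.9  # hypothetical gate value
if optimized_score < THRESHOLD:
    raise SystemExit(f"Eval gate failed: {optimized_score:.1%} < {THRESHOLD:.1%}")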