Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
Learn to build a production-ready classification system using Claude, prompt engineering, and retrieval-augmented generation (RAG). This step-by-step guide covers data preparation, prompt design, vector search integration, and evaluation, taking accuracy on complex business rules from 70% to 95%+.
Large Language Models (LLMs) have transformed classification tasks, especially when dealing with complex business rules, limited training data, or the need for explainable results. In this guide, you'll build a production-ready insurance support ticket classifier using Claude, prompt engineering, and Retrieval-Augmented Generation (RAG).
By the end, you'll have a system that categorizes tickets into 10 categories with 95%+ accuracy—and you'll understand the techniques to replicate this for your own use cases.
Prerequisites
- Python 3.11+ with basic familiarity
- An Anthropic API key (available from the Anthropic Console)
- A VoyageAI API key (optional—embeddings are pre-computed in the cookbook)
- Basic understanding of classification problems
Setup
First, install the required packages:
```bash
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Then load your API keys and set your model:
```python
import os

from anthropic import Anthropic

anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

MODEL_NAME = "claude-3-opus-20240229"  # or claude-3-sonnet-20240229 for faster/cheaper responses
```
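Before moving on, you can optionally confirm the key and model name work with a one-off test message (this snippet is just a sanity check and isn't part of the classifier):

```python
# Optional sanity check: send a trivial message and print the reply
response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=20,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.content[0].text)
```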
Problem Definition: Insurance Support Ticket Classifier
Insurance companies receive thousands of support tickets daily—billing questions, policy changes, claims assistance, and more. Manually categorizing these is slow and error-prone. We'll build a classifier that handles 10 categories, including:
- Billing Inquiries – invoices, charges, fees, premiums
- Policy Administration – changes, cancellations, renewals
- Claims Assistance – filing, documentation, status
- Coverage Explanations – limits, exclusions, deductibles
- (and 6 more categories; see the full cookbook for the rest, and the definitions sketch below)
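The prompts in the next steps expect these definitions as one string. Here is a minimal sketch of how you might hold them; the wording of each definition is illustrative, and the remaining six categories from the cookbook are omitted:

```python
# Illustrative category definitions; the exact wording and the remaining six
# categories come from the cookbook's dataset, not from this guide.
CATEGORIES = {
    "Billing Inquiries": "Invoices, charges, fees, and premium payments.",
    "Policy Administration": "Policy changes, cancellations, and renewals.",
    "Claims Assistance": "Filing claims, documentation, and claim status.",
    "Coverage Explanations": "Coverage limits, exclusions, and deductibles.",
    # ...six more categories from the cookbook...
}

class_definitions = "\n".join(
    f"- {name}: {description}" for name, description in CATEGORIES.items()
)
```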
Step 1: Data Preparation
We'll split our data into training and test sets. The training set is used to build the classifier (via examples in prompts), and the test set evaluates accuracy.
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load your dataset (synthetic data from the cookbook)
df = pd.read_csv("insurance_tickets.csv")

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```
Step 2: Prompt Engineering for Classification
A well-structured prompt is the foundation of accurate classification. Here's a template that includes:
- System message defining the task
- Class definitions for each category
- User query to classify
```python
def build_classification_prompt(query, class_definitions):
    system_prompt = """You are an expert insurance ticket classifier. Your task is to categorize support tickets into one of the following categories. Respond with ONLY the category name."""

    user_prompt = f"""Category definitions:
{class_definitions}

Ticket to classify:
{query}

Category:"""

    return system_prompt, user_prompt
```
Why this works: Claude understands nuanced business rules from natural language definitions alone. No need for thousands of labeled examples.
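The evaluation in Step 5 calls a classify_baseline function that isn't shown above. As a minimal sketch of what it might look like (assuming a class_definitions string built from your category definitions, for example the CATEGORIES dict sketched earlier), it simply feeds build_classification_prompt into the Messages API:

```python
def classify_baseline(query):
    # Prompt-only classification: category definitions, no retrieved examples
    system_prompt, user_prompt = build_classification_prompt(query, class_definitions)
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    # The prompt asks for the category name only, so return the reply as-is
    return response.content[0].text.strip()
```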
Step 3: Implementing Retrieval-Augmented Generation (RAG)
To boost accuracy further, we'll retrieve the most similar examples from our training data and include them in the prompt. This gives Claude concrete reference points.
Generate Embeddings
```python
import voyageai

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

train_texts = train_df["ticket_text"].tolist()
train_embeddings = vo.embed(train_texts, model="voyage-2").embeddings
```
Build a Vector Store
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Index the training embeddings
nn_model = NearestNeighbors(n_neighbors=5, metric="cosine")
nn_model.fit(train_embeddings)
```
Retrieve and Augment
```python
def retrieve_examples(query, k=3):
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    distances, indices = nn_model.kneighbors([query_embedding], n_neighbors=k)
    return train_df.iloc[indices[0]]


def classify_with_rag(query):
    examples = retrieve_examples(query, k=3)

    example_text = ""
    for _, row in examples.iterrows():
        example_text += f"Example ticket: {row['ticket_text']}\nCategory: {row['category']}\n\n"

    system_prompt = "You are an expert insurance ticket classifier."
    user_prompt = f"""Here are some examples of classified tickets:
{example_text}
Now classify this ticket:
{query}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text.strip()
```
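A quick usage check (the ticket text below is made up for illustration; the expected label follows from the Billing Inquiries definition):

```python
# Hypothetical ticket text, not from the dataset
print(classify_with_rag("I was charged twice for my premium this month. Can I get the duplicate refunded?"))
# Likely output: Billing Inquiries
```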
Step 4: Adding Chain-of-Thought Reasoning
For even higher accuracy, ask Claude to reason step-by-step before giving the final answer. This is especially useful for ambiguous tickets.
```python
def classify_with_cot(query):
    # Reuse the Step 3 retriever so this takes a single argument, matching the evaluation harness below
    examples = retrieve_examples(query, k=3)
    example_text = "\n\n".join(
        f"Example ticket: {row['ticket_text']}\nCategory: {row['category']}"
        for _, row in examples.iterrows()
    )

    system_prompt = """You are an expert insurance ticket classifier. First, reason step-by-step about the ticket, then provide the final category."""

    user_prompt = f"""Examples:
{example_text}

Ticket: {query}

Let's think step by step:
- What is the main topic of this ticket?
- Which category definition does it match?
- Final category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )

    # The reply contains reasoning followed by the category; keep only the text
    # after the last "Final category:" marker so it can be compared to the gold label
    text = response.content[0].text.strip()
    if "Final category:" in text:
        return text.split("Final category:")[-1].strip()
    return text.splitlines()[-1].strip()
```
Step 5: Testing and Evaluation
Now let's evaluate our classifier on the test set:
```python
def evaluate_classifier(classifier_fn, test_df):
    correct = 0
    total = len(test_df)

    for _, row in test_df.iterrows():
        predicted = classifier_fn(row["ticket_text"])
        if predicted == row["category"]:
            correct += 1

    accuracy = correct / total
    return accuracy
```
```python
# Baseline: prompt only
baseline_accuracy = evaluate_classifier(classify_baseline, test_df)
print(f"Baseline accuracy: {baseline_accuracy:.2%}")  # ~70%

# With RAG
rag_accuracy = evaluate_classifier(classify_with_rag, test_df)
print(f"RAG accuracy: {rag_accuracy:.2%}")  # ~90%

# With RAG + chain-of-thought
cot_accuracy = evaluate_classifier(classify_with_cot, test_df)
print(f"RAG + CoT accuracy: {cot_accuracy:.2%}")  # ~95%+
```
Results and Analysis
| Technique | Accuracy |
|---|---|
| Prompt only | ~70% |
| + RAG (3 examples) | ~90% |
| + Chain-of-Thought | ~95%+ |
- RAG provides concrete, similar examples that ground the model's decision
- Chain-of-thought forces the model to reason through the classification logic, reducing errors from jumping to conclusions
- Claude's instruction following ensures it adheres to complex business rules
Production Considerations
- Latency: RAG adds embedding and retrieval time. Consider caching embeddings (see the sketch after this list) or using a vector database like Pinecone.
- Cost: More examples and longer prompts increase token usage. Tune `k` (the number of retrieved examples) to balance accuracy and cost.
- Explainability: Claude can output its reasoning, making it easy to audit misclassifications.
- Handling edge cases: Add a "None of the above" category for out-of-scope tickets.
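As a minimal sketch of the caching idea mentioned under Latency (an in-process memo of query embeddings; a persistent cache or a managed vector database would take its place in production):

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def embed_query_cached(query: str):
    # Repeated or duplicate tickets skip the embedding API call entirely
    return tuple(vo.embed([query], model="voyage-2").embeddings[0])

def retrieve_examples_cached(query, k=3):
    query_embedding = list(embed_query_cached(query))
    _, indices = nn_model.kneighbors([query_embedding], n_neighbors=k)
    return train_df.iloc[indices[0]]
```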
Full Code Example
Here's a complete, runnable script:
```python
import os

import pandas as pd
import voyageai
from anthropic import Anthropic
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# Setup
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
MODEL = "claude-3-opus-20240229"

# Load data
df = pd.read_csv("insurance_tickets.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Index training data
train_embeddings = vo.embed(train_df["ticket_text"].tolist(), model="voyage-2").embeddings
nn = NearestNeighbors(n_neighbors=3, metric="cosine")
nn.fit(train_embeddings)


def classify(query):
    # Retrieve the most similar training examples
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    _, indices = nn.kneighbors([q_emb])
    examples = train_df.iloc[indices[0]]
    example_text = "\n".join(
        f"Ticket: {row['ticket_text']}\nCategory: {row['category']}"
        for _, row in examples.iterrows()
    )

    # Classify with chain-of-thought; ask for the category alone on the last line
    response = client.messages.create(
        model=MODEL,
        max_tokens=200,
        system="You are an expert insurance ticket classifier. Reason step-by-step, then give the final category alone on the last line.",
        messages=[{
            "role": "user",
            "content": f"Examples:\n{example_text}\n\nClassify: {query}"
        }],
    )
    # Keep only the last line so the prediction can be compared to the label
    return response.content[0].text.strip().splitlines()[-1].strip()


# Test
correct = sum(1 for _, row in test_df.iterrows() if classify(row["ticket_text"]) == row["category"])
print(f"Accuracy: {correct / len(test_df):.2%}")
```
Key Takeaways
- Prompt engineering alone gets ~70% accuracy on complex classification tasks—good for simple cases, but insufficient for nuanced business rules.
- Adding RAG with 3 similar examples boosts accuracy to ~90% by grounding the model in concrete, relevant examples from your training data.
- Chain-of-thought reasoning pushes accuracy to 95%+ by forcing the model to reason step-by-step, reducing logical errors.
- This approach works with limited training data—you don't need thousands of labeled examples. A few hundred well-chosen examples are often enough.
- Explainability is built-in: Claude can output its reasoning, making it easy to audit, debug, and improve your classifier over time.