Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
This guide walks you through building an insurance support ticket classifier using Claude. You'll learn prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning to boost accuracy from 70% to over 95%—even with limited training data.
Classification is one of the most common and impactful use cases for large language models (LLMs). Whether you're routing support tickets, moderating content, or categorizing documents, getting classification right can save hours of manual work and improve customer satisfaction.
In this guide, you'll build a production-ready classification system using Claude that categorizes insurance support tickets into 10 distinct categories. You'll start with a simple prompt and progressively improve accuracy from roughly 70% to over 95% by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning.
Why Use Claude for Classification?
Traditional machine learning classifiers require large amounts of labeled training data and struggle with complex business rules or edge cases. Claude excels here because it:
- Handles complex business rules without needing thousands of examples
- Works with limited training data—sometimes just 10–20 examples per class
- Provides natural language explanations for every classification decision
- Adapts quickly to new categories or changing requirements
Prerequisites
Before diving in, make sure you have:
- Python 3.11+ installed
- An Anthropic API key
- A VoyageAI API key (optional—embeddings can be pre-computed)
- Basic familiarity with Python and classification concepts
Setup: Installing Dependencies
First, install the required packages:
```bash
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Then, set up your API keys and initialize the Claude client:
```python
import os
from anthropic import Anthropic

# Load API keys from environment
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set model name
MODEL_NAME = "claude-3-opus-20240229"
```
Step 1: Define Your Classification Problem
For this guide, we'll build an Insurance Support Ticket Classifier. The goal is to route incoming tickets to the right department by categorizing them into one of 10 categories. Here are the first four, with a code sketch after the list:
- Billing Inquiries – Questions about invoices, charges, fees, and premiums
- Policy Administration – Requests for policy changes, updates, or cancellations
- Claims Assistance – Questions about the claims process and filing procedures
- Coverage Explanations – Questions about what is covered under specific policy types
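In code, the category labels can live in a simple list that gets interpolated into the prompt. Here's a minimal sketch: the first four labels come from the list above, while the remaining six are left as a placeholder since this guide only names four.

```python
# The four categories named above; the guide's full taxonomy has ten.
CATEGORIES = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    # ...plus the remaining six categories from your taxonomy
]

# Newline-separated string to interpolate into the prompt below
categories = "\n".join(f"- {c}" for c in CATEGORIES)
```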
Step 2: Build a Simple Baseline Classifier
Let's start with a straightforward prompt that asks Claude to classify a ticket into one of the defined categories:
```python
def classify_ticket_baseline(ticket_text, categories):
    prompt = f"""You are an insurance support ticket classifier.

Classify the following ticket into exactly one of these categories:
{categories}

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
This baseline will likely achieve around 70% accuracy. The problem? Claude has no context about what each category really means, and it has no examples to learn from.
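You can sanity-check the baseline with a quick call. This sketch uses the `categories` string from Step 1; the ticket text is an invented example:

```python
# Hypothetical ticket for a quick smoke test
ticket = "I was charged twice for my premium this month. Can you explain the extra fee?"
print(classify_ticket_baseline(ticket, categories))
# Expected: "Billing Inquiries", possibly wrapped in extra words
# (a weakness the engineered prompt in Step 3 addresses)
```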
Step 3: Improve Accuracy with Prompt Engineering
To boost accuracy, we need to provide:
- Clear category definitions with examples of what each category includes
- Output formatting instructions to ensure consistent responses
- Few-shot examples showing correct classifications (sample structures are sketched after this list)
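The guide doesn't show what `categories_with_definitions` and `examples` contain, so here is one plausible shape for each. The wording is illustrative, not from the original:

```python
# Illustrative definitions; adapt the wording to your own taxonomy
categories_with_definitions = """\
Billing Inquiries: Questions about invoices, charges, fees, and premiums.
Policy Administration: Requests for policy changes, updates, or cancellations.
Claims Assistance: Questions about the claims process and filing procedures.
Coverage Explanations: Questions about what is covered under specific policy types."""

# Illustrative few-shot examples (2-3 per category works well)
examples = """\
Ticket: Why did my premium go up this month?
Category: Billing Inquiries

Ticket: I'd like to add my spouse to my auto policy.
Category: Policy Administration"""
```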
```python
def classify_ticket_engineered(ticket_text, categories_with_definitions, examples):
    prompt = f"""You are an expert insurance support ticket classifier.

Categories and their definitions:
{categories_with_definitions}

Here are some examples of correctly classified tickets:
{examples}

Classify the following ticket. Respond with ONLY the category name.

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
With clear definitions and 2–3 examples per category, accuracy typically jumps to 85–90%.
Step 4: Implement Retrieval-Augmented Generation (RAG)
For the biggest accuracy boost, we'll implement RAG. Instead of hardcoding examples, we'll embed our training data and, for each new ticket, retrieve the most relevant examples to include in the prompt. (For simplicity, this guide keeps the embeddings in memory; in production, a vector database plays the same role at scale.)
Create Embeddings for the Training Examples
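The snippets below assume `training_data` is already loaded as a list of labeled tickets. The guide doesn't show where it comes from, but the structure implied by the code is:

```python
# Assumed structure of training_data (the ticket texts are invented examples)
training_data = [
    {"text": "I was double-billed on my last invoice.", "category": "Billing Inquiries"},
    {"text": "How do I file a claim for hail damage?", "category": "Claims Assistance"},
    # ... 10-20 labeled examples per category
]
```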
```python
import voyageai
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Create embeddings for all training examples
training_texts = [example["text"] for example in training_data]
training_embeddings = vo.embed(training_texts, model="voyage-2").embeddings
```
Retrieve Relevant Examples at Classification Time
```python
def retrieve_examples(query, training_embeddings, training_data, k=3):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Compute similarity scores
    similarities = cosine_similarity([query_embedding], training_embeddings)[0]

    # Get top-k most similar examples
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [training_data[i] for i in top_indices]
```
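For example, a billing-related query should pull back billing examples. The query text here is invented for illustration:

```python
# Retrieve the 3 most similar training examples for a hypothetical query
nearest = retrieve_examples("Why was I charged a late fee?", training_embeddings, training_data)
for ex in nearest:
    print(ex["category"], "|", ex["text"])
```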
Combine RAG with Prompt Engineering
```python
def classify_ticket_rag(ticket_text, categories_with_definitions, training_embeddings, training_data):
    # Retrieve relevant examples
    relevant_examples = retrieve_examples(ticket_text, training_embeddings, training_data)

    # Format examples for the prompt
    examples_text = "\n".join([
        f"Ticket: {ex['text']}\nCategory: {ex['category']}"
        for ex in relevant_examples
    ])

    prompt = f"""You are an expert insurance support ticket classifier.

Categories and their definitions:
{categories_with_definitions}

Here are the most relevant examples for this ticket:
{examples_text}

Classify the following ticket. Respond with ONLY the category name.

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
With RAG, accuracy consistently reaches 95%+ because Claude gets the most relevant examples for each query.
Step 5: Add Chain-of-Thought Reasoning for Explainability
One of Claude's superpowers is providing natural language explanations. By adding a chain-of-thought (CoT) step, we get both the classification and a justification:
```python
def classify_ticket_cot(ticket_text, categories_with_definitions, training_embeddings, training_data):
    relevant_examples = retrieve_examples(ticket_text, training_embeddings, training_data)
    examples_text = "\n".join([
        f"Ticket: {ex['text']}\nCategory: {ex['category']}"
        for ex in relevant_examples
    ])

    prompt = f"""You are an expert insurance support ticket classifier.

Categories and their definitions:
{categories_with_definitions}

Relevant examples:
{examples_text}

Ticket: {ticket_text}

First, think step-by-step about which category this ticket belongs to. Then, provide your final answer.

Reasoning:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    full_response = response.content[0].text.strip()

    # Parse out the category from the reasoning
    # (In practice, you might ask Claude to output JSON with both fields)
    return full_response
```
Now you get both the classification and a human-readable explanation—critical for compliance and auditing in regulated industries like insurance.
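The comment in the code above hints at a more robust variant: ask Claude for JSON and parse both fields. Here's a minimal sketch of that idea; the exact prompt wording and field names are assumptions, not from the original guide:

```python
import json

def classify_ticket_cot_json(ticket_text, categories_with_definitions, training_embeddings, training_data):
    relevant_examples = retrieve_examples(ticket_text, training_embeddings, training_data)
    examples_text = "\n".join(
        f"Ticket: {ex['text']}\nCategory: {ex['category']}" for ex in relevant_examples
    )

    # Assumed prompt wording: request a single JSON object with both fields
    prompt = f"""You are an expert insurance support ticket classifier.

Categories and their definitions:
{categories_with_definitions}

Relevant examples:
{examples_text}

Ticket: {ticket_text}

Think step-by-step, then respond with ONLY a JSON object of the form:
{{"reasoning": "<your reasoning>", "category": "<category name>"}}"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    # Assumes the model returns only the JSON object; add error handling in production
    result = json.loads(response.content[0].text.strip())
    return result["category"], result["reasoning"]
```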
Step 6: Evaluate Your Classifier
Finally, test your classifier against a held-out test set:
```python
from sklearn.metrics import accuracy_score, classification_report

# test_data: a held-out list of {"text", "category"} dicts, same shape as training_data
predictions = []
true_labels = []

for ticket in test_data:
    pred = classify_ticket_rag(
        ticket["text"],
        categories_with_definitions,
        training_embeddings,
        training_data
    )
    predictions.append(pred)
    true_labels.append(ticket["category"])

accuracy = accuracy_score(true_labels, predictions)
print(f"Accuracy: {accuracy:.2%}")
print(classification_report(true_labels, predictions))
```
With the full pipeline, you should see accuracy above 95% with clear per-category precision and recall metrics.
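Since matplotlib and scikit-learn were installed in Setup, you can also visualize which categories get confused with one another. This sketch assumes the `predictions` and `true_labels` lists from the evaluation loop above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot a confusion matrix over the held-out test set
ConfusionMatrixDisplay.from_predictions(
    true_labels, predictions, xticks_rotation="vertical"
)
plt.tight_layout()
plt.show()
```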
Key Takeaways
- Start simple, then iterate. A baseline prompt gets ~70% accuracy. Adding clear definitions and few-shot examples boosts it to 85–90%. RAG pushes it past 95%.
- RAG is a game-changer for classification. By retrieving the most relevant examples for each query, you give Claude the context it needs without overwhelming it with irrelevant data.
- Chain-of-thought reasoning adds transparency. In regulated industries, being able to explain why a ticket was classified a certain way is just as important as the classification itself.
- Claude handles complex business rules with minimal data. You don't need thousands of labeled examples—10–20 per category is often enough to build a highly accurate classifier.
- This approach generalizes beyond insurance. The same pattern—prompt engineering + RAG + CoT—works for any classification problem, from content moderation to document routing.