Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy

Learn how to build a production-ready classification system using Claude, combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning. This step-by-step guide takes you from 70% to 95%+ accuracy on complex insurance support tickets.
Classification is one of the most practical and high-impact use cases for large language models (LLMs). Whether you're routing customer support tickets, moderating content, or categorizing documents, getting classification right can save hours of manual work and dramatically improve response times.
In this guide, you'll build a production-ready classification system using Claude that categorizes insurance support tickets into 10 distinct categories. You'll start with a simple prompt-based approach (achieving ~70% accuracy) and progressively layer in advanced techniques—prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning—to push accuracy beyond 95%.
By the end, you'll have a reusable pattern you can adapt to any classification problem, even with limited training data.
Prerequisites
Before diving in, make sure you have:
- Python 3.11+ installed with basic familiarity
- An Anthropic API key (get one here)
- A VoyageAI API key (optional—embeddings are pre-computed in the cookbook)
- Basic understanding of classification problems
Understanding the Problem
Insurance companies receive thousands of support tickets daily. Manually categorizing these tickets is slow, error-prone, and expensive. The goal is to automatically classify each ticket into one of 10 categories:
- Billing Inquiries – Questions about invoices, charges, fees, premiums
- Policy Administration – Policy changes, renewals, cancellations
- Claims Assistance – Claims process, documentation, status
- Coverage Explanations – What's covered, limits, exclusions
- Account Management – Login issues, profile updates
- Underwriting – Risk assessment, policy issuance
- Fraud Reporting – Suspicious activity, fraud claims
- Compliance – Regulatory questions, legal requirements
- Agent Support – Agent tools, commissions
- General Inquiry – Anything that doesn't fit above
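Throughout the rest of the guide, every prompt and parser should draw from the same list of labels. One convenient way to enforce that is a single constant plus a validation helper. This is an illustrative sketch, not code from the cookbook; the names `CATEGORIES` and `is_valid_category` are my own:

```python
# The ten ticket categories used throughout this guide.
CATEGORIES = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    "Account Management",
    "Underwriting",
    "Fraud Reporting",
    "Compliance",
    "Agent Support",
    "General Inquiry",
]


def is_valid_category(label: str) -> bool:
    """Check whether a model output exactly matches a known category."""
    return label.strip() in CATEGORIES
```

Keeping the list in one place makes it trivial to spot when the model invents a label that isn't in your taxonomy.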
Step 1: Setting Up Your Environment
First, install the required packages:
```shell
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Next, load your API keys and initialize the Claude client:
```python
import os

from anthropic import Anthropic

# Load API keys from environment variables
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set your model
MODEL_NAME = "claude-3-opus-20240229"
```
Step 2: The Baseline – Simple Prompt Classification
Let's start with a straightforward approach: ask Claude to classify a ticket based on category definitions alone.
```python
def classify_ticket_baseline(ticket_text: str) -> str:
    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into one of these categories:

- Billing Inquiries
- Policy Administration
- Claims Assistance
- Coverage Explanations
- Account Management
- Underwriting
- Fraud Reporting
- Compliance
- Agent Support
- General Inquiry

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```
Result: This baseline typically achieves around 70% accuracy. It works for clear-cut cases but struggles with ambiguous tickets or edge cases.
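In practice the raw completion is not always a bare label; Claude may wrap it in a sentence like "The category is Billing Inquiries." Before scoring, it helps to normalize the output to a canonical category. The helper below is a sketch of one way to do this (the `normalize_category` name and the fallback-to-General-Inquiry policy are my own choices, and the category list is repeated here so the snippet is self-contained):

```python
# The ten categories from the taxonomy above.
CATEGORIES = [
    "Billing Inquiries", "Policy Administration", "Claims Assistance",
    "Coverage Explanations", "Account Management", "Underwriting",
    "Fraud Reporting", "Compliance", "Agent Support", "General Inquiry",
]


def normalize_category(raw_output: str) -> str:
    """Map a raw model response to one of the known categories."""
    cleaned = raw_output.strip()
    if cleaned in CATEGORIES:
        return cleaned
    # Fall back to a case-insensitive substring match for wrapped answers
    lowered = cleaned.lower()
    for category in CATEGORIES:
        if category.lower() in lowered:
            return category
    # Safe default when nothing matches the taxonomy
    return "General Inquiry"
```

Counting unparseable outputs before they are silently bucketed as General Inquiry is also a useful health metric for the classifier.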
Step 3: Improving with Few-Shot Prompting
Adding a few high-quality examples to the prompt can significantly boost performance. This is called few-shot prompting.
```python
def classify_ticket_few_shot(ticket_text: str) -> str:
    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into one of the 10 categories.

Examples:
- "I need to update my address on my auto policy" -> Policy Administration
- "When will my claim payment be issued?" -> Claims Assistance
- "Why did my premium increase this month?" -> Billing Inquiries

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```
Result: Accuracy jumps to around 80-85%. The examples help Claude understand the nuances between categories.
Step 4: Adding Chain-of-Thought Reasoning
Chain-of-thought (CoT) prompting asks the model to reason step-by-step before giving the final answer. This is especially powerful for complex classifications.
```python
def classify_ticket_cot(ticket_text: str) -> str:
    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into one of the 10 categories.

First, think step-by-step:
- What is the main topic of the ticket?
- What specific action or information is being requested?
- Which category best matches this?

End your answer with a final line in the form "Category: <name>".

Ticket: {ticket_text}

Reasoning:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    full_response = response.content[0].text.strip()

    # Extract the final category: scan for the "Category:" line,
    # falling back to the last line of the response.
    for line in reversed(full_response.split("\n")):
        if line.lower().startswith("category:"):
            return line.split(":", 1)[1].strip()
    return full_response.split("\n")[-1].strip()
```
Result: Accuracy climbs to 88-92%. The reasoning step helps Claude avoid jumping to conclusions.
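Parsing the category out of a free-form reasoning trace deserves its own small function, since the model sometimes appends trailing commentary after the answer. Here is a sketch of a defensive parser; the `extract_category` name is illustrative, and the "Category:" marker assumes you instruct the model to end its answer that way:

```python
def extract_category(full_response: str) -> str:
    """Pull the final category out of a chain-of-thought response.

    Scans from the bottom for a line like "Category: Claims Assistance",
    falling back to the last non-empty line of the response.
    """
    lines = [line.strip() for line in full_response.split("\n") if line.strip()]
    for line in reversed(lines):
        if line.lower().startswith("category:"):
            return line.split(":", 1)[1].strip()
    return lines[-1] if lines else ""
```

Because the parser tolerates both formats, it keeps working whether or not the model remembers the "Category:" convention on a given call.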
Step 5: Retrieval-Augmented Generation (RAG) – The Game Changer
RAG takes classification to the next level. Instead of hardcoding a few examples, you store your entire labeled dataset in a vector database and retrieve the most relevant examples for each new ticket.
5.1 Build the Vector Database
```python
import voyageai
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Initialize the VoyageAI client
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

# Example: embed your labeled training data
training_tickets = [
    "I need to cancel my policy",
    "Where is my claim payment?",
    "Why was I charged a late fee?",
    # ... more examples
]
training_labels = [
    "Policy Administration",
    "Claims Assistance",
    "Billing Inquiries",
    # ... corresponding labels
]

# Generate embeddings for all training tickets
training_embeddings = vo.embed(
    training_tickets,
    model="voyage-2",
).embeddings
```
5.2 Retrieve Relevant Examples
```python
def retrieve_similar_tickets(query: str, k: int = 3):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Compute cosine similarity against every training embedding
    similarities = cosine_similarity([query_embedding], training_embeddings)[0]

    # Indices of the top-k most similar tickets, best match first
    top_k_indices = np.argsort(similarities)[-k:][::-1]

    # Return the most similar tickets and their labels
    similar_tickets = [training_tickets[i] for i in top_k_indices]
    similar_labels = [training_labels[i] for i in top_k_indices]
    return similar_tickets, similar_labels
```
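To see the top-k logic in isolation, the same `argsort` pattern can be run on hand-written toy vectors, with cosine similarity computed directly in NumPy. No API calls are involved; the 2-D "embeddings" and labels below are made up purely for illustration:

```python
import numpy as np

# Toy 2-D "embeddings" for three training tickets (made up for illustration)
toy_embeddings = np.array([
    [1.0, 0.0],   # ticket 0
    [0.0, 1.0],   # ticket 1
    [0.9, 0.1],   # ticket 2
])
toy_labels = ["Billing Inquiries", "Claims Assistance", "Billing Inquiries"]

# A query vector pointing roughly the same way as tickets 0 and 2
query = np.array([1.0, 0.05])

# Cosine similarity: dot product divided by the product of the norms
norms = np.linalg.norm(toy_embeddings, axis=1) * np.linalg.norm(query)
similarities = toy_embeddings @ query / norms

# Same top-k pattern as retrieve_similar_tickets: best match first
k = 2
top_k = np.argsort(similarities)[-k:][::-1]
top_labels = [toy_labels[i] for i in top_k]
# top_k is [0, 2]: both nearest neighbors are Billing Inquiries tickets
```

`np.argsort` sorts ascending, so slicing the last `k` indices and reversing yields the highest similarities first, exactly as in the retrieval function above.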
5.3 Classify with RAG
```python
def classify_ticket_rag(ticket_text: str) -> str:
    # Retrieve the most similar labeled examples
    similar_tickets, similar_labels = retrieve_similar_tickets(ticket_text, k=3)

    # Format the retrieved examples for the prompt
    examples = "\n".join(
        f'- "{ticket}" -> {label}'
        for ticket, label in zip(similar_tickets, similar_labels)
    )

    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into one of the 10 categories.

Here are some similar tickets and their correct categories:
{examples}

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```
Result: With RAG, accuracy reaches 95%+. The model now has dynamic, contextually relevant examples for every query.
Step 6: Testing and Evaluation
To properly evaluate your classifier, split your data into training and test sets. Then run the classifier on the test set and compare predictions to ground truth labels.
```python
from sklearn.metrics import accuracy_score, classification_report


def evaluate_classifier(test_tickets, test_labels, classifier_fn):
    predictions = [classifier_fn(ticket) for ticket in test_tickets]
    accuracy = accuracy_score(test_labels, predictions)
    report = classification_report(test_labels, predictions)
    return accuracy, report


# Example usage
accuracy, report = evaluate_classifier(test_tickets, test_labels, classify_ticket_rag)
print(f"Accuracy: {accuracy:.2%}")
print(report)
```
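Before spending API calls on a full test set, you can sanity-check the evaluation loop by plugging in a deterministic stand-in classifier. The lookup-table "classifier" below is obviously fake, and the accuracy is computed by hand so the sketch runs with no dependencies; it mirrors what `accuracy_score` would report:

```python
# A fake classifier: a lookup table standing in for classify_ticket_rag,
# so the evaluation loop can be exercised without any API calls.
fake_predictions = {
    "Why was I charged twice?": "Billing Inquiries",
    "How do I reset my password?": "Account Management",
    "Is flood damage covered?": "Claims Assistance",  # deliberately wrong
}


def fake_classifier(ticket: str) -> str:
    return fake_predictions[ticket]


test_tickets = list(fake_predictions)
test_labels = ["Billing Inquiries", "Account Management", "Coverage Explanations"]

# Plain accuracy, computed by hand
predictions = [fake_classifier(t) for t in test_tickets]
correct = sum(p == y for p, y in zip(predictions, test_labels))
accuracy = correct / len(test_labels)
# Two of three predictions match, so accuracy is 2/3
```

A dry run like this catches harness bugs (label mismatches, ordering errors) before they get conflated with real model errors.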
Putting It All Together: The Complete Pipeline
Here's the final, production-ready classification function that combines all techniques:
```python
def classify_insurance_ticket(ticket_text: str) -> dict:
    """
    Classify an insurance support ticket with explainable results.

    Returns:
        dict with 'category' and 'reasoning'
    """
    # Step 1: Retrieve similar examples
    similar_tickets, similar_labels = retrieve_similar_tickets(ticket_text, k=5)

    # Step 2: Build a prompt with the examples and chain-of-thought instructions
    examples = "\n".join(
        f'- "{t}" -> {l}'
        for t, l in zip(similar_tickets, similar_labels)
    )

    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket.

Relevant examples:
{examples}

Think step-by-step:
- What is the main topic?
- What action is requested?
- Which category fits best?

End your answer with a final line in the form "Category: <name>".

Ticket: {ticket_text}

Reasoning:"""

    # Step 3: Get a response from Claude
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    full_response = response.content[0].text.strip()

    # Step 4: Parse the response into reasoning and a final category
    lines = full_response.split("\n")
    category = lines[-1]  # The last line holds the category
    if category.lower().startswith("category:"):
        category = category.split(":", 1)[1].strip()
    reasoning = "\n".join(lines[:-1])  # Everything else is reasoning

    return {
        "category": category,
        "reasoning": reasoning,
    }
```
Key Takeaways
- Start simple, then iterate. Begin with a baseline prompt, then add few-shot examples, chain-of-thought reasoning, and finally RAG. Each step adds measurable accuracy gains.
- RAG is a game-changer for classification. By dynamically retrieving the most relevant examples for each query, you can achieve 95%+ accuracy even with limited training data.
- Chain-of-thought reasoning improves explainability. Asking Claude to reason step-by-step not only boosts accuracy but also provides a transparent audit trail for every classification decision.
- This pattern is reusable. The techniques you've learned here—prompt engineering, few-shot learning, CoT, and RAG—can be applied to any classification problem, from content moderation to document routing.
- Always evaluate. Use a held-out test set and metrics like accuracy and classification report to measure real-world performance before deploying.