Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
You'll learn to build a high-accuracy classification system using Claude that categorizes insurance support tickets into 10 categories, improving accuracy from 70% to 95%+ through prompt engineering, RAG, and chain-of-thought reasoning.
Customer support ticket classification is a classic problem in the insurance industry—but traditional machine learning approaches often struggle with complex business rules, limited training data, and the need for explainable results. Large Language Models (LLMs) like Claude offer a powerful alternative.
In this guide, you'll build a production-ready classification system that categorizes insurance support tickets into 10 distinct categories. You'll learn how to progressively improve classification accuracy from a baseline of ~70% to over 95% by combining three key techniques:
- Prompt Engineering – Crafting effective prompts that guide Claude's reasoning
- Retrieval-Augmented Generation (RAG) – Providing relevant examples at inference time
- Chain-of-Thought Reasoning – Encouraging step-by-step analysis before classification
Prerequisites
Before diving in, make sure you have:
- Python 3.11+ and basic familiarity with the language
- An Anthropic API key – available from the Anthropic Console
- A VoyageAI API key (optional – embeddings are pre-computed in the cookbook)
- Basic understanding of classification problems
Step 1: Setting Up Your Environment
First, install the required packages:
```shell
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Next, set up your API keys and initialize the Claude client:
```python
import os
from anthropic import Anthropic

# Load API keys from environment variables
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY")
VOYAGE_API_KEY = os.environ.get("VOYAGE_API_KEY")

# Initialize the Claude client
client = Anthropic(api_key=ANTHROPIC_API_KEY)
MODEL_NAME = "claude-3-opus-20240229"  # or claude-3-sonnet-20240229 for faster/cheaper runs
```
Step 2: Understanding the Problem & Data
We'll build a classifier for an insurance company that receives thousands of support tickets daily. The goal is to automatically route each ticket to the correct department by categorizing it into one of 10 categories.
Category Definitions
Here are the 10 categories we'll use (synthetically generated by Claude 3 Opus):
- Billing Inquiries – Questions about invoices, charges, fees, premiums, payment methods
- Policy Administration – Policy changes, updates, cancellations, renewals
- Claims Assistance – Claims process, filing procedures, claim status
- Coverage Explanations – What's covered, limits, exclusions, deductibles
- Account Management – Login issues, profile updates, password resets
- Document Requests – Requesting policy documents, ID cards, certificates
- Agent Assistance – Finding agents, agent contact info, agent changes
- Complaints & Feedback – Service complaints, feedback, escalations
- Fraud & Security – Suspicious activity, fraud reporting, security concerns
- General Inquiries – Other questions not fitting above categories
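The code below passes these names around as a `categories` list. Assuming the `category` column in your dataset uses the same names, it can be defined as:

```python
# The 10 ticket categories, in the order used for numbering (1-10)
categories = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    "Account Management",
    "Document Requests",
    "Agent Assistance",
    "Complaints & Feedback",
    "Fraud & Security",
    "General Inquiries",
]
```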
Load and Prepare the Data
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load your dataset (example structure: "ticket_text" and "category" columns)
df = pd.read_csv("insurance_tickets.csv")

# Split into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
```
Step 3: Baseline Classification with Prompt Engineering
Let's start with a simple zero-shot classification prompt. This will give us our baseline accuracy.
```python
def classify_ticket_zero_shot(ticket_text, categories):
    """Classify a ticket using zero-shot prompting."""
    category_descriptions = "\n".join([f"{i+1}. {cat}" for i, cat in enumerate(categories)])

    prompt = f"""You are an insurance support ticket classifier.

Classify the following ticket into exactly one of these categories:

{category_descriptions}

Ticket: {ticket_text}

Respond with ONLY the category number (1-10)."""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```
Expected baseline accuracy: ~70-75%. Not bad, but we can do much better.
Step 4: Improving Accuracy with RAG (Retrieval-Augmented Generation)
The key insight: instead of relying solely on Claude's training data, we can retrieve the most similar examples from our training set and include them in the prompt. This dramatically improves accuracy.
Create a Vector Database
```python
import numpy as np
import voyageai

vo = voyageai.Client(api_key=VOYAGE_API_KEY)

# Generate embeddings for the training data
train_texts = train_df["ticket_text"].tolist()
train_embeddings = vo.embed(train_texts, model="voyage-2").embeddings

# Store in a simple numpy array for similarity search
train_embeddings = np.array(train_embeddings)
```
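Since embedding the training set costs API calls, it's worth caching the array to disk so repeated runs don't re-embed everything. A small sketch using numpy (the cache filename and the stand-in embed function are arbitrary choices for illustration; real code would pass a wrapper around `vo.embed`):

```python
import os
import numpy as np

def load_or_compute_embeddings(texts, embed_fn, cache_path="train_embeddings.npy"):
    """Load cached embeddings if present; otherwise compute and cache them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.array(embed_fn(texts))
    np.save(cache_path, embeddings)
    return embeddings

# Stand-in embed function for illustration only (maps each text to a 2-D vector)
fake_embed = lambda texts: [[float(len(t)), 1.0] for t in texts]

embs = load_or_compute_embeddings(["hi", "hello"], fake_embed, "demo_cache.npy")
print(embs.shape)  # -> (2, 2)
os.remove("demo_cache.npy")  # clean up the demo cache file
```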
Implement Similarity Search
```python
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_examples(query, k=3):
    """Find the k most similar training examples to the query."""
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    similarities = cosine_similarity([query_embedding], train_embeddings)[0]
    top_indices = np.argsort(similarities)[-k:][::-1]

    examples = []
    for idx in top_indices:
        examples.append({
            "text": train_df.iloc[idx]["ticket_text"],
            "category": train_df.iloc[idx]["category"],
        })
    return examples
```
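The retrieval step itself is just a top-k argsort over cosine similarities. A self-contained toy version (synthetic 2-D vectors standing in for Voyage embeddings) shows the mechanics without any API calls:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy "training" embeddings: three tickets in a 2-D embedding space
toy_embeddings = np.array([
    [1.0, 0.0],   # index 0: e.g. a billing-like ticket
    [0.0, 1.0],   # index 1: e.g. a claims-like ticket
    [0.9, 0.1],   # index 2: another billing-like ticket
])

# A query embedding pointing close to the billing direction
query = np.array([[1.0, 0.05]])

similarities = cosine_similarity(query, toy_embeddings)[0]
top_indices = np.argsort(similarities)[-2:][::-1]  # top-2, most similar first
print(top_indices)  # -> [0 2]
```

The two billing-like vectors win, in order of similarity, which is exactly what `find_similar_examples` relies on.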
Augment the Prompt with Retrieved Examples
```python
def classify_ticket_with_rag(ticket_text, categories):
    """Classify using RAG: retrieve similar examples and include them in the prompt."""
    similar_examples = find_similar_examples(ticket_text, k=3)

    examples_text = ""
    for i, ex in enumerate(similar_examples):
        examples_text += f"Example {i+1}:\nTicket: {ex['text']}\nCategory: {ex['category']}\n\n"

    category_descriptions = "\n".join([f"{i+1}. {cat}" for i, cat in enumerate(categories)])

    prompt = f"""You are an insurance support ticket classifier.

Here are some examples of correctly classified tickets:

{examples_text}

Now classify the following ticket into exactly one of these categories:

{category_descriptions}

Ticket: {ticket_text}

Respond with ONLY the category number (1-10)."""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```
Expected accuracy with RAG: ~85-90%. A significant improvement!
Step 5: Chain-of-Thought Reasoning for 95%+ Accuracy
The final technique: ask Claude to reason step-by-step before outputting the final category. This helps with ambiguous cases and complex business rules.
```python
def classify_ticket_cot(ticket_text, categories):
    """Classify using chain-of-thought reasoning + RAG."""
    similar_examples = find_similar_examples(ticket_text, k=3)

    examples_text = ""
    for i, ex in enumerate(similar_examples):
        examples_text += f"Example {i+1}:\nTicket: {ex['text']}\nCategory: {ex['category']}\n\n"

    category_descriptions = "\n".join([f"{i+1}. {cat}" for i, cat in enumerate(categories)])

    prompt = f"""You are an insurance support ticket classifier.

Here are some examples of correctly classified tickets:

{examples_text}

Categories:

{category_descriptions}

Ticket to classify: {ticket_text}

First, think step-by-step about which category best fits this ticket. Consider:
- What is the main topic of the ticket?
- Which category definition matches best?
- Are there any edge cases or ambiguities?

Then, on the last line, output ONLY the category number (1-10) in this format:
Category: [number]"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,  # enough for brief reasoning plus the final answer line
        messages=[{"role": "user", "content": prompt}],
    )

    # Extract the final category from the response
    full_response = response.content[0].text
    # Parse for a line of the form "Category: X" (tolerating literal brackets,
    # since the prompt shows the format as "Category: [number]")
    for line in full_response.split("\n"):
        line = line.strip()
        if line.startswith("Category:"):
            return line.split(":", 1)[1].strip().strip("[]")
    return full_response.strip()  # fallback: return the raw response
```
Expected accuracy with CoT + RAG: 95%+
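Because the model's reasoning precedes its answer, the parsing step at the end is worth isolating and testing on its own. A standalone version of the extractor (the sample response text is made up for illustration):

```python
def extract_category(full_response: str) -> str:
    """Pull the final 'Category: X' answer out of a chain-of-thought response."""
    for line in full_response.split("\n"):
        line = line.strip()
        if line.startswith("Category:"):
            return line.split(":", 1)[1].strip().strip("[]")
    return full_response.strip()  # fallback: no answer line found

sample = (
    "The ticket mentions an unexpected charge on the customer's invoice.\n"
    "That points to billing rather than policy changes.\n"
    "Category: 1"
)
print(extract_category(sample))  # -> "1"
```

Keeping this as a pure function makes it trivial to unit-test against edge cases (bracketed answers, missing answer lines) without spending API calls.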
Step 6: Testing and Evaluation
Now let's evaluate our final classifier on the test set:
```python
from sklearn.metrics import accuracy_score, classification_report

def evaluate_classifier(classifier_fn, test_df, categories):
    """Evaluate a classifier on the test dataset."""
    predictions = []
    true_labels = []

    for _, row in test_df.iterrows():
        predicted = classifier_fn(row["ticket_text"], categories)
        # The classifiers return a category number; map it back to the name
        # so it is comparable to the dataset labels (this assumes the
        # "category" column stores names, not numbers)
        if predicted.isdigit() and 1 <= int(predicted) <= len(categories):
            predicted = categories[int(predicted) - 1]
        predictions.append(predicted)
        true_labels.append(row["category"])

    accuracy = accuracy_score(true_labels, predictions)
    print(f"Accuracy: {accuracy:.2%}")
    print("\nClassification Report:")
    print(classification_report(true_labels, predictions))
    return accuracy

# Evaluate the final classifier
final_accuracy = evaluate_classifier(classify_ticket_cot, test_df, categories)
```
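Beyond overall accuracy, a confusion matrix shows which categories get mixed up with each other. A sketch on toy labels (a real run would use the `true_labels` and `predictions` lists collected inside the evaluation loop):

```python
from sklearn.metrics import confusion_matrix

# Toy labels to illustrate the output shape only
true_labels = ["Billing Inquiries", "Claims Assistance", "Billing Inquiries"]
predictions = ["Billing Inquiries", "Billing Inquiries", "Billing Inquiries"]

labels = ["Billing Inquiries", "Claims Assistance"]
cm = confusion_matrix(true_labels, predictions, labels=labels)
print(cm)
# Rows are true categories, columns are predicted categories:
# the off-diagonal 1 is a Claims ticket misrouted to Billing
```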
Best Practices for Production
- Cache embeddings – Generate embeddings once and store them in a vector database like Pinecone or Weaviate for production use.
- Monitor drift – Track accuracy over time; retrain/re-evaluate as new ticket types emerge.
- Handle edge cases – Add a "Confidence Threshold" – if Claude's confidence is low, route to a human reviewer.
- Log everything – Store prompts, responses, and classifications for audit and improvement.
- Use the right model – Claude 3 Opus for highest accuracy, Claude 3 Sonnet for cost-sensitive applications.
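One way to implement the confidence-threshold idea above is to ask Claude to append a self-reported confidence line to its answer and route anything below a cutoff to a human reviewer. The two-line answer format and the `route_ticket` helper below are illustrative assumptions, not part of the cookbook:

```python
def route_ticket(model_response: str, min_confidence: str = "high") -> str:
    """Decide whether to auto-route a ticket or escalate to a human reviewer.

    Assumes the prompt asked Claude to end its answer with two lines:
        Category: <number>
        Confidence: high | medium | low
    """
    category, confidence = None, "low"
    for line in model_response.split("\n"):
        line = line.strip()
        if line.startswith("Category:"):
            category = line.split(":", 1)[1].strip()
        elif line.startswith("Confidence:"):
            confidence = line.split(":", 1)[1].strip().lower()

    levels = {"low": 0, "medium": 1, "high": 2}
    # Escalate if no category was parsed or confidence is below the cutoff
    if category is None or levels.get(confidence, 0) < levels[min_confidence]:
        return "human_review"
    return f"auto_route:{category}"

print(route_ticket("Category: 3\nConfidence: high"))  # -> "auto_route:3"
print(route_ticket("Category: 3\nConfidence: low"))   # -> "human_review"
```

Unparseable responses fall through to human review by default, which is the safe failure mode for a routing system.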
Key Takeaways
- Start simple, then layer complexity – Begin with zero-shot prompting, add RAG for context, then chain-of-thought for reasoning. Each layer adds meaningful accuracy gains.
- RAG dramatically improves accuracy – Providing 3-5 similar examples at inference time can boost accuracy by 15-20 percentage points without any fine-tuning.
- Chain-of-thought reasoning handles ambiguity – Asking Claude to reason step-by-step before outputting a classification helps resolve edge cases and complex business rules.
- This framework is reusable – The same techniques apply to any classification problem: customer support routing, content moderation, document sorting, and more.
- Explainability is built-in – Unlike traditional ML classifiers, Claude can provide natural language explanations for its decisions, making it easier to audit and debug.