Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+
This guide shows you how to build a high-accuracy insurance support ticket classifier using Claude. You'll learn prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning to boost classification accuracy from 70% to over 95%.
Large Language Models (LLMs) like Claude have transformed the classification landscape. Unlike traditional machine learning systems that require thousands of labeled examples and struggle with complex business rules, LLMs can achieve remarkable accuracy with limited data while providing natural language explanations for their decisions.
In this guide, you'll build a production-ready insurance support ticket classifier that categorizes tickets into 10 distinct categories. You'll learn how to progressively improve accuracy from a baseline of ~70% to over 95% by combining three powerful techniques: prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning.
Prerequisites
Before starting, ensure you have:
- Python 3.11+ installed
- An Anthropic API key
- Basic familiarity with Python and classification concepts
- A VoyageAI API key (used for the embedding-based retrieval in Steps 4 and 5; optional if you substitute another embedding provider)
Setup and Installation
First, install the required packages:
```bash
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Next, set up your environment and initialize the Claude client:
```python
import os
from anthropic import Anthropic

# Load API key from environment
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MODEL_NAME = "claude-3-opus-20240229"
```
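If you want to confirm the client is configured correctly before continuing, a quick throwaway request (entirely optional) looks like this:

```python
# Optional sanity check: a tiny request to confirm the API key and model work.
response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=20,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.content[0].text)
```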
Understanding the Problem
Insurance companies receive thousands of support tickets daily. Manually categorizing these tickets into departments like billing, claims, or policy administration is slow and error-prone. Our goal is to build an automated classifier that handles:
- Complex business rules (e.g., a ticket about "deductible" could be billing or claims depending on context)
- Limited training data (we'll work with just 200 examples)
- Explainable results (Claude provides reasoning for each classification)
Category Definitions
We'll classify tickets into 10 categories:
- Billing Inquiries – Invoices, charges, fees, premiums
- Policy Administration – Changes, renewals, cancellations
- Claims Assistance – Filing, status, documentation
- Coverage Explanations – Limits, exclusions, deductibles
- Account Management – Login, profile updates
- Underwriting – Risk assessment, eligibility
- Fraud & Compliance – Suspicious activity, regulatory
- Agent Support – Commission, tools
- Product Information – New offerings, features
- General Inquiry – Miscellaneous
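Keeping these definitions in one place makes them easy to feed into every prompt. Here's a minimal sketch; `CATEGORY_DEFINITIONS` is just an illustrative name, and the evaluation section later builds an equivalent `categories` string by hand:

```python
# Category names and short descriptions in one place.
# (CATEGORY_DEFINITIONS is an illustrative name, not defined elsewhere in this guide.)
CATEGORY_DEFINITIONS = {
    "Billing Inquiries": "Invoices, charges, fees, premiums",
    "Policy Administration": "Changes, renewals, cancellations",
    "Claims Assistance": "Filing, status, documentation",
    "Coverage Explanations": "Limits, exclusions, deductibles",
    "Account Management": "Login, profile updates",
    "Underwriting": "Risk assessment, eligibility",
    "Fraud & Compliance": "Suspicious activity, regulatory",
    "Agent Support": "Commission, tools",
    "Product Information": "New offerings, features",
    "General Inquiry": "Miscellaneous",
}

# The prompts below only need the names; append the descriptions too if you
# find the model confusing adjacent categories.
categories = "\n".join(f"- {name}" for name in CATEGORY_DEFINITIONS)
```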
Step 1: Data Preparation
We'll split our synthetic dataset into training (150 examples) and test (50 examples) sets. The training data will be used for few-shot examples and embedding generation.
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load your dataset
# Assuming df has columns: 'text' (ticket content) and 'label' (category)
df = pd.read_csv('insurance_tickets.csv')
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
```
Step 2: Baseline Classification with Prompt Engineering
Let's start with a simple zero-shot prompt. This is our baseline:
```python
def classify_ticket_baseline(ticket_text, categories):
    prompt = f"""Classify the following insurance support ticket into one of these categories:
{categories}
Ticket: {ticket_text}
Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
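A quick smoke test with an invented ticket (the text and expected label below are made up for illustration):

```python
# Invented ticket for a quick smoke test of the baseline classifier.
sample_ticket = "I was charged twice for my premium this month. Can you refund the duplicate payment?"
print(classify_ticket_baseline(sample_ticket, categories))
# Most likely prints something like: Billing Inquiries
```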
Result: ~70% accuracy. Not bad, but we can do much better.
Step 3: Improving Accuracy with Few-Shot Examples
Adding 3-5 carefully selected examples per category dramatically improves performance. The key is selecting examples that are representative and diverse.
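One simple way to assemble such a set is to sample a handful of training tickets per category; hand-picking usually works even better, but this sketch is a reasonable starting point (`fewshot_examples` is an illustrative name):

```python
# Sample up to 3 training examples per category for the static few-shot prompt.
# Hand-curated examples usually beat random samples, but this is a decent start.
fewshot_examples = pd.concat(
    [group.sample(n=min(3, len(group)), random_state=42)
     for _, group in train_df.groupby('label')]
).to_dict('records')
```

These records plug directly into the `examples` argument of the classifier below.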
```python
def classify_ticket_fewshot(ticket_text, categories, examples):
    # Build few-shot prompt
    example_text = ""
    for ex in examples:
        example_text += f"Ticket: {ex['text']}\nCategory: {ex['label']}\n\n"

    prompt = f"""You are an insurance ticket classifier. Classify the following ticket into one of these categories:
{categories}
Here are some examples:
{example_text}
Ticket: {ticket_text}
Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Result: ~82% accuracy. The examples provide crucial context.
Step 4: Implementing Retrieval-Augmented Generation (RAG)
Static few-shot examples only get you so far. For the best results, we need to dynamically retrieve the most relevant examples for each query. This is where RAG shines.
Building the Vector Database
We'll use embeddings to store and retrieve similar examples:
```python
import voyageai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

# Generate embeddings for training data
def get_embeddings(texts):
    response = vo.embed(texts, model="voyage-2")
    return response.embeddings

train_embeddings = get_embeddings(train_df['text'].tolist())
```
Retrieving Relevant Examples
For each new ticket, find the most similar training examples:
```python
def retrieve_similar_examples(query, k=5):
    query_embedding = get_embeddings([query])[0]
    # Calculate cosine similarity
    similarities = cosine_similarity([query_embedding], train_embeddings)[0]
    # Get top-k indices
    top_indices = np.argsort(similarities)[-k:][::-1]
    return train_df.iloc[top_indices]
```
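Before wiring retrieval into the classifier, it's worth eyeballing the neighbors it returns for a sample query (the ticket text here is made up):

```python
# Inspect which training tickets the retriever considers most similar.
neighbors = retrieve_similar_examples("How do I check the status of my windshield claim?", k=5)
print(neighbors[['label', 'text']])
```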
The RAG-Enhanced Classifier
Now combine retrieval with classification:
```python
def classify_ticket_rag(ticket_text, categories):
    # Retrieve similar examples
    similar = retrieve_similar_examples(ticket_text, k=5)

    # Build prompt with retrieved examples
    example_text = ""
    for _, row in similar.iterrows():
        example_text += f"Ticket: {row['text']}\nCategory: {row['label']}\n\n"

    prompt = f"""You are an insurance ticket classifier. Classify the following ticket into one of these categories:
{categories}
Here are the most relevant examples:
{example_text}
Ticket: {ticket_text}
Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Result: ~90% accuracy. Dynamic retrieval beats static examples.
Step 5: Adding Chain-of-Thought Reasoning
For the final accuracy boost, we ask Claude to reason step-by-step before giving the final answer. This is especially powerful for ambiguous cases.
```python
def classify_ticket_cot(ticket_text, categories):
    similar = retrieve_similar_examples(ticket_text, k=5)

    example_text = ""
    for _, row in similar.iterrows():
        example_text += f"Ticket: {row['text']}\nCategory: {row['label']}\n\n"

    prompt = f"""You are an insurance ticket classifier. Classify the following ticket into one of these categories:
{categories}
Relevant examples:
{example_text}
Ticket: {ticket_text}
First, think step-by-step about what this ticket is asking. Consider:
- What is the main topic or issue?
- What action is the customer requesting?
- Which category best fits based on the definitions and examples?
Then, provide your final answer in this format:
Category: [category name]
Reasoning: [brief explanation]"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the response
    content = response.content[0].text
    # Extract category (assuming format "Category: X")
    for line in content.split('\n'):
        if line.startswith('Category:'):
            return line.replace('Category:', '').strip()
    # Fall back to the raw response if no "Category:" line is found
    return content
```
Result: 95%+ accuracy. Chain-of-thought reasoning resolves edge cases.
Evaluation and Testing
Let's evaluate our final classifier against the test set:
```python
def evaluate_classifier(classifier, test_df, categories):
    correct = 0
    results = []

    for _, row in test_df.iterrows():
        predicted = classifier(row['text'], categories)
        actual = row['label']
        is_correct = predicted.lower() == actual.lower()
        correct += int(is_correct)
        results.append({
            'text': row['text'],
            'actual': actual,
            'predicted': predicted,
            'correct': is_correct
        })

    accuracy = correct / len(test_df)
    print(f"Accuracy: {accuracy:.2%}")
    return results

# Run evaluation
categories = """
- Billing Inquiries
- Policy Administration
- Claims Assistance
- Coverage Explanations
- Account Management
- Underwriting
- Fraud & Compliance
- Agent Support
- Product Information
- General Inquiry
"""

results = evaluate_classifier(classify_ticket_cot, test_df, categories)
```
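The overall number hides where the classifier struggles, so it's worth breaking accuracy down per category using the `results` list returned above:

```python
# Per-category accuracy from the evaluation results, to spot weak categories.
results_df = pd.DataFrame(results)
per_category_accuracy = results_df.groupby('actual')['correct'].mean().sort_values()
print(per_category_accuracy)
```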
Performance Summary
| Technique | Accuracy |
|---|---|
| Zero-shot baseline | ~70% |
| Few-shot (static) | ~82% |
| RAG (dynamic retrieval) | ~90% |
| RAG + Chain-of-Thought | 95%+ |
Production Considerations
When deploying this classifier in production:
- Caching: Cache embeddings for frequent queries to reduce API costs (a minimal sketch follows this list)
- Fallback handling: Implement a confidence threshold; route low-confidence predictions to human review
- Monitoring: Track accuracy over time and retrain embeddings as new labeled data arrives
- Latency: RAG retrieval adds ~200ms and chain-of-thought adds ~500ms. Consider whether real-time classification is actually required for your use case
- Cost optimization: Use Claude 3 Haiku for straightforward tickets, Sonnet for moderately complex ones, and Opus for the hardest cases
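For the caching point above, a minimal in-memory sketch (in production you would more likely persist the cache in Redis or a vector database; `get_embedding_cached` is just an illustrative name):

```python
# Simple in-process cache so identical tickets don't trigger repeat embedding calls.
# A production deployment would persist this (e.g., Redis or a vector database).
_embedding_cache = {}

def get_embedding_cached(text):
    if text not in _embedding_cache:
        _embedding_cache[text] = get_embeddings([text])[0]
    return _embedding_cache[text]
```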
Key Takeaways
- Start simple, iterate fast: Begin with a zero-shot prompt, then progressively add few-shot examples, RAG, and chain-of-thought reasoning. Each step provides measurable improvement.
- RAG dramatically improves accuracy: Dynamic retrieval of relevant examples outperforms static few-shot prompts by 8-10 percentage points, especially with diverse ticket types.
- Chain-of-thought reasoning resolves ambiguity: Asking Claude to reason step-by-step before classifying boosts accuracy by 5+ percentage points and provides explainable results.
- Limited data is not a barrier: With just 150 training examples, you can achieve 95%+ accuracy by combining prompt engineering with retrieval techniques.
- Always evaluate systematically: Use a held-out test set and track accuracy per category to identify weak spots in your classifier.