Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
This guide teaches you to build a high-accuracy classification system using Claude by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning. You'll progress from 70% to 95%+ accuracy on a real-world insurance ticket classification problem.
Large Language Models (LLMs) have transformed classification tasks, especially where traditional ML struggles with complex business rules or limited training data. In this guide, you'll build a production-ready classification system that categorizes insurance support tickets into 10 categories, progressively improving accuracy from ~70% to 95%+ using Claude, prompt engineering, and Retrieval-Augmented Generation (RAG).
Why LLMs for Classification?
Traditional classification systems often require:
- Large labeled datasets
- Extensive feature engineering
- Retraining when business rules change
LLMs like Claude sidestep these requirements by:
- Understanding natural language instructions and business rules directly
- Working effectively with few or zero examples
- Providing explainable, natural language justifications for each classification
- Adapting quickly to new categories without retraining
Prerequisites
Before starting, ensure you have:
- Python 3.11+ installed
- An Anthropic API key
- Basic familiarity with Python and classification concepts
- (Optional) A VoyageAI API key for embeddings
Setup
First, install the required packages:
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
Then set up your API client:
import anthropic
import os
client = anthropic.Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY")
)
MODEL_NAME = "claude-3-opus-20240229" # or claude-3-sonnet for faster/cheaper
Step 1: Define Your Classification Problem
We'll build an insurance support ticket classifier with 10 categories. Here are the first four (the full set is in the source notebook):
- Billing Inquiries – Questions about invoices, charges, fees, premiums, payment methods
- Policy Administration – Policy changes, cancellations, renewals, coverage options
- Claims Assistance – Claims process, documentation, status, payout timelines
- Coverage Explanations – What's covered, limits, exclusions, deductibles
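One way to keep these definitions consistent across prompts is to store them in a single mapping and render the category list from it. The sketch below covers only the four categories defined above; `CATEGORIES` and `render_categories` are illustrative names, not part of the original notebook.

```python
from typing import Dict

# Illustrative only: the four categories described above, kept in one place
# so that every prompt is rendered from the same definitions.
CATEGORIES: Dict[str, str] = {
    "Billing Inquiries": "Questions about invoices, charges, fees, premiums, payment methods",
    "Policy Administration": "Policy changes, cancellations, renewals, coverage options",
    "Claims Assistance": "Claims process, documentation, status, payout timelines",
    "Coverage Explanations": "What's covered, limits, exclusions, deductibles",
}


def render_categories(categories: Dict[str, str]) -> str:
    """Render the mapping as one '- name - description' line per category."""
    return "\n".join(f"- {name} - {desc}" for name, desc in categories.items())


print(render_categories(CATEGORIES))
```

Changing a definition then means editing one dictionary entry rather than every prompt string.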
Step 2: Start with a Simple Prompt (Baseline ~70%)
Let's begin with a straightforward prompt that asks Claude to classify a ticket from the category names alone:
def classify_ticket_simple(ticket_text):
    # Baseline: category names only, no definitions or examples
    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into exactly one of these categories:
- Billing Inquiries
- Policy Administration
- Claims Assistance
- Coverage Explanations
- Account Management
- Underwriting
- Fraud & Compliance
- Agent Support
- Product Information
- General Inquiry
Respond with ONLY the category name.
Ticket: {ticket_text}
Classification:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    # The first content block holds the model's text reply
    return response.content[0].text
This simple approach typically achieves around 70% accuracy. It works for straightforward cases but struggles with:
- Ambiguous tickets that could fit multiple categories
- Edge cases requiring nuanced understanding
- Tickets with industry-specific terminology
Step 3: Add Chain-of-Thought Reasoning (Improves to ~85%)
Chain-of-thought (CoT) prompting dramatically improves accuracy by asking Claude to reason step-by-step before giving the final answer:
def classify_ticket_cot(ticket_text):
    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into exactly one of these categories.
Categories:
- Billing Inquiries - Questions about invoices, charges, fees, premiums, payment methods
- Policy Administration - Policy changes, cancellations, renewals, coverage options
- Claims Assistance - Claims process, documentation, status, payout timelines
- Coverage Explanations - What's covered, limits, exclusions, deductibles
... (all 10 categories)
First, think step-by-step about what the customer is asking about. Consider:
- What is the main topic of their question?
- What specific action or information are they requesting?
- Which category best matches their primary concern?
Then provide your final classification.
Ticket: {ticket_text}
Reasoning:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
By asking Claude to "think out loud," you get:
- Higher accuracy (~85%) because the model works through ambiguity
- Explainable results – you can see why Claude chose a category
- Better handling of edge cases
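Because the response now contains free-form reasoning followed by the answer, downstream code needs a reliable way to pull out the final category. One common approach, which the prompt above does not use, is to ask for the answer inside tags and extract it with a regular expression. The `<category>` tag name, `ANSWER_FORMAT`, and `extract_tagged_category` are all assumptions made for this sketch.

```python
import re
from typing import Optional

# Hypothetical instruction that could be appended to the chain-of-thought prompt.
ANSWER_FORMAT = (
    "After your reasoning, give the final answer on its own line as "
    "<category>Category Name</category>."
)

_CATEGORY_TAG = re.compile(r"<category>(.*?)</category>", re.DOTALL | re.IGNORECASE)


def extract_tagged_category(response_text: str) -> Optional[str]:
    """Return the text of the last <category> tag, or None if there is none."""
    matches = _CATEGORY_TAG.findall(response_text)
    return matches[-1].strip() if matches else None


sample = (
    "The customer asks why their premium went up this month.\n"
    "<category>Billing Inquiries</category>"
)
print(extract_tagged_category(sample))
```

Taking the last match rather than the first guards against the model quoting a tag while it is still reasoning.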
Step 4: Implement Retrieval-Augmented Generation (RAG) for 95%+ Accuracy
RAG supercharges your classifier by providing relevant examples from your training data. Here's how it works:
- Create embeddings for all your training examples
- Store them in a vector database (or simple in-memory index)
- At classification time, find the most similar examples to the new ticket
- Include those examples in the prompt as few-shot examples
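These four steps can be illustrated without any API calls. The sketch below uses tiny hand-written vectors in place of real embeddings (an assumption made so that it runs offline) and plain Python for cosine similarity. The tickets, vectors, and helper names are all hypothetical.

```python
import math
from typing import List, Tuple

# Toy in-memory index: (ticket text, label, embedding). Real embeddings would
# come from an embedding model; these 3-dimensional vectors are made up.
INDEX = [
    ("Why was I charged twice this month?", "Billing Inquiries", [0.9, 0.1, 0.0]),
    ("How do I file a claim for hail damage?", "Claims Assistance", [0.1, 0.9, 0.1]),
    ("Is flood damage covered by my policy?", "Coverage Explanations", [0.0, 0.3, 0.9]),
]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec: List[float], k: int = 2) -> List[Tuple[str, str]]:
    """Return the k most similar (text, label) pairs, most similar first."""
    scored = sorted(INDEX, key=lambda row: cosine(query_vec, row[2]), reverse=True)
    return [(text, label) for text, label, _ in scored[:k]]


def format_few_shot(examples: List[Tuple[str, str]]) -> str:
    """Format retrieved examples as a few-shot block for the prompt."""
    return "\n\n".join(f"Ticket: {text}\nCategory: {label}" for text, label in examples)


query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a new billing question
print(format_few_shot(top_k(query_vec)))
```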
Creating the Embedding Index
import voyageai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Create embeddings for training data
def create_embeddings(texts):
    response = vo.embed(texts, model="voyage-2", input_type="document")
    return response.embeddings

# Example: embed your training tickets
training_texts = ["I need to update my payment method...", ...]  # Your training data
training_embeddings = create_embeddings(training_texts)
training_labels = ["Billing Inquiries", ...]  # Corresponding labels
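Training embeddings do not change between requests, so they can be computed once and reused. The sketch below shows one possible file-based cache keyed by a hash of each text. `cached_embeddings` is a hypothetical helper, and `embed_fn` stands in for a function such as `create_embeddings`; a fake embedder is used here so the example runs without an API key.

```python
import hashlib
import json
import os
import tempfile
from typing import Callable, Dict, List


def cached_embeddings(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    cache_path: str,
) -> List[List[float]]:
    """Embed texts, reusing vectors stored in a JSON file keyed by text hash."""
    cache: Dict[str, List[float]] = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)

    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    # Only texts that are not cached yet are sent to the embedder (deduplicated).
    missing = {k: t for k, t in zip(keys, texts) if k not in cache}
    if missing:
        vectors = embed_fn(list(missing.values()))
        cache.update(zip(missing.keys(), vectors))
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return [cache[k] for k in keys]


calls = []  # records each batch sent to the fake embedder


def fake_embed(batch: List[str]) -> List[List[float]]:
    calls.append(list(batch))
    return [[float(len(t)), 0.0] for t in batch]


path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
first = cached_embeddings(["a", "bb"], fake_embed, path)
second = cached_embeddings(["bb", "ccc"], fake_embed, path)
print(calls)  # the second call only embeds the text that was not cached
```

A JSON file is enough for a small training set; a larger one would more likely use a binary format or a vector database.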
Retrieving Similar Examples
def find_similar_examples(query, k=3):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    # Calculate similarities
    similarities = cosine_similarity([query_embedding], training_embeddings)[0]
    # Get top-k indices
    top_indices = np.argsort(similarities)[-k:][::-1]
    # Return the most similar examples
    examples = []
    for idx in top_indices:
        examples.append({
            "text": training_texts[idx],
            "label": training_labels[idx],
            "similarity": similarities[idx]
        })
    return examples
The RAG-Enhanced Classification Prompt
def classify_ticket_rag(ticket_text):
    # Retrieve similar examples
    similar_examples = find_similar_examples(ticket_text, k=3)
    # Build examples section
    examples_section = ""
    for i, ex in enumerate(similar_examples, 1):
        examples_section += f"Example {i}:\nTicket: {ex['text']}\nCategory: {ex['label']}\n\n"
    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into exactly one of these categories.
Categories:
- Billing Inquiries
- Policy Administration
- Claims Assistance
... (all 10 categories)
Here are some similar examples from our database:
{examples_section}
First, think step-by-step about what the customer is asking. Then provide your final classification.
Ticket to classify: {ticket_text}
Reasoning:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Step 5: Evaluate Your Classifier
Create a test set and measure accuracy:
from sklearn.metrics import accuracy_score, classification_report

def evaluate_classifier(classifier_func, test_tickets, test_labels):
    predictions = []
    for ticket in test_tickets:
        result = classifier_func(ticket)
        # Parse the category from the response.
        # parse_category (a helper you supply) maps the raw response text
        # to one of the category names.
        predicted_category = parse_category(result)
        predictions.append(predicted_category)
    accuracy = accuracy_score(test_labels, predictions)
    print(f"Accuracy: {accuracy:.2%}")
    print("\nClassification Report:")
    print(classification_report(test_labels, predictions))
    return accuracy
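`evaluate_classifier` relies on a `parse_category` helper that the guide does not show. A minimal sketch, assuming the final answer is the category name mentioned last in the response (plausible for chain-of-thought output, but an assumption nonetheless), might look like this. `CATEGORY_NAMES` simply repeats the ten names from the baseline prompt.

```python
from typing import List, Optional

# Category names as listed in the baseline prompt.
CATEGORY_NAMES = [
    "Billing Inquiries", "Policy Administration", "Claims Assistance",
    "Coverage Explanations", "Account Management", "Underwriting",
    "Fraud & Compliance", "Agent Support", "Product Information",
    "General Inquiry",
]


def parse_category(response_text: str, names: List[str] = CATEGORY_NAMES) -> Optional[str]:
    """Return the known category mentioned last in the response, or None."""
    lowered = response_text.lower()
    best_name, best_pos = None, -1
    for name in names:
        pos = lowered.rfind(name.lower())  # -1 when the name never appears
        if pos > best_pos:
            best_name, best_pos = name, pos
    return best_name


reply = (
    "This could be Claims Assistance, but the question is about a charge.\n"
    "Final classification: Billing Inquiries"
)
print(parse_category(reply))
```

A `None` result means no known category was found. Before scoring, it is safest to replace it with a placeholder string so that it counts as a miss rather than mixing types in the prediction list.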
Production Considerations
When deploying your classifier:
- Cache embeddings – Pre-compute and store embeddings to avoid API calls on every request
- Batch processing – Use Claude's batch API for high-volume classification
- Confidence thresholds – Flag low-confidence classifications for human review
- Feedback loop – Collect misclassifications to improve your prompt and examples
- Cost optimization – Use Claude 3 Haiku for simpler tickets, Sonnet/Opus for complex ones
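For the confidence-threshold item, one possible signal is how well the retrieved examples support the prediction. This is an assumption of the sketch, not something the guide prescribes. It expects examples shaped like the output of `find_similar_examples`; `needs_review` and both cutoff values are hypothetical and would need tuning on real data.

```python
from typing import Dict, List


def needs_review(
    predicted: str,
    examples: List[Dict],
    min_similarity: float = 0.75,  # assumed cutoff; tune on a validation set
    min_agreement: float = 0.5,    # assumed fraction of neighbors that must agree
) -> bool:
    """Flag a prediction when the retrieved examples offer weak support for it."""
    if not examples:
        return True
    best = max(ex["similarity"] for ex in examples)
    agreeing = sum(1 for ex in examples if ex["label"] == predicted)
    agreement = agreeing / len(examples)
    return best < min_similarity or agreement < min_agreement


neighbors = [
    {"text": "...", "label": "Billing Inquiries", "similarity": 0.91},
    {"text": "...", "label": "Billing Inquiries", "similarity": 0.88},
    {"text": "...", "label": "Policy Administration", "similarity": 0.80},
]
print(needs_review("Billing Inquiries", neighbors))
print(needs_review("Claims Assistance", neighbors))
```

Tickets flagged this way can be routed to a human queue, and the corrected labels fed back into the training examples.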
Key Takeaways
- Start simple, then iterate – Begin with a basic prompt, add chain-of-thought reasoning, then layer in RAG for maximum accuracy
- RAG dramatically improves accuracy – By providing relevant examples at inference time, you can achieve 95%+ accuracy without fine-tuning
- Chain-of-thought provides explainability – Claude's reasoning process helps you understand and debug misclassifications
- LLMs handle complex business rules – Unlike traditional ML, you can encode nuanced rules directly in natural language prompts
- Productionize with caching and batching – Optimize for cost and latency while maintaining high accuracy