Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
This guide shows you how to build an insurance support ticket classifier using Claude, progressing from basic prompts to advanced techniques like RAG and chain-of-thought reasoning, achieving 95%+ accuracy on 10 categories.
Classification is one of the most powerful and practical applications of large language models (LLMs) like Claude. Whether you're routing customer support tickets, moderating content, or categorizing documents, LLMs offer a flexible, explainable alternative to traditional machine learning—especially when you're dealing with complex business rules or limited training data.
In this guide, you'll build a production-ready classification system that categorizes insurance support tickets into 10 distinct categories. You'll start with a simple prompt and progressively layer in advanced techniques—prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning—to push accuracy from roughly 70% to over 95%.
Prerequisites
- Python 3.11+ with basic familiarity
- An Anthropic API key (available from the Anthropic Console)
- A VoyageAI API key (optional—embeddings can be pre-computed)
- Basic understanding of classification problems
Setup and Installation
First, install the required packages:
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
Next, load your API keys and set up the Claude client:
import os
from anthropic import Anthropic

# Load API keys from environment variables
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set your model
MODEL_NAME = "claude-3-opus-20240229"
Problem Definition: Insurance Support Ticket Classifier
Imagine you're building a system for a large insurance company. Every day, thousands of support tickets arrive—billing questions, claims assistance, policy changes, and more. Manually categorizing these is slow and error-prone. Your goal is to build an automated classifier that can sort tickets into 10 categories with high accuracy.
Here are the categories we'll use (synthetically generated by Claude 3 Opus); a Python encoding of this list follows the descriptions:
- Billing Inquiries – Questions about invoices, charges, fees, premiums, payment methods
- Policy Administration – Policy changes, cancellations, renewals, adding/removing coverage
- Claims Assistance – Claims process, filing procedures, claim status, payout timelines
- Coverage Explanations – What's covered, limits, exclusions, deductibles
- Account Management – Login issues, profile updates, password resets
- Fraud and Security – Suspicious activity, identity theft, fraud prevention
- Policy Documentation – Requesting policy documents, certificates, ID cards
- Agent and Broker Support – Questions about agents, commissions, licensing
- Compliance and Regulatory – Regulatory questions, legal compliance, state-specific rules
- General Inquiries – Miscellaneous questions not fitting other categories
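The classifier functions below take these categories as a plain Python list. A minimal sketch (the names match the definitions above; appending each short description to its name is an optional variation that can give Claude more signal):

# The 10 category names, exactly as defined above. Optionally append
# each short description (e.g., "Billing Inquiries - Questions about
# invoices, charges, ...") to give the model more context.
categories = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    "Account Management",
    "Fraud and Security",
    "Policy Documentation",
    "Agent and Broker Support",
    "Compliance and Regulatory",
    "General Inquiries",
]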
Step 1: The Baseline Prompt
Let's start simple. We'll create a basic prompt that asks Claude to classify a ticket based on the category definitions. (The `chr(10).join(...)` below inserts newlines between the categories; it's a common workaround because backslashes weren't allowed inside f-string expressions before Python 3.12.)
def classify_ticket_baseline(ticket_text: str, categories: list) -> str:
    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into one of these categories:

{chr(10).join([f'{i+1}. {cat}' for i, cat in enumerate(categories)])}

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
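A quick smoke test, assuming `client` and the `categories` list are set up as above (the sample ticket text is invented for illustration):

# Hypothetical ticket for a quick sanity check.
sample_ticket = "I was charged twice for my premium this month."
print(classify_ticket_baseline(sample_ticket, categories))
# Expect a category name such as "Billing Inquiries".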
Result: This baseline approach typically achieves around 70-75% accuracy. It works for straightforward tickets but struggles with edge cases, ambiguous language, and tickets that span multiple categories. Note also that, because the prompt numbers the categories, Claude may echo the number along with the name (e.g., "1. Billing Inquiries"), so normalize the output before comparing it to labels.
Step 2: Improving with Prompt Engineering
Prompt engineering is your first lever for improvement. Let's refine our prompt with:
- Clear instructions about output format
- Few-shot examples to show Claude what good classifications look like
- Explicit reasoning steps (chain-of-thought)
def classify_ticket_engineered(ticket_text: str, categories: list) -> str:
    prompt = f"""You are an expert insurance ticket classifier. Your task is to classify the following support ticket into exactly one category.

Categories:
{chr(10).join([f'{i+1}. {cat}' for i, cat in enumerate(categories)])}

Examples:

Ticket: "I need to update my home address on my auto policy"
Category: Policy Administration

Ticket: "When will my claim payment be deposited?"
Category: Claims Assistance

Ticket: "I was charged twice for my monthly premium"
Category: Billing Inquiries

Now, classify this ticket. First, think step-by-step about what the customer is asking. Then, output only the category name.

Ticket: {ticket_text}

Step-by-step reasoning:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    # The completion contains reasoning followed by the category name;
    # take the last non-empty line and strip any "Category:" prefix.
    text = response.content[0].text.strip()
    last_line = [l for l in text.splitlines() if l.strip()][-1]
    return last_line.replace("Category:", "").strip()
Result: With careful prompt engineering and few-shot examples, accuracy jumps to around 80-85%. The chain-of-thought reasoning helps Claude handle ambiguous cases by explicitly working through the logic.
Step 3: Implementing Retrieval-Augmented Generation (RAG)
To push beyond 85%, we need to give Claude more context—specifically, relevant examples from our training data. This is where RAG comes in.
RAG works by:
- Converting all training examples into vector embeddings
- When a new ticket comes in, finding the most similar examples
- Including those examples in the prompt as additional context
Step 3.1: Create Embeddings
import voyageai

vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Convert training tickets to embeddings. training_data is a list of
# dicts with "text" and "category" keys (the training split produced
# in the evaluation section below).
training_texts = [ticket["text"] for ticket in training_data]
embeddings = vo.embed(training_texts, model="voyage-2").embeddings
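Embedding the full training set costs API calls, so in practice you may want to compute the embeddings once and cache them to disk, which is also what makes the VoyageAI key optional. A minimal sketch using NumPy (the cache file name is arbitrary):

import os
import numpy as np

EMBEDDINGS_CACHE = "train_embeddings.npy"  # arbitrary cache path

if os.path.exists(EMBEDDINGS_CACHE):
    # Reuse previously computed embeddings instead of re-calling the API.
    embeddings = np.load(EMBEDDINGS_CACHE).tolist()
else:
    embeddings = vo.embed(training_texts, model="voyage-2").embeddings
    np.save(EMBEDDINGS_CACHE, np.array(embeddings))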
Step 3.2: Build a Retrieval Function
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def retrieve_similar_examples(query: str, k: int = 3):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    # Compute similarity with all training examples
    similarities = cosine_similarity([query_embedding], embeddings)[0]
    # Get the indices of the top-k most similar examples, most similar first
    top_indices = np.argsort(similarities)[-k:][::-1]
    # Return the most similar examples
    return [training_data[i] for i in top_indices]
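A quick check of what the retriever returns (the query string is invented; each item is one of your training dicts):

# Print the categories and a preview of the retrieved neighbors.
for ex in retrieve_similar_examples("My claim check never arrived", k=3):
    print(f"{ex['category']}: {ex['text'][:60]}")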
Step 3.3: Augment the Prompt with Retrieved Examples
def classify_ticket_rag(ticket_text: str, categories: list) -> str:
    # Retrieve similar examples
    similar_examples = retrieve_similar_examples(ticket_text, k=3)

    # Format examples for the prompt
    examples_text = ""
    for ex in similar_examples:
        examples_text += f"Ticket: {ex['text']}\nCategory: {ex['category']}\n\n"

    prompt = f"""You are an expert insurance ticket classifier. Classify the following ticket into exactly one category.

Categories:
{chr(10).join([f'{i+1}. {cat}' for i, cat in enumerate(categories)])}

Here are some similar tickets and their correct categories for reference:

{examples_text}
Now classify this ticket. First, think step-by-step. Then, output only the category name.

Ticket: {ticket_text}

Step-by-step reasoning:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    # As before, take the last non-empty line as the predicted category
    # so the return value can be compared directly against labels.
    text = response.content[0].text.strip()
    last_line = [l for l in text.splitlines() if l.strip()][-1]
    return last_line.replace("Category:", "").strip()
Result: RAG pushes accuracy to 90-95%. The retrieved examples provide concrete, relevant context that helps Claude make better decisions, especially for edge cases.
Step 4: Adding Chain-of-Thought for Explainability
One of the biggest advantages of using Claude for classification is explainability. By asking Claude to reason step-by-step, you get both the classification and the reasoning behind it.
def classify_ticket_explainable(ticket_text: str, categories: list) -> dict:
    similar_examples = retrieve_similar_examples(ticket_text, k=3)

    examples_text = ""
    for ex in similar_examples:
        examples_text += f"Ticket: {ex['text']}\nCategory: {ex['category']}\n\n"

    prompt = f"""You are an expert insurance ticket classifier. Classify the following ticket into exactly one category.

Categories:
{chr(10).join([f'{i+1}. {cat}' for i, cat in enumerate(categories)])}

Reference examples:

{examples_text}
Ticket: {ticket_text}

First, think step-by-step about what the customer is asking. Then, provide your final classification.

Reasoning:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    full_response = response.content[0].text.strip()

    # Parse the response to extract reasoning and category
    # (in practice, you'd use a more robust parsing method;
    # one minimal version of these helpers is sketched below)
    return {
        "full_response": full_response,
        "category": extract_category(full_response, categories),
        "reasoning": extract_reasoning(full_response)
    }
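The helpers `extract_category` and `extract_reasoning` are referenced but not defined above. Here is one minimal, admittedly brittle way to write them; asking Claude to emit the category on a fixed final line, or in a structured format, would be more robust:

def extract_category(full_response: str, categories: list) -> str:
    # Scan from the end for the last line that names a known category;
    # fall back to "General Inquiries" if nothing matches.
    for line in reversed(full_response.splitlines()):
        for cat in categories:
            if cat.lower() in line.lower():
                return cat
    return "General Inquiries"

def extract_reasoning(full_response: str) -> str:
    # Treat everything before the final line as the reasoning.
    lines = [l for l in full_response.splitlines() if l.strip()]
    return "\n".join(lines[:-1]).strip()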
Result: Now you get both the classification and a human-readable explanation. This is invaluable for auditing, debugging, and building trust with stakeholders.
Testing and Evaluation
To properly evaluate your classifier, split your data into training and test sets:
from sklearn.model_selection import train_test_split

# Assuming you have a list of tickets with their true categories
training_data, test_data = train_test_split(
    all_tickets,
    test_size=0.2,
    random_state=42,
    stratify=[t["category"] for t in all_tickets]
)

# Evaluate on the test set
correct = 0
for ticket in test_data:
    predicted = classify_ticket_rag(ticket["text"], categories)
    if predicted == ticket["category"]:
        correct += 1

accuracy = correct / len(test_data)
print(f"Accuracy: {accuracy:.2%}")
Key Takeaways
- Start simple, then iterate. Begin with a basic prompt, measure your baseline, then progressively add prompt engineering, few-shot examples, and RAG. Each layer adds meaningful accuracy gains.
- RAG dramatically improves accuracy. By retrieving and including similar examples in the prompt, you give Claude the context it needs to handle edge cases and ambiguous tickets, pushing accuracy from ~80% to 95%+.
- Chain-of-thought reasoning provides explainability. Unlike traditional ML classifiers, Claude can explain why it made a classification. This is critical for compliance, debugging, and stakeholder trust.
- LLMs excel where traditional ML struggles. Complex business rules, limited training data, and the need for explainability are all scenarios where Claude-based classification outperforms traditional approaches.
- Productionize with care. In a real deployment, you'll want to add confidence thresholds, human-in-the-loop review for low-confidence predictions, and continuous monitoring to catch drift. A minimal sketch of the confidence-gating piece follows below.
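As one concrete illustration of that last point, here is a hedged sketch of a confidence-gated wrapper. The confidence signal used (cosine similarity to the nearest training example, already available from the RAG setup) is an assumption for illustration; logprobs or agreement across repeated samples are common alternatives, and the threshold should be tuned on a validation set:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

REVIEW_THRESHOLD = 0.80  # hypothetical cutoff; tune on a validation set

def classify_with_review(ticket_text: str, categories: list) -> dict:
    # Confidence proxy: how close is this ticket to anything we trained on?
    query_embedding = vo.embed([ticket_text], model="voyage-2").embeddings[0]
    similarities = cosine_similarity([query_embedding], embeddings)[0]
    confidence = float(np.max(similarities))

    result = classify_ticket_explainable(ticket_text, categories)
    result["confidence"] = confidence
    # Route unfamiliar-looking tickets to a human reviewer.
    result["needs_human_review"] = confidence < REVIEW_THRESHOLD
    return result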