Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
Classification is one of the most common and impactful use cases for Large Language Models (LLMs) in business. Whether you're routing support tickets, moderating content, or categorizing documents, getting classification right can dramatically improve operational efficiency.
In this guide, you'll build a production-grade classification system using Claude that categorizes insurance support tickets into 10 distinct categories. You'll learn how to progressively improve accuracy from a baseline of ~70% to over 95% by combining three powerful techniques: prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning.
By the end, you'll have a reusable framework for building classification systems that handle complex business rules, work with limited training data, and provide explainable results.
Prerequisites
- Python 3.11+ with basic familiarity
- An Anthropic API key
- A VoyageAI API key (optional—embeddings can be pre-computed)
- Basic understanding of classification problems
Why LLMs for Classification?
Traditional machine learning approaches to classification often struggle with:
- Complex business rules that are hard to encode as features
- Limited or low-quality training data
- Evolving categories that require frequent retraining
- Lack of interpretability—you get a label but no explanation
LLMs like Claude, by contrast, excel at:
- Understanding nuanced, context-dependent rules from natural language descriptions
- Performing well with few-shot examples (sometimes zero-shot)
- Providing natural language explanations for every classification decision
- Adapting quickly to new categories via prompt updates
Project Overview: Insurance Support Ticket Classifier
We'll build a system that classifies insurance support tickets into 10 categories:
- Billing Inquiries – Questions about invoices, charges, premiums
- Policy Administration – Policy changes, cancellations, renewals
- Claims Assistance – Claims process, documentation, status
- Coverage Explanations – What's covered, limits, exclusions
- Account Management – Login issues, profile updates
- Fraud Reporting – Suspicious activity, identity theft
- Agent Assistance – Agent contact, referrals
- Complaints – Service issues, escalations
- General Inquiries – Company info, hours, website help
- Other – Anything that doesn't fit above
Step 1: Setup and Data Preparation
First, install the required packages:
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
Now, let's set up our environment and load the data:
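import os
import pandas as pd
import numpy as np
from anthropic import Anthropic

# Load API keys
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set model
MODEL_NAME = "claude-3-opus-20240229"

# Category names and descriptions, passed to every classifier below
category_definitions = """- Billing Inquiries: Questions about invoices, charges, premiums
- Policy Administration: Policy changes, cancellations, renewals
- Claims Assistance: Claims process, documentation, status
- Coverage Explanations: What's covered, limits, exclusions
- Account Management: Login issues, profile updates
- Fraud Reporting: Suspicious activity, identity theft
- Agent Assistance: Agent contact, referrals
- Complaints: Service issues, escalations
- General Inquiries: Company info, hours, website help
- Other: Anything that doesn't fit above"""

# Load your training and test data
# Assuming CSV files with 'text' and 'label' columns
train_df = pd.read_csv("insurance_tickets_train.csv")
test_df = pd.read_csv("insurance_tickets_test.csv")
print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
print(f"Categories: {train_df['label'].unique()}")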
Step 2: Baseline Classification with Zero-Shot Prompting
Let's start with a simple zero-shot approach to establish a baseline:
def classify_ticket_zero_shot(ticket_text, categories):
    """Classify a ticket using zero-shot prompting."""
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:
{categories}
Ticket: {ticket_text}
Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

# Test on a sample
ticket = "I need help understanding why my premium increased this quarter."
result = classify_ticket_zero_shot(ticket, category_definitions)
print(f"Predicted: {result}")
Expected accuracy: ~70-75% — Not bad, but we can do much better.
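A raw reply will not always match a label exactly: it may differ in casing, carry trailing punctuation, or include a prefix such as "Category:". Here is a minimal sketch of a normalization step, assuming the ten category names from the project overview. The names `VALID_CATEGORIES` and `normalize_label` are hypothetical helpers, not part of the original code.

```python
VALID_CATEGORIES = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    "Account Management",
    "Fraud Reporting",
    "Agent Assistance",
    "Complaints",
    "General Inquiries",
    "Other",
]


def normalize_label(raw_output, valid=VALID_CATEGORIES, fallback="Other"):
    """Map a raw model reply onto one of the known category names."""
    cleaned = raw_output.strip().strip(".\"'").lower()
    # Exact match first (case-insensitive).
    for name in valid:
        if cleaned == name.lower():
            return name
    # Otherwise accept a reply that contains exactly one known name.
    hits = [name for name in valid if name.lower() in cleaned]
    if len(hits) == 1:
        return hits[0]
    # Ambiguous or unrecognized replies fall back to the catch-all category.
    return fallback


print(normalize_label("Category: Claims Assistance"))
```

Running predictions through a step like this before scoring keeps formatting noise from being counted as a misclassification.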
Step 3: Improving Accuracy with Few-Shot Examples
Adding a few carefully selected examples dramatically improves performance:
def classify_ticket_few_shot(ticket_text, categories, examples):
    """Classify using few-shot examples."""
    examples_text = "\n\n".join([
        f"Ticket: {ex['text']}\nCategory: {ex['label']}"
        for ex in examples
    ])
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:
{categories}
Here are some examples:
{examples_text}
Ticket: {ticket_text}
Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
Expected accuracy: ~80-85% — A solid improvement, but we're still missing context for edge cases.
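The guide does not say how the few-shot examples are chosen. One simple starting point is to draw the same number of tickets from every category so that each label is represented. The sketch below assumes a DataFrame with `text` and `label` columns, as in Step 1. The helper `select_few_shot_examples` and the toy tickets are illustrative only, and hand-curated examples may well perform better.

```python
import pandas as pd


def select_few_shot_examples(df, per_category=1, seed=0):
    """Pick `per_category` random tickets from each label as few-shot examples."""
    # Shuffle the rows reproducibly, then keep the first rows seen for each label.
    sampled = (
        df.sample(frac=1, random_state=seed)
        .groupby("label")
        .head(per_category)
    )
    # The few-shot classifier expects a list of {"text": ..., "label": ...} dicts.
    return sampled[["text", "label"]].to_dict("records")


# Toy data standing in for train_df (made-up tickets, not real data).
toy_df = pd.DataFrame({
    "text": [
        "Why was I charged twice this month?",
        "My invoice shows the wrong premium.",
        "How do I file a claim for water damage?",
        "What is the status of my claim?",
    ],
    "label": [
        "Billing Inquiries",
        "Billing Inquiries",
        "Claims Assistance",
        "Claims Assistance",
    ],
})

examples = select_few_shot_examples(toy_df, per_category=1)
print(examples)
```

The resulting list can be passed as the `examples` argument of `classify_ticket_few_shot`.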
Step 4: Implementing Retrieval-Augmented Generation (RAG)
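This is where things get interesting. Instead of manually selecting examples, we'll embed the training tickets and dynamically retrieve the most similar ones for each query. The code below runs the similarity search in memory; a vector database plays the same role once the training set grows large.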
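import voyageai
from sklearn.metrics.pairwise import cosine_similarity

# Initialize VoyageAI
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

# Generate embeddings: input_type="document" for stored tickets,
# input_type="query" for the ticket being classified
def get_embeddings(texts, input_type="document", batch_size=128):
    embeddings = []
    # Send texts in batches; 128 is an assumed safe size, check your API limits
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        result = vo.embed(batch, model="voyage-2", input_type=input_type)
        embeddings.extend(result.embeddings)
    return embeddings

# Pre-compute training embeddings
train_embeddings = get_embeddings(train_df["text"].tolist())

# Retrieve similar examples
def retrieve_similar_examples(query, k=5):
    query_embedding = get_embeddings([query], input_type="query")[0]
    similarities = cosine_similarity([query_embedding], train_embeddings)[0]
    # argsort is ascending: take the last k indices and reverse for best-first
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [
        {
            "text": train_df.iloc[i]["text"],
            "label": train_df.iloc[i]["label"],
            "similarity": similarities[i]
        }
        for i in top_indices
    ]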
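To make the retrieval step concrete without calling an embedding API, here is a small worked example on hand-written 2-D vectors. It computes cosine similarity directly with NumPy rather than scikit-learn, and the function name `top_k_by_cosine` and the toy vectors are assumptions for illustration.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_vecs, k=2):
    """Return indices of the k rows of doc_vecs most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so take the last k and reverse for best-first order.
    order = np.argsort(sims)[-k:][::-1]
    return order.tolist(), sims


doc_vecs = [
    [1.0, 0.0],  # points along x
    [0.0, 1.0],  # points along y
    [1.0, 1.0],  # diagonal
]
indices, sims = top_k_by_cosine([2.0, 0.1], doc_vecs, k=2)
print(indices)  # [0, 2]
```

The query points almost entirely along x, so the x-axis vector ranks first, the diagonal second, and the y-axis vector is left out. Real embeddings behave the same way in far more dimensions.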
Now, let's build the RAG-powered classifier:
def classify_ticket_rag(ticket_text, categories):
    """Classify using RAG to retrieve relevant examples."""
    # Retrieve similar examples
    similar_examples = retrieve_similar_examples(ticket_text, k=5)
    # Build prompt with retrieved examples
    examples_text = "\n\n".join([
        f"Ticket: {ex['text']}\nCategory: {ex['label']}"
        for ex in similar_examples
    ])
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:
{categories}
Here are the most relevant examples from our database:
{examples_text}
Ticket: {ticket_text}
Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
Expected accuracy: ~88-92% — The RAG approach adapts to each query, providing contextually relevant examples.
Step 5: Adding Chain-of-Thought Reasoning
For the final accuracy boost, we'll add chain-of-thought (CoT) reasoning. This forces Claude to think step-by-step before outputting a classification:
import re

def classify_ticket_cot(ticket_text, categories):
    """Classify using chain-of-thought reasoning with RAG."""
    similar_examples = retrieve_similar_examples(ticket_text, k=5)
    examples_text = "\n\n".join([
        f"Ticket: {ex['text']}\nCategory: {ex['label']}"
        for ex in similar_examples
    ])
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:
{categories}
Here are the most relevant examples from our database:
{examples_text}
Ticket: {ticket_text}
First, think step-by-step about what the ticket is asking about inside <reasoning> tags.
Then, provide your final classification as the exact category name inside <category> tags."""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=500,  # room for the reasoning plus the final label
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    output = response.content[0].text
    # Return only the label so it can be compared with the test labels;
    # log `output` as well if you want to keep the reasoning for auditing
    match = re.search(r"<category>(.*?)</category>", output, re.DOTALL)
    return match.group(1).strip() if match else output.strip()
Expected accuracy: ~95%+ — The CoT approach provides transparency and catches edge cases.
Step 6: Evaluation and Metrics
Let's evaluate our system properly:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate_classifier(classifier_fn, test_df, categories):
    """Evaluate a classifier on the test set."""
    predictions = []
    # Count rows with enumerate so progress is correct for any DataFrame index
    for count, (_, row) in enumerate(test_df.iterrows(), start=1):
        pred = classifier_fn(row["text"], categories)
        predictions.append(pred)
        if count % 50 == 0:
            print(f"Processed {count}/{len(test_df)} tickets...")
    # Calculate metrics
    accuracy = accuracy_score(test_df["label"], predictions)
    report = classification_report(test_df["label"], predictions)
    return accuracy, report, predictions

# Run evaluation
accuracy, report, predictions = evaluate_classifier(
    classify_ticket_cot,
    test_df,
    category_definitions
)
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(report)
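Overall accuracy hides which categories get confused with each other. The sketch below tallies the most frequent mistakes and the accuracy within each true category, using only the standard library. The helper names and the toy labels are illustrative assumptions, not results from the guide.

```python
from collections import Counter


def top_confusions(y_true, y_pred, n=3):
    """Return the n most frequent (true_label, predicted_label) mistakes."""
    mistakes = Counter(
        (truth, pred) for truth, pred in zip(y_true, y_pred) if truth != pred
    )
    return mistakes.most_common(n)


def per_category_accuracy(y_true, y_pred):
    """Fraction of tickets classified correctly within each true category."""
    totals, correct = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        correct[truth] += int(truth == pred)
    return {label: correct[label] / totals[label] for label in totals}


# Toy labels for illustration only.
y_true = ["Billing Inquiries", "Billing Inquiries", "Complaints", "Complaints", "Other"]
y_pred = ["Billing Inquiries", "Complaints", "Complaints", "Billing Inquiries", "Other"]

print(top_confusions(y_true, y_pred))
print(per_category_accuracy(y_true, y_pred))
```

With real data you would pass `test_df["label"]` and the `predictions` list returned by `evaluate_classifier`. Category pairs that keep showing up here are good candidates for sharper category descriptions in the prompt.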
Best Practices for Production
- Handle edge cases explicitly: Add an "Other" category and instruct Claude to use it when uncertain
- Use structured output: Request JSON format for easier parsing
- Implement confidence thresholds: Flag low-confidence classifications for human review
- Cache embeddings: Pre-compute and store embeddings to reduce API calls
- Monitor and iterate: Log all classifications and periodically review misclassifications to improve your prompts
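The embedding-cache practice above can be sketched as a small load-or-compute helper. This assumes embeddings are stored as a NumPy `.npy` file. The names `load_or_compute_embeddings` and `fake_embed` are hypothetical, with `fake_embed` standing in for a real embedding call so the example runs offline.

```python
import os
import tempfile

import numpy as np


def load_or_compute_embeddings(texts, embed_fn, cache_path):
    """Return cached embeddings if present, otherwise compute and store them.

    cache_path should end in ".npy" (np.save appends that suffix otherwise).
    """
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = np.asarray(embed_fn(texts), dtype=np.float32)
    np.save(cache_path, embeddings)
    return embeddings


calls = []


def fake_embed(texts):
    """Stand-in for an embedding API: records each call, returns toy vectors."""
    calls.append(len(texts))
    return [[float(len(t)), 1.0] for t in texts]


cache_file = os.path.join(tempfile.mkdtemp(), "train_embeddings.npy")
first = load_or_compute_embeddings(["a", "bb"], fake_embed, cache_file)
second = load_or_compute_embeddings(["a", "bb"], fake_embed, cache_file)
print(len(calls))  # the embedding function ran only once
```

This sketch never invalidates the cache, so delete the file (or key the filename on a hash of the texts) whenever the training set changes.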
Key Takeaways
- Start simple, then layer complexity: Begin with zero-shot prompting, add few-shot examples, then implement RAG and chain-of-thought reasoning for maximum accuracy
- RAG dramatically improves classification: By retrieving the most relevant examples for each query, you provide context that helps Claude handle edge cases and ambiguous tickets
- Chain-of-thought reasoning adds transparency: CoT not only improves accuracy but also provides explanations you can use for auditing and debugging
- LLM-based classification excels where traditional ML struggles: Complex business rules, limited training data, and the need for interpretability are all strengths of the LLM approach
- Production systems need guardrails: Implement confidence thresholds, structured output parsing, and human-in-the-loop review for critical classifications
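As a sketch of the guardrails described above, the function below parses a JSON reply and flags low-confidence or unparseable results for human review. It assumes the prompt asked Claude to reply with a JSON object containing `category` and `confidence` fields, which the original prompts do not do. The field names, the `parse_classification` helper, and the 0.8 threshold are all illustrative. A self-reported confidence is not a calibrated probability, so tune the threshold against reviewed samples.

```python
import json


def parse_classification(raw_reply, threshold=0.8):
    """Parse a JSON reply and decide whether it needs human review."""
    try:
        data = json.loads(raw_reply)
        category = str(data["category"])
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Unparseable replies are routed to a person rather than guessed at.
        return {"category": None, "confidence": 0.0, "needs_review": True}
    return {
        "category": category,
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }


confident = parse_classification('{"category": "Complaints", "confidence": 0.93}')
uncertain = parse_classification('{"category": "Other", "confidence": 0.4}')
broken = parse_classification("Sorry, I cannot classify this ticket.")
print(confident, uncertain, broken, sep="\n")
```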