Building a High-Accuracy Insurance Ticket Classifier with Claude
Learn to build a production-ready classification system using Claude, prompt engineering, and RAG. Achieve 95%+ accuracy on complex business rules with limited training data.
Classification is one of the most common and valuable applications of large language models (LLMs) in business. Traditional machine learning approaches often struggle with complex business rules, limited training data, and the need for explainable results. Claude excels in these scenarios.
In this guide, you'll build a production-ready classification system that categorizes insurance support tickets into 10 distinct categories. You'll learn how to progressively improve classification accuracy from a baseline of ~70% to over 95% by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning.
Prerequisites
Before starting, ensure you have:
- Python 3.11+ installed
- An Anthropic API key
- Basic familiarity with Python and classification concepts
- (Optional) A VoyageAI API key for generating embeddings
Setup and Installation
First, install the required packages:
```shell
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Next, set up your API keys and initialize the Claude client:
```python
import os

from anthropic import Anthropic

# Load API keys from environment variables
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set your model
MODEL_NAME = "claude-3-opus-20240229"
```
Understanding the Problem
Insurance companies receive thousands of support tickets daily. Manually categorizing these tickets is slow, expensive, and error-prone. Our goal is to build a system that automatically classifies tickets into categories like:
- Billing Inquiries – Questions about invoices, charges, premiums, and payment methods
- Policy Administration – Requests for policy changes, cancellations, or renewals
- Claims Assistance – Questions about filing claims, documentation, and status
- Coverage Explanations – Clarifications on what is covered, limits, and exclusions
- (and 6 more categories)
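The evaluation code later in this guide calls a `get_category_definitions()` helper to format these definitions for use in prompts. A minimal sketch of that helper follows; the four definitions are the ones listed above, while the remaining six entries are placeholders you should replace with your own categories.

```python
# The four categories below come from this guide; replace the placeholder
# comment with your remaining six categories and their definitions.
CATEGORY_DEFINITIONS = {
    "Billing Inquiries": "Questions about invoices, charges, premiums, and payment methods",
    "Policy Administration": "Requests for policy changes, cancellations, or renewals",
    "Claims Assistance": "Questions about filing claims, documentation, and status",
    "Coverage Explanations": "Clarifications on what is covered, limits, and exclusions",
    # ... add your remaining six categories here
}

def get_category_definitions():
    """Format the categories as a bulleted string for use inside prompts."""
    return "\n".join(
        f"- {name}: {description}"
        for name, description in CATEGORY_DEFINITIONS.items()
    )
```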
Step 1: Data Preparation
Proper data preparation is the foundation of any good classification system. You'll need two datasets:
- Training data: Labeled examples used to build and refine the classifier
- Test data: Unseen examples used to evaluate performance
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load your dataset
# Assume df has columns: 'ticket_text' and 'category'
df = pd.read_csv("insurance_tickets.csv")

# Split into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
```
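With 10 categories, some may be rare in your data, and a plain random split can leave a category underrepresented in the test set. Passing `stratify` to `train_test_split` keeps each category's share the same in both splits. A minimal sketch with a toy two-category dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: 15 billing tickets and 5 claims tickets
toy_df = pd.DataFrame({
    "ticket_text": [f"ticket {i}" for i in range(20)],
    "category": ["Billing Inquiries"] * 15 + ["Claims Assistance"] * 5,
})

# stratify preserves the 75/25 category ratio in both train and test sets
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["category"]
)
print(toy_test["category"].value_counts().to_dict())
```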
Step 2: Prompt Engineering for Baseline Classification
Start with a simple zero-shot prompt to establish a baseline. This approach asks Claude to classify a ticket using only the category definitions.
```python
def classify_ticket_zero_shot(ticket_text, categories):
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:

{categories}

Ticket: {ticket_text}

Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Baseline results: Expect around 70-75% accuracy. This is decent but not production-ready.
Step 3: Improving Accuracy with Few-Shot Examples
Adding a few carefully selected examples to your prompt can dramatically improve accuracy. This is called few-shot prompting.
```python
def classify_ticket_few_shot(ticket_text, categories, examples):
    example_text = ""
    for ex in examples:
        example_text += f"Ticket: {ex['text']}\nCategory: {ex['category']}\n\n"
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:

{categories}

Here are some examples:

{example_text}
Ticket: {ticket_text}

Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Results: Accuracy typically jumps to 80-85% with 3-5 well-chosen examples.
Step 4: Implementing Retrieval-Augmented Generation (RAG)
For maximum accuracy, use RAG to dynamically retrieve the most relevant training examples for each query, so that Claude sees context tailored to the ticket at hand.
```python
import numpy as np
import voyageai
from sklearn.metrics.pairwise import cosine_similarity

# Initialize embedding model
vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Generate embeddings for training data
def get_embeddings(texts):
    result = vo.embed(texts, model="voyage-2")
    return result.embeddings

# Pre-compute training embeddings
train_texts = train_df['ticket_text'].tolist()
train_embeddings = get_embeddings(train_texts)

def retrieve_similar_examples(query, k=3):
    query_embedding = get_embeddings([query])[0]
    similarities = cosine_similarity([query_embedding], train_embeddings)[0]
    top_indices = np.argsort(similarities)[-k:][::-1]
    examples = []
    for idx in top_indices:
        examples.append({
            'text': train_df.iloc[idx]['ticket_text'],
            'category': train_df.iloc[idx]['category']
        })
    return examples
```
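Under the hood, retrieval is just a cosine-similarity ranking. A self-contained toy version (mock 3-dimensional vectors in place of real VoyageAI embeddings, and plain NumPy instead of scikit-learn) makes the mechanics easy to verify without any API calls:

```python
import numpy as np

def top_k_by_cosine(query_vec, corpus_vecs, k=3):
    """Return indices of the k most cosine-similar corpus vectors,
    most similar first (a toy stand-in for the retrieval step)."""
    query = np.asarray(query_vec, dtype=float)
    corpus = np.asarray(corpus_vecs, dtype=float)
    sims = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    return np.argsort(sims)[-k:][::-1]

# Mock 3-dimensional embeddings for four training tickets
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(top_k_by_cosine([1.0, 0.05, 0.0], corpus, k=2))  # -> [0 1]
```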
```python
def classify_ticket_with_rag(ticket_text, categories):
    # Retrieve most similar examples
    examples = retrieve_similar_examples(ticket_text, k=3)

    # Build prompt with retrieved examples
    example_text = ""
    for ex in examples:
        example_text += f"Ticket: {ex['text']}\nCategory: {ex['category']}\n\n"
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:

{categories}

Here are similar examples from our database:

{example_text}
Ticket: {ticket_text}

Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Results: RAG pushes accuracy to 90-95% by providing the most contextually relevant examples.
Step 5: Adding Chain-of-Thought Reasoning
For the final accuracy boost, ask Claude to explain its reasoning before giving the final category. This reduces errors by forcing the model to think step-by-step.
```python
def classify_ticket_with_cot(ticket_text, categories):
    examples = retrieve_similar_examples(ticket_text, k=3)
    example_text = ""
    for ex in examples:
        example_text += f"Ticket: {ex['text']}\nCategory: {ex['category']}\n\n"
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:

{categories}

Here are similar examples from our database:

{example_text}
First, think step-by-step about which category best fits this ticket. Then, provide your final answer.

Ticket: {ticket_text}

Reasoning:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    full_response = response.content[0].text.strip()
    # Extract the final category (assumes format "Category: X")
    if "Category:" in full_response:
        category = full_response.split("Category:")[-1].strip()
    else:
        category = full_response.split("\n")[-1].strip()
    return category, full_response
```
Results: Chain-of-thought reasoning typically achieves 95%+ accuracy, with the added benefit of explainable classifications.
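The simple string split above breaks if Claude phrases its answer differently. A more defensive sketch (the function name and fallback behavior are suggestions, not part of the guide's original code) validates the extracted text against the known category names:

```python
def extract_category(full_response, valid_categories):
    """Pull the final category out of a chain-of-thought response and check
    it against the known category names; returns None if nothing matches."""
    if "Category:" in full_response:
        candidate = full_response.split("Category:")[-1].strip()
    else:
        candidate = full_response.strip().split("\n")[-1].strip()
    # Tolerate extra punctuation or casing around the category name
    for category in valid_categories:
        if category.lower() in candidate.lower():
            return category
    return None  # Caller can route unmatched responses to human review
```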
Testing and Evaluation
Now, let's evaluate our final system against the test dataset:
```python
from sklearn.metrics import accuracy_score, classification_report

predictions = []
actuals = []

for idx, row in test_df.iterrows():
    ticket = row['ticket_text']
    true_category = row['category']
    # get_category_definitions() should return the formatted list of
    # category names and descriptions used in the prompts above
    predicted_category, reasoning = classify_ticket_with_cot(
        ticket,
        get_category_definitions()
    )
    predictions.append(predicted_category)
    actuals.append(true_category)
    print(f"Ticket {idx}: Predicted={predicted_category}, Actual={true_category}")

# Calculate accuracy
accuracy = accuracy_score(actuals, predictions)
print(f"\nOverall Accuracy: {accuracy:.2%}")

# Detailed report
print("\nClassification Report:")
print(classification_report(actuals, predictions))
```
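Overall accuracy hides which categories get confused with each other. A confusion matrix pinpoints the problem pairs; the sketch below uses toy labels in place of real predictions so you can see the row/column layout (rows are actual categories, columns are predicted):

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for real test-set results
actuals = ["Billing Inquiries", "Claims Assistance",
           "Billing Inquiries", "Coverage Explanations"]
predictions = ["Billing Inquiries", "Claims Assistance",
               "Claims Assistance", "Coverage Explanations"]

labels = ["Billing Inquiries", "Claims Assistance", "Coverage Explanations"]
cm = confusion_matrix(actuals, predictions, labels=labels)
print(cm)  # One billing ticket was misclassified as a claims ticket
```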
Best Practices for Production
- Monitor accuracy drift: Regularly evaluate your classifier against new labeled data to catch performance degradation.
- Cache embeddings: Pre-compute and store embeddings to reduce latency.
- Handle edge cases: Add a "None of the above" category for truly ambiguous tickets.
- Log reasoning: Store chain-of-thought explanations for auditability and debugging.
- Iterate on categories: Refine category definitions based on misclassifications.
Key Takeaways
- Start simple, then layer complexity: Begin with zero-shot prompting, add few-shot examples, then implement RAG and chain-of-thought reasoning for maximum accuracy.
- RAG dramatically improves accuracy: By dynamically retrieving the most relevant examples, you can achieve 90%+ accuracy even with limited training data.
- Chain-of-thought reasoning provides explainability: Claude's step-by-step reasoning not only improves accuracy but also makes classifications auditable and trustworthy.
- Prompt engineering is iterative: Expect to refine your prompts multiple times. Each iteration should target specific failure modes identified during evaluation.
- Claude handles complex business rules: Unlike traditional ML models, Claude can understand nuanced category definitions and edge cases without extensive feature engineering.