Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
Learn to build a production-ready classification system using Claude, prompt engineering, RAG, and chain-of-thought reasoning. Achieve 95%+ accuracy on complex business rules with limited data.
This guide shows you how to build a high-accuracy insurance support ticket classifier using Claude. You'll learn to combine prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning to improve classification accuracy from 70% to over 95%, even with limited training data.
Introduction
Classification is one of the most common and impactful tasks in business automation. Whether you're routing support tickets, moderating content, or categorizing customer feedback, getting the labels right matters. Traditional machine learning approaches often struggle with complex business rules, limited training data, and the need for explainable results.
Large Language Models (LLMs) like Claude have changed the game. They can handle nuanced classification problems, work with minimal examples, and provide natural language justifications for their decisions. In this guide, you'll build a production-ready insurance support ticket classifier that starts at 70% accuracy and reaches over 95% by combining three powerful techniques:
- Prompt engineering to define clear class boundaries
- Retrieval-Augmented Generation (RAG) to inject relevant examples
- Chain-of-thought reasoning to improve decision quality
Prerequisites
Before diving in, make sure you have:
- Python 3.11+ installed
- An Anthropic API key
- Basic familiarity with Python and classification concepts
- (Optional) A VoyageAI API key for custom embeddings (pre-computed embeddings are provided)
Setup and Installation
First, install the required packages:
```bash
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Now load your API keys and set up the Claude client:
```python
import os
from anthropic import Anthropic

# Load the API key from an environment variable
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set your model
MODEL_NAME = "claude-3-opus-20240229"  # or claude-3-sonnet-20240229 for faster/cheaper
```
Problem Definition: Insurance Support Ticket Classification
Insurance companies receive thousands of support tickets daily. Manually categorizing them is slow and error-prone. In this guide, we'll classify tickets into 10 categories:
- Billing Inquiries – Questions about invoices, charges, premiums, payment methods
- Policy Administration – Policy changes, cancellations, renewals
- Claims Assistance – Claims process, documentation, status inquiries
- Coverage Explanations – What's covered, limits, exclusions, deductibles
- Account Management – Login issues, profile updates, password resets
- Underwriting – Risk assessment, policy issuance, medical history
- Fraud & Compliance – Suspicious activity, regulatory questions
- Agent/Broker Support – Commission questions, licensing, tools
- Product Information – New products, features, comparisons
- General Inquiry – Anything that doesn't fit above
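For use in the prompts below, these categories need to be serialized as text. Here's one possible encoding (the dictionary layout is an illustrative choice; the variable name `category_definitions` matches the evaluation code later in this guide):

```python
# Category names and short definitions, formatted for inclusion in prompts.
# The resulting string is passed as `categories` to each classifier below.
CATEGORIES = {
    "Billing Inquiries": "Questions about invoices, charges, premiums, payment methods",
    "Policy Administration": "Policy changes, cancellations, renewals",
    "Claims Assistance": "Claims process, documentation, status inquiries",
    "Coverage Explanations": "What's covered, limits, exclusions, deductibles",
    "Account Management": "Login issues, profile updates, password resets",
    "Underwriting": "Risk assessment, policy issuance, medical history",
    "Fraud & Compliance": "Suspicious activity, regulatory questions",
    "Agent/Broker Support": "Commission questions, licensing, tools",
    "Product Information": "New products, features, comparisons",
    "General Inquiry": "Anything that doesn't fit above",
}
category_definitions = "\n".join(f"- {name}: {desc}" for name, desc in CATEGORIES.items())
```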
Step 1: Data Preparation
Proper data preparation is critical. You need:
- Training data: Labeled examples that define each category
- Test data: Held-out examples to evaluate performance
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load your dataset (example structure: 'ticket_text' and 'category' columns)
df = pd.read_csv("insurance_tickets.csv")

# Split into train and test sets, stratified so every category appears in both
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["category"]
)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
```
Step 2: Baseline Prompt Engineering
Start with a simple prompt that defines the categories and asks Claude to classify. This gives you a baseline accuracy (typically around 70% for complex tasks).
```python
def classify_ticket_baseline(ticket_text, categories):
    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into one of these categories:

{categories}

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```
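A quick smoke test on a single, hypothetical ticket (assuming the `category_definitions` string from the problem-definition step):

```python
# Hypothetical ticket for a quick smoke test
example = "I was charged twice for my premium this month. Can you refund one payment?"
print(classify_ticket_baseline(example, category_definitions))
# A correct classification here would be "Billing Inquiries"
```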
Result: ~70% accuracy. Not bad, but not production-ready.
Step 3: Adding Chain-of-Thought Reasoning
Chain-of-thought (CoT) prompting dramatically improves accuracy by forcing the model to reason step-by-step before outputting a label.
```python
def classify_ticket_cot(ticket_text, categories):
    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into one of these categories:

{categories}

First, think step-by-step about what the ticket is asking. Consider:
- What is the main topic or issue?
- What action is the customer requesting?
- Which category best matches?

Then output your final answer on a new line starting with "Category:".

Ticket: {ticket_text}

Reasoning:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    full_response = response.content[0].text.strip()

    # Extract the final label that follows the "Category:" marker
    if "Category:" in full_response:
        return full_response.split("Category:")[-1].strip()
    return full_response
```
Result: ~80% accuracy. The reasoning step helps Claude disambiguate similar categories.
Step 4: Implementing Retrieval-Augmented Generation (RAG)
RAG supercharges your classifier by retrieving the most similar training examples and including them in the prompt as few-shot examples. This is especially powerful when you have limited training data.
Create a Vector Database
```python
import numpy as np
import voyageai
from sklearn.metrics.pairwise import cosine_similarity

# For simplicity, we embed texts on the fly with VoyageAI. If you have
# pre-computed embeddings, load them here instead of calling the API.
voyage_client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def get_embedding(text):
    # Embed a single text and return it as a 1-D numpy array
    result = voyage_client.embed([text], model="voyage-2")
    return np.array(result.embeddings[0])

# Build a simple in-memory vector store over the training tickets
train_embeddings = np.array(
    [get_embedding(text) for text in train_df["ticket_text"]]
)

def retrieve_similar_examples(query, k=3):
    # Return the k training rows most similar to the query text
    query_embedding = get_embedding(query).reshape(1, -1)
    similarities = cosine_similarity(query_embedding, train_embeddings)[0]
    top_indices = np.argsort(similarities)[-k:][::-1]
    return train_df.iloc[top_indices]
```
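Before wiring retrieval into the prompt, it's worth a quick sanity check that the nearest neighbors look plausible (the sample ticket below is hypothetical):

```python
# Inspect the nearest training examples for a sample ticket
sample = "How do I update the bank account my premium is drafted from?"
for _, row in retrieve_similar_examples(sample, k=3).iterrows():
    print(f"[{row['category']}] {row['ticket_text'][:80]}")
```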
Augment the Prompt with Retrieved Examples
```python
def classify_ticket_rag(ticket_text, categories, k=3):
    # Retrieve the k most similar labeled examples
    similar = retrieve_similar_examples(ticket_text, k)

    # Format the retrieved examples as few-shot demonstrations
    examples = ""
    for _, row in similar.iterrows():
        examples += f"Ticket: {row['ticket_text']}\nCategory: {row['category']}\n\n"

    prompt = f"""You are an insurance support ticket classifier. Classify the following ticket into one of these categories:

{categories}

Here are some similar examples for reference:

{examples}First, think step-by-step about what the ticket is asking. Consider:
- What is the main topic or issue?
- What action is the customer requesting?
- How does this compare to the examples above?
- Which category best matches?

Then output your final answer on a new line starting with "Category:".

Ticket: {ticket_text}

Reasoning:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    full_response = response.content[0].text.strip()

    if "Category:" in full_response:
        return full_response.split("Category:")[-1].strip()
    return full_response
```
Result: ~90% accuracy. The retrieved examples provide concrete context that helps Claude understand subtle category boundaries.
Step 5: Evaluation and Iteration
Now let's evaluate our classifier systematically:
```python
from sklearn.metrics import accuracy_score, classification_report

def evaluate_classifier(classifier_fn, test_df, categories):
    predictions = []
    for ticket in test_df["ticket_text"]:
        pred = classifier_fn(ticket, categories)
        predictions.append(pred)

    accuracy = accuracy_score(test_df["category"], predictions)
    print(f"Accuracy: {accuracy:.2%}")
    print("\nClassification Report:")
    print(classification_report(test_df["category"], predictions))
    return accuracy

# Evaluate the RAG + CoT classifier; category_definitions is the formatted
# category list built in the problem-definition step
accuracy = evaluate_classifier(classify_ticket_rag, test_df, category_definitions)
```
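Accuracy alone hides which categories get confused with each other. A confusion matrix makes that visible; here's a minimal sketch using the matplotlib and scikit-learn packages installed earlier (it recomputes predictions, so consider returning them from `evaluate_classifier` instead):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot true vs. predicted categories for the test set
predictions = [classify_ticket_rag(t, category_definitions) for t in test_df["ticket_text"]]
ConfusionMatrixDisplay.from_predictions(
    test_df["category"], predictions, xticks_rotation=45
)
plt.tight_layout()
plt.show()
```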
Advanced Tips for Pushing to 95%+
To reach the highest accuracy levels, consider these refinements:
1. Dynamic k Selection
Not all queries need the same number of examples. Experiment with retrieving 3-5 examples for ambiguous cases and fewer for clear ones; one possible heuristic is sketched below.
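A minimal sketch of such a heuristic, reusing `get_embedding` and `train_embeddings` from Step 4 (the function name and the `ambiguity_gap` threshold are illustrative, not tuned):

```python
def dynamic_k(query, base_k=3, max_k=5, ambiguity_gap=0.05):
    # If the two best matches score almost identically, the case is
    # ambiguous, so retrieve more examples; otherwise keep the prompt short.
    query_embedding = get_embedding(query).reshape(1, -1)
    similarities = cosine_similarity(query_embedding, train_embeddings)[0]
    top_two = np.sort(similarities)[-2:]
    return max_k if (top_two[1] - top_two[0]) < ambiguity_gap else base_k
```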
2. Category-Specific Examples
If certain categories are frequently confused (e.g., "Billing Inquiries" vs. "Policy Administration"), ensure your retrieval includes examples from both categories.
3. Confidence Thresholds
Add a confidence score to your classifier and flag low-confidence predictions for human review:

```python
def classify_with_confidence(ticket_text, categories):
    # The elided portion is the same RAG prompt as in Step 4
    prompt = f"""... [same prompt as above] ...

After your reasoning, output:
Category: [category]
Confidence: [0-100]
"""
    response = client.messages.create(
        model=MODEL_NAME, max_tokens=300,
        messages=[{"role": "user", "content": prompt}])
    text = response.content[0].text
    # Parse the label and confidence; flag anything below 80 for manual review
    category = text.split("Category:")[-1].split("Confidence:")[0].strip()
    confidence = int(text.split("Confidence:")[-1].strip().split()[0])
    return category, confidence, confidence < 80
```
4. Ensemble with Traditional ML
Combine Claude's predictions with a traditional classifier (e.g., SVM or Random Forest) and use a voting mechanism for final decisions; a minimal sketch follows.
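Here's one way such an ensemble might look, assuming a TF-IDF + linear SVM model trained on the same split (the agreement rule is illustrative; weighted voting is another option):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Train a lightweight traditional classifier on the same training data
svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(train_df["ticket_text"], train_df["category"])

def classify_ensemble(ticket_text, categories):
    # Agreement rule: when the two models disagree, keep Claude's
    # prediction but flag the ticket for human review
    claude_pred = classify_ticket_rag(ticket_text, categories)
    svm_pred = svm.predict([ticket_text])[0]
    return claude_pred, claude_pred == svm_pred
```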
Complete Workflow Summary
Here's the end-to-end pipeline you've built:
- Prepare data – Split into train/test sets
- Baseline prompt – Simple classification (~70% accuracy)
- Add chain-of-thought – Step-by-step reasoning (~80% accuracy)
- Implement RAG – Retrieve similar examples (~90% accuracy)
- Iterate and refine – Tune k, add confidence thresholds, handle edge cases (95%+ accuracy)
Key Takeaways
- Start simple, then layer complexity: Begin with a basic prompt, then add chain-of-thought reasoning, then RAG. In this walkthrough, each layer added roughly 5-10 percentage points of accuracy.
- RAG is a game-changer for limited data: By retrieving similar examples at inference time, you can achieve high accuracy with surprisingly small training sets.
- Chain-of-thought reasoning improves explainability: Not only does it boost accuracy, but it also provides a natural language justification that helps debug misclassifications.
- Always evaluate systematically: Use held-out test data and a classification report to understand where your system excels and where it struggles.
- Consider human-in-the-loop for edge cases: Implement confidence thresholds to flag uncertain predictions for manual review, especially in high-stakes domains like insurance.