Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
This guide shows you how to build a high-accuracy insurance support ticket classifier using Claude. You'll learn prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning to improve classification accuracy from 70% to over 95%.
Classification is one of the most practical applications of large language models (LLMs) in business. Whether you're routing customer support tickets, moderating content, or categorizing documents, getting classification right can save hours of manual work and improve response times dramatically.
In this guide, you'll build an insurance support ticket classifier using Claude that starts at 70% accuracy and climbs to over 95% through a combination of prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning. By the end, you'll have a reusable framework for tackling complex classification problems with limited training data.
Prerequisites
Before diving in, make sure you have:
- Python 3.11+ installed
- An Anthropic API key
- A VoyageAI API key (optional — embeddings can be pre-computed)
- Basic familiarity with Python and classification concepts
Why Use Claude for Classification?
Traditional machine learning classifiers struggle with:
- Complex business rules that are hard to encode as features
- Limited or low-quality training data, where deep learning models fall short
- Explainability: black-box models can't justify their decisions

An LLM like Claude, by contrast, is well suited to classification because it can:
- Understand natural language instructions for nuanced rules
- Perform well from few-shot examples (even 10–20 per class)
- Provide a natural language explanation for every classification
Step 1: Setting Up Your Environment
First, install the required packages:
```bash
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Next, load your API keys and configure the Claude client:
```python
import os
from anthropic import Anthropic

# Load API keys from environment variables
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set your model
MODEL_NAME = "claude-3-opus-20240229"  # or claude-3-sonnet for speed
```
Step 2: Understanding the Problem — Insurance Support Tickets
We're building a classifier for an insurance company that receives thousands of support tickets daily. The tickets need to be sorted into 10 categories:
- Billing Inquiries — Questions about invoices, charges, fees, and premiums
- Policy Administration — Policy changes, cancellations, renewals
- Claims Assistance — Filing procedures, claim status, payout timelines
- Coverage Explanations — What's covered, limits, exclusions, deductibles
- Account Management — Login issues, profile updates, password resets
- Document Requests — Requesting policy documents, ID cards, certificates
- Complaints & Escalations — Dissatisfaction, complaints, escalation requests
- Fraud & Compliance — Reporting fraud, compliance questions
- Agent & Broker Support — Agent commissions, broker portal issues
- General Inquiries — Miscellaneous questions not fitting other categories
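In code, the category list above can be captured directly. The `CATEGORIES` constant and the `normalize_category` helper below are illustrative names, not part of any library; the helper is a minimal sketch for mapping Claude's free-text answer back onto a canonical label:

```python
CATEGORIES = [
    "Billing Inquiries", "Policy Administration", "Claims Assistance",
    "Coverage Explanations", "Account Management", "Document Requests",
    "Complaints & Escalations", "Fraud & Compliance",
    "Agent & Broker Support", "General Inquiries",
]

def normalize_category(raw_answer, categories=CATEGORIES):
    """Map a model's free-text answer onto one of the canonical category names."""
    cleaned = raw_answer.strip().lower().rstrip(".")
    for cat in categories:
        if cat.lower() == cleaned or cat.lower() in cleaned:
            return cat
    return None  # caller decides how to handle an unrecognized answer
```

Normalizing the model's output this way guards against minor variations such as trailing punctuation, different casing, or a "Category:" prefix.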
Step 3: Baseline Classification with Zero-Shot Prompting
Let's start simple. A zero-shot prompt asks Claude to classify a ticket without any examples:
```python
def classify_ticket_zero_shot(ticket_text, categories):
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:

Categories:
{chr(10).join([f'{i+1}. {cat}' for i, cat in enumerate(categories)])}

Ticket: {ticket_text}

Respond with only the category name."""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Result: Expect around 70% accuracy. Claude understands the categories but misses nuance — for example, confusing "billing inquiry" with "policy administration" when a ticket mentions both payment and a policy change.
Step 4: Improving with Few-Shot Prompting
Adding a few examples per category dramatically improves performance. Here's how to structure a few-shot prompt:
```python
def classify_ticket_few_shot(ticket_text, categories, examples):
    # Build the examples string from a {category: [texts]} mapping
    example_str = ""
    for cat, texts in examples.items():
        for text in texts:
            example_str += f"Ticket: {text}\nCategory: {cat}\n\n"

    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories.

Categories:
{chr(10).join([f'- {cat}' for cat in categories])}

Here are some examples:

{example_str}
Ticket: {ticket_text}
Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Result: Accuracy jumps to ~85%. The examples help Claude understand subtle distinctions, like the difference between a "coverage explanation" and a "claims assistance" ticket.
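The `examples` argument above is a mapping from category name to sample tickets. One way to build it from labeled data (the `build_examples` helper name and the `{"text": ..., "category": ...}` record shape are assumptions, matching the dictionaries used elsewhere in this guide):

```python
from collections import defaultdict

def build_examples(training_data, per_category=3):
    """Group labeled tickets by category, keeping a few examples of each."""
    grouped = defaultdict(list)
    for ticket in training_data:
        if len(grouped[ticket["category"]]) < per_category:
            grouped[ticket["category"]].append(ticket["text"])
    return dict(grouped)
```

Capping the count per category keeps the prompt balanced, so no single category dominates the examples Claude sees.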
Step 5: Adding Chain-of-Thought Reasoning
Chain-of-thought (CoT) prompting forces Claude to reason step-by-step before outputting the final category. This is especially useful for ambiguous tickets:
```python
def classify_ticket_cot(ticket_text, categories, examples):
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories.

Categories:
{chr(10).join([f'- {cat}' for cat in categories])}

Here are some examples:

{examples}
Ticket: {ticket_text}

First, think step-by-step about what the ticket is asking. Consider:
- What is the main topic or issue?
- What action is the customer requesting?
- Which category best matches this?

Then, output your final answer on a new line starting with "Category:".

Reasoning:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Result: Accuracy reaches ~90%. The reasoning step reduces errors from jumping to conclusions based on keywords.
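One practical detail: the CoT response now contains the reasoning text as well as the answer, so you need to pull out the line that starts with "Category:" before comparing against your labels. A minimal parser (`extract_category` is a hypothetical helper name, not part of the Anthropic SDK):

```python
def extract_category(response_text):
    """Return the value of the last 'Category:' line in a CoT response."""
    for line in reversed(response_text.splitlines()):
        if line.strip().lower().startswith("category:"):
            return line.split(":", 1)[1].strip()
    return response_text.strip()  # fall back to the raw text
```

Scanning from the bottom up handles the case where the reasoning itself happens to mention the word "category".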
Step 6: Retrieval-Augmented Generation (RAG) for Dynamic Examples
Instead of hardcoding examples, use a vector database to retrieve the most similar tickets from your training set for each query. This ensures Claude always gets the most relevant examples.
6.1 Create Embeddings
```python
import voyageai

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

# Embed your training data
train_texts = [ticket["text"] for ticket in training_data]
train_embeddings = vo.embed(train_texts, model="voyage-2").embeddings
```
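To make the retrieval step concrete before reaching for a library: nearest-neighbor search here is just cosine similarity between the query embedding and each training embedding. A bare-NumPy sketch (`cosine_top_k` is an illustrative helper, not part of voyageai or scikit-learn):

```python
import numpy as np

def cosine_top_k(query_vec, embedding_matrix, k=5):
    """Indices of the k rows of embedding_matrix most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(embedding_matrix, dtype=float)
    # Cosine similarity = dot product divided by the norms (epsilon avoids /0)
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:k].tolist()
```

The scikit-learn index in the next section does the same thing, with better data structures for larger collections.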
6.2 Build a Vector Store
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Fit a nearest-neighbors index over the training embeddings
nn_model = NearestNeighbors(n_neighbors=5, metric="cosine")
nn_model.fit(train_embeddings)
```
6.3 Retrieve and Classify
```python
def classify_ticket_rag(ticket_text, categories, training_data, nn_model, vo_client):
    # Embed the query
    query_embedding = vo_client.embed([ticket_text], model="voyage-2").embeddings[0]

    # Find the nearest neighbors in the training set
    distances, indices = nn_model.kneighbors([query_embedding])

    # Build dynamic examples from the retrieved tickets
    examples = ""
    for idx in indices[0]:
        ticket = training_data[idx]
        examples += f"Ticket: {ticket['text']}\nCategory: {ticket['category']}\n\n"

    # Classify using few-shot examples plus chain-of-thought
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories.

Categories:
{chr(10).join([f'- {cat}' for cat in categories])}

Here are similar tickets from our database:

{examples}
Ticket: {ticket_text}

First, reason step-by-step, then output your final answer starting with "Category:".

Reasoning:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Result: Accuracy reaches 95%+. The RAG approach ensures Claude always sees the most relevant examples, handling edge cases and rare categories effectively.
Step 7: Testing and Evaluation
To evaluate your classifier, run it against a held-out test set and compute accuracy:
```python
def evaluate_classifier(classifier_fn, test_data, categories):
    correct = 0
    total = len(test_data)
    for item in test_data:
        predicted = classifier_fn(item["text"], categories)
        if predicted == item["category"]:
            correct += 1
    accuracy = correct / total
    print(f"Accuracy: {accuracy:.2%}")
    return accuracy
```
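A single accuracy number hides which categories are failing. A per-category breakdown (pure Python; `per_category_accuracy` is an assumed helper name) points you at the classes that need more or better examples:

```python
from collections import defaultdict

def per_category_accuracy(predictions, test_data):
    """Break accuracy down by true category to find weak spots."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, item in zip(predictions, test_data):
        total[item["category"]] += 1
        if pred == item["category"]:
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```

Categories with low scores here are usually the ones whose retrieved examples overlap with a neighboring category, which tells you where to add training tickets.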
Putting It All Together: The Complete Pipeline
Here's the final architecture:
- Data Preparation — Split your labeled data into training and test sets
- Embedding Generation — Create embeddings for all training tickets
- Vector Store — Build a nearest-neighbors index
- Classification Function — for each new ticket, embed it, retrieve the most similar labeled tickets, and classify with a chain-of-thought prompt
- Evaluation — Measure accuracy on test set
Key Takeaways
- Start simple, then iterate. Begin with zero-shot prompting, then add few-shot examples, chain-of-thought reasoning, and finally RAG for maximum accuracy.
- Chain-of-thought reasoning reduces ambiguity. Forcing Claude to explain its reasoning before outputting a category significantly reduces errors on borderline cases.
- RAG makes your classifier scalable. Instead of cramming all examples into a prompt, retrieve the most relevant ones dynamically; the prompt stays small even as your category set and pool of labeled examples grow large.
- Claude excels with limited data. You can achieve 95%+ accuracy with as few as 20 examples per category, thanks to Claude's strong language understanding.
- Explainability is built-in. Every classification comes with a natural language explanation, making it easy to audit and debug your system.