Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
This guide walks you through building an insurance support ticket classifier using Claude. You'll learn prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning to boost accuracy from 70% to over 95%—even with limited training data.
Classification is one of the most common and impactful use cases for large language models (LLMs). Whether you're routing support tickets, moderating content, or categorizing documents, getting classification right can save hours of manual work and improve customer satisfaction.
In this guide, you'll build a production-ready classification system using Claude that categorizes insurance support tickets into 10 distinct categories. You'll start with a simple prompt and progressively improve accuracy from roughly 70% to over 95% by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning.
Why Use Claude for Classification?
Traditional machine learning classifiers require large amounts of labeled training data and struggle with complex business rules or edge cases. Claude excels here because it:
- Handles complex business rules without needing thousands of examples
- Works with limited training data—sometimes just 10–20 examples per class
- Provides natural language explanations for every classification decision
- Adapts quickly to new categories or changing requirements
Prerequisites
Before diving in, make sure you have:
- Python 3.11+ installed
- An Anthropic API key
- A VoyageAI API key (optional—embeddings can be pre-computed)
- Basic familiarity with Python and classification concepts
Setup: Installing Dependencies
First, install the required packages:
```bash
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Then, set up your API keys and initialize the Claude client:
```python
import os
from anthropic import Anthropic

# Load API keys from environment
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set model name
MODEL_NAME = "claude-3-opus-20240229"
```
Step 1: Define Your Classification Problem
For this guide, we'll build an Insurance Support Ticket Classifier. The goal is to route incoming tickets to the right department by categorizing them into one of 10 categories. Here are the first four, with a code sketch after the list:
- Billing Inquiries – Questions about invoices, charges, fees, and premiums
- Policy Administration – Requests for policy changes, updates, or cancellations
- Claims Assistance – Questions about the claims process and filing procedures
- Coverage Explanations – Questions about what is covered under specific policy types
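In code, the category labels can live in a simple list that gets interpolated into the prompt. Here's a minimal sketch: the first four labels come from the list above, while the remaining six are left as a placeholder since this guide only names four.

```python
# The four categories named above; the guide's full taxonomy has ten.
CATEGORIES = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    # ...plus the remaining six categories from your taxonomy
]

# Newline-separated string to interpolate into the prompt below
categories = "\n".join(f"- {c}" for c in CATEGORIES)
```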
Step 2: Build a Simple Baseline Classifier
Let's start with a straightforward prompt that asks Claude to classify a ticket into one of the defined categories:
```python
def classify_ticket_baseline(ticket_text, categories):
    prompt = f"""You are an insurance support ticket classifier.

Classify the following ticket into exactly one of these categories:
{categories}

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
This baseline will likely achieve around 70% accuracy. The problem? Claude has no context about what each category really means, and it has no examples to learn from.
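You can sanity-check the baseline with a quick call. This sketch uses the `categories` string from Step 1; the ticket text is an invented example:

```python
# Hypothetical ticket for a quick smoke test
ticket = "I was charged twice for my premium this month. Can you explain the extra fee?"
print(classify_ticket_baseline(ticket, categories))
# Expected: "Billing Inquiries", possibly wrapped in extra words
# (a weakness the engineered prompt in Step 3 addresses)
```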
Step 3: Improve Accuracy with Prompt Engineering
To boost accuracy, we need to provide:
- Clear category definitions with examples of what each category includes
- Output formatting instructions to ensure consistent responses
- Few-shot examples showing correct classifications (sample structures are sketched after this list)
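The guide doesn't show what `categories_with_definitions` and `examples` contain, so here is one plausible shape for each. The wording is illustrative, not from the original:

```python
# Illustrative definitions; adapt the wording to your own taxonomy
categories_with_definitions = """\
Billing Inquiries: Questions about invoices, charges, fees, and premiums.
Policy Administration: Requests for policy changes, updates, or cancellations.
Claims Assistance: Questions about the claims process and filing procedures.
Coverage Explanations: Questions about what is covered under specific policy types."""

# Illustrative few-shot examples (2-3 per category works well)
examples = """\
Ticket: Why did my premium go up this month?
Category: Billing Inquiries

Ticket: I'd like to add my spouse to my auto policy.
Category: Policy Administration"""
```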
```python
def classify_ticket_engineered(ticket_text, categories_with_definitions, examples):
    prompt = f"""You are an expert insurance support ticket classifier.

Categories and their definitions:
{categories_with_definitions}

Here are some examples of correctly classified tickets:
{examples}

Classify the following ticket. Respond with ONLY the category name.

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
With clear definitions and 2–3 examples per category, accuracy typically jumps to 85–90%.
Step 4: Implement Retrieval-Augmented Generation (RAG)
For the biggest accuracy boost, we'll implement RAG. Instead of hardcoding examples, we'll embed our training data and, for each new ticket, retrieve the most relevant examples to include in the prompt. (For simplicity, this guide keeps the embeddings in memory; in production, a vector database plays the same role at scale.)
Create Embeddings for the Training Examples
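The snippets below assume `training_data` is already loaded as a list of labeled tickets. The guide doesn't show where it comes from, but the structure implied by the code is:

```python
# Assumed structure of training_data (the ticket texts are invented examples)
training_data = [
    {"text": "I was double-billed on my last invoice.", "category": "Billing Inquiries"},
    {"text": "How do I file a claim for hail damage?", "category": "Claims Assistance"},
    # ... 10-20 labeled examples per category
]
```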
```python
import voyageai
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Create embeddings for all training examples
training_texts = [example["text"] for example in training_data]
training_embeddings = vo.embed(training_texts, model="voyage-2").embeddings
```
Retrieve Relevant Examples at Classification Time
```python
def retrieve_examples(query, training_embeddings, training_data, k=3):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Compute similarity scores
    similarities = cosine_similarity([query_embedding], training_embeddings)[0]

    # Get top-k most similar examples
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [training_data[i] for i in top_indices]
```
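For example, a billing-related query should pull back billing examples. The query text here is invented for illustration:

```python
# Retrieve the 3 most similar training examples for a hypothetical query
nearest = retrieve_examples("Why was I charged a late fee?", training_embeddings, training_data)
for ex in nearest:
    print(ex["category"], "|", ex["text"])
```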
Combine RAG with Prompt Engineering
```python
def classify_ticket_rag(ticket_text, categories_with_definitions, training_embeddings, training_data):
    # Retrieve relevant examples
    relevant_examples = retrieve_examples(ticket_text, training_embeddings, training_data)

    # Format examples for the prompt
    examples_text = "\n".join([
        f"Ticket: {ex['text']}\nCategory: {ex['category']}"
        for ex in relevant_examples
    ])

    prompt = f"""You are an expert insurance support ticket classifier.

Categories and their definitions:
{categories_with_definitions}

Here are the most relevant examples for this ticket:
{examples_text}

Classify the following ticket. Respond with ONLY the category name.

Ticket: {ticket_text}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
With RAG, accuracy consistently reaches 95%+ because Claude gets the most relevant examples for each query.
Step 5: Add Chain-of-Thought Reasoning for Explainability
One of Claude's superpowers is providing natural language explanations. By adding a chain-of-thought (CoT) step, we get both the classification and a justification:
```python
def classify_ticket_cot(ticket_text, categories_with_definitions, training_embeddings, training_data):
    relevant_examples = retrieve_examples(ticket_text, training_embeddings, training_data)
    examples_text = "\n".join([
        f"Ticket: {ex['text']}\nCategory: {ex['category']}"
        for ex in relevant_examples
    ])

    prompt = f"""You are an expert insurance support ticket classifier.

Categories and their definitions:
{categories_with_definitions}

Relevant examples:
{examples_text}

Ticket: {ticket_text}

First, think step-by-step about which category this ticket belongs to. Then, provide your final answer.

Reasoning:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    full_response = response.content[0].text.strip()

    # Parse out the category from the reasoning
    # (In practice, you might ask Claude to output JSON with both fields)
    return full_response
```
Now you get both the classification and a human-readable explanation—critical for compliance and auditing in regulated industries like insurance.
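The comment in the code above hints at a more robust variant: ask Claude for JSON and parse both fields. Here's a minimal sketch of that idea; the exact prompt wording and field names are assumptions, not from the original guide:

```python
import json

def classify_ticket_cot_json(ticket_text, categories_with_definitions, training_embeddings, training_data):
    relevant_examples = retrieve_examples(ticket_text, training_embeddings, training_data)
    examples_text = "\n".join(
        f"Ticket: {ex['text']}\nCategory: {ex['category']}" for ex in relevant_examples
    )

    # Assumed prompt wording: request a single JSON object with both fields
    prompt = f"""You are an expert insurance support ticket classifier.

Categories and their definitions:
{categories_with_definitions}

Relevant examples:
{examples_text}

Ticket: {ticket_text}

Think step-by-step, then respond with ONLY a JSON object of the form:
{{"reasoning": "<your reasoning>", "category": "<category name>"}}"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    # Assumes the model returns only the JSON object; add error handling in production
    result = json.loads(response.content[0].text.strip())
    return result["category"], result["reasoning"]
```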
Step 6: Evaluate Your Classifier
Finally, test your classifier against a held-out test set:
```python
from sklearn.metrics import accuracy_score, classification_report

# test_data: a held-out list of {"text", "category"} dicts, same shape as training_data
predictions = []
true_labels = []

for ticket in test_data:
    pred = classify_ticket_rag(
        ticket["text"],
        categories_with_definitions,
        training_embeddings,
        training_data
    )
    predictions.append(pred)
    true_labels.append(ticket["category"])

accuracy = accuracy_score(true_labels, predictions)
print(f"Accuracy: {accuracy:.2%}")
print(classification_report(true_labels, predictions))
```
With the full pipeline, you should see accuracy above 95% with clear per-category precision and recall metrics.
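Since matplotlib and scikit-learn were installed in Setup, you can also visualize which categories get confused with one another. This sketch assumes the `predictions` and `true_labels` lists from the evaluation loop above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot a confusion matrix over the held-out test set
ConfusionMatrixDisplay.from_predictions(
    true_labels, predictions, xticks_rotation="vertical"
)
plt.tight_layout()
plt.show()
```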
Key Takeaways
- Start simple, then iterate. A baseline prompt gets ~70% accuracy. Adding clear definitions and few-shot examples boosts it to 85–90%. RAG pushes it past 95%.
- RAG is a game-changer for classification. By retrieving the most relevant examples for each query, you give Claude the context it needs without overwhelming it with irrelevant data.
- Chain-of-thought reasoning adds transparency. In regulated industries, being able to explain why a ticket was classified a certain way is just as important as the classification itself.
- Claude handles complex business rules with minimal data. You don't need thousands of labeled examples—10–20 per category is often enough to build a highly accurate classifier.
- This approach generalizes beyond insurance. The same pattern—prompt engineering + RAG + CoT—works for any classification problem, from content moderation to document routing.