Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy
Learn to build a production-ready classification system using Claude, prompt engineering, and retrieval-augmented generation (RAG). This step-by-step guide covers data preparation, prompt design, vector search integration, and evaluation, taking accuracy on complex business rules from 70% to 95%+.
Large Language Models (LLMs) have transformed classification tasks, especially when dealing with complex business rules, limited training data, or the need for explainable results. In this guide, you'll build a production-ready insurance support ticket classifier using Claude, prompt engineering, and Retrieval-Augmented Generation (RAG).
By the end, you'll have a system that categorizes tickets into 10 categories with 95%+ accuracy—and you'll understand the techniques to replicate this for your own use cases.
Prerequisites
- Python 3.11+ with basic familiarity
- An Anthropic API key (available from the Anthropic Console)
- A VoyageAI API key (optional—embeddings are pre-computed in the cookbook)
- Basic understanding of classification problems
Setup
First, install the required packages:
```bash
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Then load your API keys and set your model:
```python
import os

from anthropic import Anthropic

anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

MODEL_NAME = "claude-3-opus-20240229"  # or claude-3-sonnet-20240229 for faster/cheaper responses
```
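Before moving on, you can optionally confirm the key and model name work with a one-off test message (this snippet is just a sanity check and isn't part of the classifier):

```python
# Optional sanity check: send a trivial message and print the reply
response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=20,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.content[0].text)
```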
Problem Definition: Insurance Support Ticket Classifier
Insurance companies receive thousands of support tickets daily—billing questions, policy changes, claims assistance, and more. Manually categorizing these is slow and error-prone. We'll build a classifier that handles 10 categories, including:
- Billing Inquiries – invoices, charges, fees, premiums
- Policy Administration – changes, cancellations, renewals
- Claims Assistance – filing, documentation, status
- Coverage Explanations – limits, exclusions, deductibles
- (and 6 more categories; see the full cookbook for the rest, and the definitions sketch below)
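The prompts in the next steps expect these definitions as one string. Here is a minimal sketch of how you might hold them; the wording of each definition is illustrative, and the remaining six categories from the cookbook are omitted:

```python
# Illustrative category definitions; the exact wording and the remaining six
# categories come from the cookbook's dataset, not from this guide.
CATEGORIES = {
    "Billing Inquiries": "Invoices, charges, fees, and premium payments.",
    "Policy Administration": "Policy changes, cancellations, and renewals.",
    "Claims Assistance": "Filing claims, documentation, and claim status.",
    "Coverage Explanations": "Coverage limits, exclusions, and deductibles.",
    # ...six more categories from the cookbook...
}

class_definitions = "\n".join(
    f"- {name}: {description}" for name, description in CATEGORIES.items()
)
```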
Step 1: Data Preparation
We'll split our data into training and test sets. The training set is used to build the classifier (via examples in prompts), and the test set evaluates accuracy.
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load your dataset (synthetic data from the cookbook)
df = pd.read_csv("insurance_tickets.csv")

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```
Step 2: Prompt Engineering for Classification
A well-structured prompt is the foundation of accurate classification. Here's a template that includes:
- System message defining the task
- Class definitions for each category
- User query to classify
```python
def build_classification_prompt(query, class_definitions):
    system_prompt = """You are an expert insurance ticket classifier. Your task is to categorize support tickets into one of the following categories. Respond with ONLY the category name."""

    user_prompt = f"""Category definitions:
{class_definitions}

Ticket to classify:
{query}

Category:"""

    return system_prompt, user_prompt
```
Why this works: Claude understands nuanced business rules from natural language definitions alone. No need for thousands of labeled examples.
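The evaluation in Step 5 calls a classify_baseline function that isn't shown above. As a minimal sketch of what it might look like (assuming a class_definitions string built from your category definitions, for example the CATEGORIES dict sketched earlier), it simply feeds build_classification_prompt into the Messages API:

```python
def classify_baseline(query):
    # Prompt-only classification: category definitions, no retrieved examples
    system_prompt, user_prompt = build_classification_prompt(query, class_definitions)
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    # The prompt asks for the category name only, so return the reply as-is
    return response.content[0].text.strip()
```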
Step 3: Implementing Retrieval-Augmented Generation (RAG)
To boost accuracy further, we'll retrieve the most similar examples from our training data and include them in the prompt. This gives Claude concrete reference points.
Generate Embeddings
```python
import voyageai

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

train_texts = train_df["ticket_text"].tolist()
train_embeddings = vo.embed(train_texts, model="voyage-2").embeddings
```
Build a Vector Store
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Index the training embeddings
nn_model = NearestNeighbors(n_neighbors=5, metric="cosine")
nn_model.fit(train_embeddings)
```
Retrieve and Augment
```python
def retrieve_examples(query, k=3):
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    distances, indices = nn_model.kneighbors([query_embedding], n_neighbors=k)
    return train_df.iloc[indices[0]]


def classify_with_rag(query):
    examples = retrieve_examples(query, k=3)

    example_text = ""
    for _, row in examples.iterrows():
        example_text += f"Example ticket: {row['ticket_text']}\nCategory: {row['category']}\n\n"

    system_prompt = "You are an expert insurance ticket classifier."
    user_prompt = f"""Here are some examples of classified tickets:
{example_text}
Now classify this ticket:
{query}

Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text.strip()
```
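A quick usage check (the ticket text below is made up for illustration; the expected label follows from the Billing Inquiries definition):

```python
# Hypothetical ticket text, not from the dataset
print(classify_with_rag("I was charged twice for my premium this month. Can I get the duplicate refunded?"))
# Likely output: Billing Inquiries
```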
Step 4: Adding Chain-of-Thought Reasoning
For even higher accuracy, ask Claude to reason step-by-step before giving the final answer. This is especially useful for ambiguous tickets.
```python
def classify_with_cot(query):
    # Reuse the Step 3 retriever so this takes a single argument, matching the evaluation harness below
    examples = retrieve_examples(query, k=3)
    example_text = "\n\n".join(
        f"Example ticket: {row['ticket_text']}\nCategory: {row['category']}"
        for _, row in examples.iterrows()
    )

    system_prompt = """You are an expert insurance ticket classifier. First, reason step-by-step about the ticket, then provide the final category."""

    user_prompt = f"""Examples:
{example_text}

Ticket: {query}

Let's think step by step:
- What is the main topic of this ticket?
- Which category definition does it match?
- Final category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )

    # The reply contains reasoning followed by the category; keep only the text
    # after the last "Final category:" marker so it can be compared to the gold label
    text = response.content[0].text.strip()
    if "Final category:" in text:
        return text.split("Final category:")[-1].strip()
    return text.splitlines()[-1].strip()
```
Step 5: Testing and Evaluation
Now let's evaluate our classifier on the test set:
```python
def evaluate_classifier(classifier_fn, test_df):
    correct = 0
    total = len(test_df)

    for _, row in test_df.iterrows():
        predicted = classifier_fn(row["ticket_text"])
        if predicted == row["category"]:
            correct += 1

    accuracy = correct / total
    return accuracy
```
```python
# Baseline: prompt only
baseline_accuracy = evaluate_classifier(classify_baseline, test_df)
print(f"Baseline accuracy: {baseline_accuracy:.2%}")  # ~70%

# With RAG
rag_accuracy = evaluate_classifier(classify_with_rag, test_df)
print(f"RAG accuracy: {rag_accuracy:.2%}")  # ~90%

# With RAG + chain-of-thought
cot_accuracy = evaluate_classifier(classify_with_cot, test_df)
print(f"RAG + CoT accuracy: {cot_accuracy:.2%}")  # ~95%+
```
Results and Analysis
| Technique | Accuracy |
|---|---|
| Prompt only | ~70% |
| + RAG (3 examples) | ~90% |
| + Chain-of-Thought | ~95%+ |
- RAG provides concrete, similar examples that ground the model's decision
- Chain-of-thought forces the model to reason through the classification logic, reducing errors from jumping to conclusions
- Claude's instruction following ensures it adheres to complex business rules
Production Considerations
- Latency: RAG adds embedding and retrieval time. Consider caching embeddings (see the sketch after this list) or using a vector database like Pinecone.
- Cost: More examples and longer prompts increase token usage. Tune `k` (the number of retrieved examples) to balance accuracy and cost.
- Explainability: Claude can output its reasoning, making it easy to audit misclassifications.
- Handling edge cases: Add a "None of the above" category for out-of-scope tickets.
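As a minimal sketch of the caching idea mentioned under Latency (an in-process memo of query embeddings; a persistent cache or a managed vector database would take its place in production):

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def embed_query_cached(query: str):
    # Repeated or duplicate tickets skip the embedding API call entirely
    return tuple(vo.embed([query], model="voyage-2").embeddings[0])

def retrieve_examples_cached(query, k=3):
    query_embedding = list(embed_query_cached(query))
    _, indices = nn_model.kneighbors([query_embedding], n_neighbors=k)
    return train_df.iloc[indices[0]]
```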
Full Code Example
Here's a complete, runnable script:
```python
import os

import pandas as pd
import voyageai
from anthropic import Anthropic
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# Setup
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
MODEL = "claude-3-opus-20240229"

# Load data
df = pd.read_csv("insurance_tickets.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Index training data
train_embeddings = vo.embed(train_df["ticket_text"].tolist(), model="voyage-2").embeddings
nn = NearestNeighbors(n_neighbors=3, metric="cosine")
nn.fit(train_embeddings)


def classify(query):
    # Retrieve the most similar training examples
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    _, indices = nn.kneighbors([q_emb])
    examples = train_df.iloc[indices[0]]
    example_text = "\n".join(
        f"Ticket: {row['ticket_text']}\nCategory: {row['category']}"
        for _, row in examples.iterrows()
    )

    # Classify with chain-of-thought; ask for the category alone on the last line
    response = client.messages.create(
        model=MODEL,
        max_tokens=200,
        system="You are an expert insurance ticket classifier. Reason step-by-step, then give the final category alone on the last line.",
        messages=[{
            "role": "user",
            "content": f"Examples:\n{example_text}\n\nClassify: {query}"
        }],
    )
    # Keep only the last line so the prediction can be compared to the label
    return response.content[0].text.strip().splitlines()[-1].strip()


# Test
correct = sum(1 for _, row in test_df.iterrows() if classify(row["ticket_text"]) == row["category"])
print(f"Accuracy: {correct / len(test_df):.2%}")
```
Key Takeaways
- Prompt engineering alone gets ~70% accuracy on complex classification tasks—good for simple cases, but insufficient for nuanced business rules.
- Adding RAG with 3 similar examples boosts accuracy to ~90% by grounding the model in concrete, relevant examples from your training data.
- Chain-of-thought reasoning pushes accuracy to 95%+ by forcing the model to reason step-by-step, reducing logical errors.
- This approach works with limited training data—you don't need thousands of labeled examples. A few hundred well-chosen examples are often enough.
- Explainability is built-in: Claude can output its reasoning, making it easy to audit, debug, and improve your classifier over time.