BeClaude · Guide · 2026-05-02

Building a High-Accuracy Insurance Ticket Classifier with Claude: From 70% to 95%+ Accuracy

Learn to build a production-ready classification system using Claude, prompt engineering, and RAG. This step-by-step guide takes you from 70% to 95%+ accuracy on complex business rules.

Quick Answer

You'll learn to build a high-accuracy classification system using Claude, prompt engineering, and retrieval-augmented generation (RAG). The guide covers data preparation, prompt design, vector search integration, and evaluation—taking accuracy from 70% to 95%+.

Tags: Claude · Classification · RAG · Prompt Engineering · Insurance


Large Language Models (LLMs) have transformed classification tasks, especially when dealing with complex business rules, limited training data, or the need for explainable results. In this guide, you'll build a production-ready insurance support ticket classifier using Claude, prompt engineering, and Retrieval-Augmented Generation (RAG).

By the end, you'll have a system that categorizes tickets into 10 categories with 95%+ accuracy—and you'll understand the techniques to replicate this for your own use cases.

Prerequisites

  • Python 3.11+ and basic familiarity with it
  • An Anthropic API key (from the Anthropic Console)
  • A VoyageAI API key (optional—embeddings are pre-computed in the cookbook)
  • Basic understanding of classification problems

Setup

First, install the required packages:

pip install anthropic voyageai pandas matplotlib scikit-learn numpy

Then load your API keys and set your model:

import os
from anthropic import Anthropic

anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

MODEL_NAME = "claude-3-opus-20240229" # or claude-3-sonnet for faster/cheaper

Problem Definition: Insurance Support Ticket Classifier

Insurance companies receive thousands of support tickets daily—billing questions, policy changes, claims assistance, and more. Manually categorizing these is slow and error-prone. We'll build a classifier that handles 10 categories, including:

  • Billing Inquiries – invoices, charges, fees, premiums
  • Policy Administration – changes, cancellations, renewals
  • Claims Assistance – filing, documentation, status
  • Coverage Explanations – limits, exclusions, deductibles
  • (and 6 more categories—see the full cookbook)
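In code, the category definitions can live in a plain dict that later feeds the prompt. Here is a sketch: the four category names are from the list above, but the wording of each definition (and the helper name `format_class_definitions`) is illustrative, not the cookbook's exact text.

```python
# Illustrative category definitions. The names match the guide; the
# descriptions are paraphrased and the full cookbook defines all 10.
CLASS_DEFINITIONS = {
    "Billing Inquiries": "Questions about invoices, charges, fees, or premiums.",
    "Policy Administration": "Requests to change, cancel, or renew a policy.",
    "Claims Assistance": "Help filing a claim, with documentation, or checking status.",
    "Coverage Explanations": "Questions about limits, exclusions, or deductibles.",
}

def format_class_definitions(definitions: dict[str, str]) -> str:
    """Render the category dict as the text block the prompt expects."""
    return "\n".join(f"- {name}: {desc}" for name, desc in definitions.items())
```

Keeping definitions in data rather than hard-coded prompt text makes it easy to add, remove, or reword categories without touching the prompt-building logic.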

Step 1: Data Preparation

We'll split our data into training and test sets. The training set is used to build the classifier (via examples in prompts), and the test set evaluates accuracy.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load your dataset (synthetic data from the cookbook)
df = pd.read_csv("insurance_tickets.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

Step 2: Prompt Engineering for Classification

A well-structured prompt is the foundation of accurate classification. Here's a template that includes:

  • System message defining the task
  • Class definitions for each category
  • User query to classify
def build_classification_prompt(query, class_definitions):
    system_prompt = """You are an expert insurance ticket classifier. Your task is to categorize support tickets into one of the following categories. Respond with ONLY the category name."""
    
    user_prompt = f"""Category definitions:
{class_definitions}

Ticket to classify: {query}

Category:"""
    return system_prompt, user_prompt

Why this works: Claude understands nuanced business rules from natural language definitions alone. No need for thousands of labeled examples.

Step 3: Implementing Retrieval-Augmented Generation (RAG)

To boost accuracy further, we'll retrieve the most similar examples from our training data and include them in the prompt. This gives Claude concrete reference points.

Generate Embeddings

import voyageai

vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
train_texts = train_df["ticket_text"].tolist()
train_embeddings = vo.embed(train_texts, model="voyage-2").embeddings

Build a Vector Store

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Index the training embeddings
nn_model = NearestNeighbors(n_neighbors=5, metric="cosine")
nn_model.fit(train_embeddings)

Retrieve and Augment

def retrieve_examples(query, k=3):
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    distances, indices = nn_model.kneighbors([query_embedding], n_neighbors=k)
    return train_df.iloc[indices[0]]

def classify_with_rag(query):
    examples = retrieve_examples(query, k=3)
    example_text = ""
    for i, row in examples.iterrows():
        example_text += f"Example ticket: {row['ticket_text']}\nCategory: {row['category']}\n\n"

    system_prompt = "You are an expert insurance ticket classifier."
    user_prompt = f"""Here are some examples of classified tickets:
{example_text}

Now classify this ticket: {query}

Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=50,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text.strip()

Step 4: Adding Chain-of-Thought Reasoning

For even higher accuracy, ask Claude to reason step-by-step before giving the final answer. This is especially useful for ambiguous tickets.

def classify_with_cot(query):
    # Retrieve examples internally so the function matches the
    # one-argument interface that evaluate_classifier expects.
    examples = retrieve_examples(query, k=3)
    example_text = "".join(
        f"Example ticket: {row['ticket_text']}\nCategory: {row['category']}\n\n"
        for _, row in examples.iterrows()
    )

    system_prompt = """You are an expert insurance ticket classifier. First, reason step-by-step about the ticket, then provide the final category."""

    user_prompt = f"""Examples:
{example_text}
Ticket: {query}

Let's think step by step:

1. What is the main topic of this ticket?
2. Which category definition does it match?
3. Final category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=200,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text.strip()
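One practical wrinkle: the chain-of-thought response contains the reasoning as well as the label, so you need to extract the final category before doing an exact-match comparison against the ground truth. A sketch (the "Final category:" marker matches the prompt above; adjust it if your prompt wording differs):

```python
def extract_category(cot_output: str) -> str:
    """Pull the final category label out of a chain-of-thought response."""
    marker = "Final category:"
    if marker in cot_output:
        # Take everything after the last occurrence of the marker.
        return cot_output.rsplit(marker, 1)[1].strip()
    # Fall back to the last non-empty line if the marker is missing.
    lines = [ln.strip() for ln in cot_output.splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```

Wrap the classifier as `lambda q: extract_category(classify_with_cot(q))` when passing it to the evaluator so the reasoning text doesn't count as a mismatch.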

Step 5: Testing and Evaluation

Now let's evaluate our classifier on the test set:

def evaluate_classifier(classifier_fn, test_df):
    correct = 0
    total = len(test_df)
    
    for idx, row in test_df.iterrows():
        predicted = classifier_fn(row["ticket_text"])
        if predicted == row["category"]:
            correct += 1
    
    accuracy = correct / total
    return accuracy
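A single accuracy number hides which categories are being confused with each other. Since scikit-learn is already installed from Setup, a per-category breakdown is cheap to add. A sketch (the helper name `report_errors` is ours; `y_true` and `y_pred` would come from running one of the classifiers above over the test set):

```python
from sklearn.metrics import classification_report, confusion_matrix

def report_errors(y_true, y_pred, labels):
    """Print per-category precision/recall and return the confusion matrix."""
    print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
    # Rows are true categories, columns are predicted categories.
    return confusion_matrix(y_true, y_pred, labels=labels)
```

The confusion matrix usually points straight at the fix: a pair of frequently-confused categories is a sign their definitions overlap and need sharper wording in the prompt.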

# Baseline: prompt only
baseline_accuracy = evaluate_classifier(classify_baseline, test_df)
print(f"Baseline accuracy: {baseline_accuracy:.2%}")  # ~70%

# With RAG
rag_accuracy = evaluate_classifier(classify_with_rag, test_df)
print(f"RAG accuracy: {rag_accuracy:.2%}")  # ~90%

# With RAG + chain-of-thought
cot_accuracy = evaluate_classifier(classify_with_cot, test_df)
print(f"RAG + CoT accuracy: {cot_accuracy:.2%}")  # ~95%+

Results and Analysis

Technique             Accuracy
--------------------  --------
Prompt only           ~70%
+ RAG (3 examples)    ~90%
+ Chain-of-Thought    ~95%+

Why does this work?
  • RAG provides concrete, similar examples that ground the model's decision
  • Chain-of-thought forces the model to reason through the classification logic, reducing errors from jumping to conclusions
  • Claude's instruction following ensures it adheres to complex business rules

Production Considerations

  • Latency: RAG adds embedding and retrieval time. Consider caching embeddings or using a vector database like Pinecone.
  • Cost: More examples and longer prompts increase token usage. Tune k (number of examples) to balance accuracy and cost.
  • Explainability: Claude can output its reasoning, making it easy to audit misclassifications.
  • Handling edge cases: Add a "None of the above" category for out-of-scope tickets.

Full Code Example

Here's a complete, runnable script:

import os
import pandas as pd
from anthropic import Anthropic
import voyageai
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# Setup
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
MODEL = "claude-3-opus-20240229"

# Load data
df = pd.read_csv("insurance_tickets.csv")
train_df, test_df = train_test_split(df, test_size=0.2)

# Index training data
train_embeddings = vo.embed(train_df["ticket_text"].tolist(), model="voyage-2").embeddings
nn = NearestNeighbors(n_neighbors=3, metric="cosine")
nn.fit(train_embeddings)

def classify(query):
    # Retrieve examples
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    _, indices = nn.kneighbors([q_emb])
    examples = train_df.iloc[indices[0]]
    example_text = "\n".join([
        f"Ticket: {row['ticket_text']}\nCategory: {row['category']}"
        for _, row in examples.iterrows()
    ])

    # Classify with CoT
    response = client.messages.create(
        model=MODEL,
        max_tokens=200,
        system="You are an expert classifier. Reason step-by-step.",
        messages=[{
            "role": "user",
            "content": f"Examples:\n{example_text}\n\nClassify: {query}",
        }],
    )
    return response.content[0].text.strip()

# Test
correct = sum(
    1 for _, row in test_df.iterrows()
    if classify(row["ticket_text"]) == row["category"]
)
print(f"Accuracy: {correct / len(test_df):.2%}")

Key Takeaways

  • Prompt engineering alone gets ~70% accuracy on complex classification tasks—good for simple cases, but insufficient for nuanced business rules.
  • Adding RAG with 3 similar examples boosts accuracy to ~90% by grounding the model in concrete, relevant examples from your training data.
  • Chain-of-thought reasoning pushes accuracy to 95%+ by forcing the model to reason step-by-step, reducing logical errors.
  • This approach works with limited training data—you don't need thousands of labeled examples. A few hundred well-chosen examples are often enough.
  • Explainability is built-in: Claude can output its reasoning, making it easy to audit, debug, and improve your classifier over time.