Building a High-Accuracy Insurance Ticket Classifier with Claude
Learn to build a production-ready classification system using Claude, prompt engineering, and RAG. Achieve 95%+ accuracy on complex business rules with limited training data.
Classification is one of the most common and valuable applications of large language models (LLMs) in business. Traditional machine learning approaches often struggle with complex business rules, limited training data, and the need for explainable results. Claude excels in these scenarios.
In this guide, you'll build a production-ready classification system that categorizes insurance support tickets into 10 distinct categories. You'll learn how to progressively improve classification accuracy from a baseline of ~70% to over 95% by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning.
Prerequisites
Before starting, ensure you have:
- Python 3.11+ installed
- An Anthropic API key
- Basic familiarity with Python and classification concepts
- (Optional) A VoyageAI API key for generating embeddings
Setup and Installation
First, install the required packages:
```shell
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
```
Next, set up your API keys and initialize the Claude client:
```python
import os

from anthropic import Anthropic

# Load API keys from environment variables
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set your model
MODEL_NAME = "claude-3-opus-20240229"
```
Understanding the Problem
Insurance companies receive thousands of support tickets daily. Manually categorizing these tickets is slow, expensive, and error-prone. Our goal is to build a system that automatically classifies tickets into categories like:
- Billing Inquiries – Questions about invoices, charges, premiums, and payment methods
- Policy Administration – Requests for policy changes, cancellations, or renewals
- Claims Assistance – Questions about filing claims, documentation, and status
- Coverage Explanations – Clarifications on what is covered, limits, and exclusions
- (and 6 more categories)
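The evaluation code later in this guide calls a `get_category_definitions()` helper to format these definitions for use in prompts. A minimal sketch of that helper follows; the four definitions are the ones listed above, while the remaining six entries are placeholders you should replace with your own categories.

```python
# The four categories below come from this guide; replace the placeholder
# comment with your remaining six categories and their definitions.
CATEGORY_DEFINITIONS = {
    "Billing Inquiries": "Questions about invoices, charges, premiums, and payment methods",
    "Policy Administration": "Requests for policy changes, cancellations, or renewals",
    "Claims Assistance": "Questions about filing claims, documentation, and status",
    "Coverage Explanations": "Clarifications on what is covered, limits, and exclusions",
    # ... add your remaining six categories here
}

def get_category_definitions():
    """Format the categories as a bulleted string for use inside prompts."""
    return "\n".join(
        f"- {name}: {description}"
        for name, description in CATEGORY_DEFINITIONS.items()
    )
```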
Step 1: Data Preparation
Proper data preparation is the foundation of any good classification system. You'll need two datasets:
- Training data: Labeled examples used to build and refine the classifier
- Test data: Unseen examples used to evaluate performance
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load your dataset
# Assume df has columns: 'ticket_text' and 'category'
df = pd.read_csv("insurance_tickets.csv")

# Split into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
```
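With 10 categories, some may be rare in your data, and a plain random split can leave a category underrepresented in the test set. Passing `stratify` to `train_test_split` keeps each category's share the same in both splits. A minimal sketch with a toy two-category dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset: 15 billing tickets and 5 claims tickets
toy_df = pd.DataFrame({
    "ticket_text": [f"ticket {i}" for i in range(20)],
    "category": ["Billing Inquiries"] * 15 + ["Claims Assistance"] * 5,
})

# stratify preserves the 75/25 category ratio in both train and test sets
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["category"]
)
print(toy_test["category"].value_counts().to_dict())
```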
Step 2: Prompt Engineering for Baseline Classification
Start with a simple zero-shot prompt to establish a baseline. This approach asks Claude to classify a ticket using only the category definitions.
```python
def classify_ticket_zero_shot(ticket_text, categories):
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:

{categories}

Ticket: {ticket_text}

Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Baseline results: Expect around 70-75% accuracy. This is decent but not production-ready.
Step 3: Improving Accuracy with Few-Shot Examples
Adding a few carefully selected examples to your prompt can dramatically improve accuracy. This is called few-shot prompting.
```python
def classify_ticket_few_shot(ticket_text, categories, examples):
    example_text = ""
    for ex in examples:
        example_text += f"Ticket: {ex['text']}\nCategory: {ex['category']}\n\n"
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:

{categories}

Here are some examples:

{example_text}
Ticket: {ticket_text}

Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Results: Accuracy typically jumps to 80-85% with 3-5 well-chosen examples.
Step 4: Implementing Retrieval-Augmented Generation (RAG)
For maximum accuracy, use RAG to dynamically retrieve the most relevant training examples for each query, so that Claude sees context tailored to the ticket at hand.
```python
import numpy as np
import voyageai
from sklearn.metrics.pairwise import cosine_similarity

# Initialize embedding model
vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Generate embeddings for training data
def get_embeddings(texts):
    result = vo.embed(texts, model="voyage-2")
    return result.embeddings

# Pre-compute training embeddings
train_texts = train_df['ticket_text'].tolist()
train_embeddings = get_embeddings(train_texts)

def retrieve_similar_examples(query, k=3):
    query_embedding = get_embeddings([query])[0]
    similarities = cosine_similarity([query_embedding], train_embeddings)[0]
    top_indices = np.argsort(similarities)[-k:][::-1]
    examples = []
    for idx in top_indices:
        examples.append({
            'text': train_df.iloc[idx]['ticket_text'],
            'category': train_df.iloc[idx]['category']
        })
    return examples
```
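Under the hood, retrieval is just a cosine-similarity ranking. A self-contained toy version (mock 3-dimensional vectors in place of real VoyageAI embeddings, and plain NumPy instead of scikit-learn) makes the mechanics easy to verify without any API calls:

```python
import numpy as np

def top_k_by_cosine(query_vec, corpus_vecs, k=3):
    """Return indices of the k most cosine-similar corpus vectors,
    most similar first (a toy stand-in for the retrieval step)."""
    query = np.asarray(query_vec, dtype=float)
    corpus = np.asarray(corpus_vecs, dtype=float)
    sims = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    return np.argsort(sims)[-k:][::-1]

# Mock 3-dimensional embeddings for four training tickets
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(top_k_by_cosine([1.0, 0.05, 0.0], corpus, k=2))  # -> [0 1]
```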
```python
def classify_ticket_with_rag(ticket_text, categories):
    # Retrieve most similar examples
    examples = retrieve_similar_examples(ticket_text, k=3)

    # Build prompt with retrieved examples
    example_text = ""
    for ex in examples:
        example_text += f"Ticket: {ex['text']}\nCategory: {ex['category']}\n\n"
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:

{categories}

Here are similar examples from our database:

{example_text}
Ticket: {ticket_text}

Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Results: RAG pushes accuracy to 90-95% by providing the most contextually relevant examples.
Step 5: Adding Chain-of-Thought Reasoning
For the final accuracy boost, ask Claude to explain its reasoning before giving the final category. This reduces errors by forcing the model to think step-by-step.
```python
def classify_ticket_with_cot(ticket_text, categories):
    examples = retrieve_similar_examples(ticket_text, k=3)
    example_text = ""
    for ex in examples:
        example_text += f"Ticket: {ex['text']}\nCategory: {ex['category']}\n\n"
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one of these categories:

{categories}

Here are similar examples from our database:

{example_text}
First, think step-by-step about which category best fits this ticket. Then, provide your final answer.

Ticket: {ticket_text}

Reasoning:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    full_response = response.content[0].text.strip()
    # Extract the final category (assumes format "Category: X")
    if "Category:" in full_response:
        category = full_response.split("Category:")[-1].strip()
    else:
        category = full_response.split("\n")[-1].strip()
    return category, full_response
```
Results: Chain-of-thought reasoning typically achieves 95%+ accuracy, with the added benefit of explainable classifications.
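The simple string split above breaks if Claude phrases its answer differently. A more defensive sketch (the function name and fallback behavior are suggestions, not part of the guide's original code) validates the extracted text against the known category names:

```python
def extract_category(full_response, valid_categories):
    """Pull the final category out of a chain-of-thought response and check
    it against the known category names; returns None if nothing matches."""
    if "Category:" in full_response:
        candidate = full_response.split("Category:")[-1].strip()
    else:
        candidate = full_response.strip().split("\n")[-1].strip()
    # Tolerate extra punctuation or casing around the category name
    for category in valid_categories:
        if category.lower() in candidate.lower():
            return category
    return None  # Caller can route unmatched responses to human review
```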
Testing and Evaluation
Now, let's evaluate our final system against the test dataset:
```python
from sklearn.metrics import accuracy_score, classification_report

predictions = []
actuals = []

for idx, row in test_df.iterrows():
    ticket = row['ticket_text']
    true_category = row['category']
    # get_category_definitions() should return the formatted list of
    # category names and descriptions used in the prompts above
    predicted_category, reasoning = classify_ticket_with_cot(
        ticket,
        get_category_definitions()
    )
    predictions.append(predicted_category)
    actuals.append(true_category)
    print(f"Ticket {idx}: Predicted={predicted_category}, Actual={true_category}")

# Calculate accuracy
accuracy = accuracy_score(actuals, predictions)
print(f"\nOverall Accuracy: {accuracy:.2%}")

# Detailed report
print("\nClassification Report:")
print(classification_report(actuals, predictions))
```
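Overall accuracy hides which categories get confused with each other. A confusion matrix pinpoints the problem pairs; the sketch below uses toy labels in place of real predictions so you can see the row/column layout (rows are actual categories, columns are predicted):

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for real test-set results
actuals = ["Billing Inquiries", "Claims Assistance",
           "Billing Inquiries", "Coverage Explanations"]
predictions = ["Billing Inquiries", "Claims Assistance",
               "Claims Assistance", "Coverage Explanations"]

labels = ["Billing Inquiries", "Claims Assistance", "Coverage Explanations"]
cm = confusion_matrix(actuals, predictions, labels=labels)
print(cm)  # One billing ticket was misclassified as a claims ticket
```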
Best Practices for Production
- Monitor accuracy drift: Regularly evaluate your classifier against new labeled data to catch performance degradation.
- Cache embeddings: Pre-compute and store embeddings to reduce latency.
- Handle edge cases: Add a "None of the above" category for truly ambiguous tickets.
- Log reasoning: Store chain-of-thought explanations for auditability and debugging.
- Iterate on categories: Refine category definitions based on misclassifications.
Key Takeaways
- Start simple, then layer complexity: Begin with zero-shot prompting, add few-shot examples, then implement RAG and chain-of-thought reasoning for maximum accuracy.
- RAG dramatically improves accuracy: By dynamically retrieving the most relevant examples, you can achieve 90%+ accuracy even with limited training data.
- Chain-of-thought reasoning provides explainability: Claude's step-by-step reasoning not only improves accuracy but also makes classifications auditable and trustworthy.
- Prompt engineering is iterative: Expect to refine your prompts multiple times. Each iteration should target specific failure modes identified during evaluation.
- Claude handles complex business rules: Unlike traditional ML models, Claude can understand nuanced category definitions and edge cases without extensive feature engineering.