Guide | Beginner | Best Practices | 2026-05-15

Building a High-Accuracy Classification System with Claude: From 70% to 95%+ Accuracy

Learn how to build a production-ready classification system using Claude, prompt engineering, and RAG. This guide walks through improving accuracy from 70% to 95%+ with practical code examples.

Quick Answer

This guide teaches you to build a high-accuracy classification system with Claude by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning. You'll learn to improve accuracy from 70% to 95%+ using practical Python code examples.

Tags: classification, prompt-engineering, RAG, chain-of-thought, Anthropic API

Classification is a cornerstone of many business applications, from routing support tickets to moderating content. Traditional machine learning approaches often struggle with complex business rules, limited training data, and the need for explainable results. Large Language Models (LLMs) like Claude offer a powerful alternative.

In this guide, you'll build a production-ready classification system that categorizes insurance support tickets into 10 categories. You'll learn how to progressively improve classification accuracy from a baseline of 70% to over 95% by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning.

Prerequisites

Python 3.11+ and basic familiarity with the language
An Anthropic API key (required)
A VoyageAI API key (optional — embeddings are pre-computed in the cookbook)
Basic understanding of classification problems

Setup

First, install the required packages:

pip install anthropic voyageai pandas matplotlib scikit-learn numpy

Next, set up your API keys and model configuration:

import os
from anthropic import Anthropic

# Load API keys from environment variables
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set your model
MODEL_NAME = "claude-3-opus-20240229"

Why Use LLMs for Classification?

Traditional machine learning classifiers require large amounts of labeled data, extensive feature engineering, and often produce black-box results. LLMs like Claude excel in scenarios where:

Complex business rules need to be interpreted and applied
Training data is limited or low-quality
Explainability is required — Claude can provide natural language justifications for its decisions
Categories evolve frequently and need quick updates

Step 1: Data Preparation

Proper data preparation is crucial. You'll need:

Training data: Supplies the labeled examples that are included in prompts (no model is trained or fine-tuned)
Test data: Used to evaluate performance

For this insurance ticket classifier, the data includes 10 categories:

Billing Inquiries — Questions about invoices, charges, fees, and premiums
Policy Administration — Requests for policy changes, updates, or cancellations
Claims Assistance — Questions about the claims process and filing procedures
Coverage Explanations — Questions about what is covered under specific policy types
Account Management — Login issues, profile updates, and account access
Underwriting Questions — Risk assessment, policy issuance, and eligibility
Fraud and Compliance — Reporting suspicious activity or compliance concerns
Agent and Broker Support — Assistance for agents and brokers
Product and Service Feedback — Complaints, suggestions, and testimonials
General Inquiries — Miscellaneous questions not covered by other categories

Load your data into a pandas DataFrame:

import pandas as pd

# Load training and test data
train_df = pd.read_csv('insurance_tickets_train.csv')
test_df = pd.read_csv('insurance_tickets_test.csv')
print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
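
Before building prompts, it can help to confirm that every label in the data matches one of the ten category names exactly, since a stray spelling would later count as a misclassification. The following is a minimal sketch of such a check; `summarize_labels` and the toy labels are hypothetical, and with the real data you would presumably pass the `category` column of `train_df`.

```python
from collections import Counter

# The ten category names listed in this guide.
CATEGORIES = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    "Account Management",
    "Underwriting Questions",
    "Fraud and Compliance",
    "Agent and Broker Support",
    "Product and Service Feedback",
    "General Inquiries",
]

def summarize_labels(labels, allowed=CATEGORIES):
    """Return (count per label, sorted list of labels not in `allowed`)."""
    cleaned = [str(label).strip() for label in labels]
    counts = Counter(cleaned)
    unknown = sorted(set(cleaned) - set(allowed))
    return counts, unknown

# Toy labels for illustration only; "Refunds" is deliberately not a valid category.
counts, unknown = summarize_labels(
    ["Billing Inquiries", "Claims Assistance", "Billing Inquiries", "Refunds"]
)
print(counts.most_common())
print("Unexpected labels:", unknown)
```

A heavily skewed count or a non-empty list of unexpected labels is worth fixing before measuring accuracy.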

Step 2: Prompt Engineering

Prompt engineering is the foundation of LLM-based classification. A well-crafted prompt includes:

System instructions: Define the task and output format
Category definitions: Clear descriptions of each class
Examples: Few-shot examples to guide the model
User query: The ticket to classify
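
The four parts listed above can be assembled programmatically, which makes it easier to update category definitions without rewriting the prompt by hand. This is a rough sketch, not the guide's own template: the helper names `build_system_prompt` and `build_user_message` are hypothetical, the example tickets are made up, and only three of the ten definitions are shown.

```python
# Definitions copied from this guide's category list (first three only).
CATEGORY_DEFINITIONS = {
    "Billing Inquiries": "Questions about invoices, charges, fees, and premiums",
    "Policy Administration": "Requests for policy changes, updates, or cancellations",
    "Claims Assistance": "Questions about the claims process and filing procedures",
    # Add the remaining categories in the same way.
}

def build_system_prompt(definitions):
    """Combine the system instructions with the category definitions."""
    definition_lines = "\n".join(
        f"- {name}: {text}" for name, text in definitions.items()
    )
    return (
        "You are an insurance support ticket classifier. "
        "Classify each ticket into exactly one of the following categories:\n"
        f"{definition_lines}\n\n"
        "Respond with only the category name, nothing else."
    )

def build_user_message(ticket_text, examples=()):
    """Place few-shot examples (if any) before the ticket to classify."""
    parts = [f"Ticket: {text}\nCategory: {label}" for text, label in examples]
    parts.append(f"Ticket: {ticket_text}\nCategory:")
    return "\n\n".join(parts)

system_prompt = build_system_prompt(CATEGORY_DEFINITIONS)
user_message = build_user_message(
    "Why was I charged twice this month?",  # made-up ticket
    examples=[("How do I file a claim after an accident?", "Claims Assistance")],
)
print(system_prompt)
print(user_message)
```

The two strings would then be passed as the `system` argument and the user message content of a Messages API call.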

Here's a basic prompt template:

SYSTEM_PROMPT = """You are an insurance support ticket classifier. Your task is to classify each ticket into exactly one of the following categories:
1. Billing Inquiries
2. Policy Administration
3. Claims Assistance
4. Coverage Explanations
5. Account Management
6. Underwriting Questions
7. Fraud and Compliance
8. Agent and Broker Support
9. Product and Service Feedback
10. General Inquiries

Respond with only the category number and name, nothing else."""


def classify_ticket(ticket_text):
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        system=SYSTEM_PROMPT,
        messages=[
            {"role": "user", "content": f"Classify this ticket: {ticket_text}"}
        ]
    )
    return response.content[0].text

This baseline approach typically achieves around 70% accuracy. Let's improve it.

Step 3: Implementing Retrieval-Augmented Generation (RAG)

RAG dramatically improves accuracy by providing Claude with relevant examples from your training data. The idea is simple: for each new ticket, find the most similar tickets from your training set and include them as few-shot examples in the prompt.

Create a Vector Database

First, generate embeddings for your training data:

import voyageai

vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Generate embeddings for training data
train_texts = train_df['ticket_text'].tolist()
train_embeddings = vo.embed(train_texts, model="voyage-2").embeddings
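
Embedding APIs generally limit how many texts a single request may contain, so a large training set may need to be embedded in chunks. The sketch below shows one way to do that; `embed_in_batches` is a hypothetical helper, the batch size of 128 is an assumption (check the Voyage documentation for the current limit), and `fake_embed` is a stand-in so the example runs without an API key.

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Embed `texts` in chunks of `batch_size`, preserving the input order.

    `embed_fn` takes a list of strings and returns one vector per string.
    The default batch size is an assumption, not a documented limit.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

def fake_embed(batch):
    """Stand-in embedder: maps each text to a one-dimensional vector of its length."""
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

With the Voyage client from this guide, `embed_fn` could be `lambda batch: vo.embed(batch, model="voyage-2").embeddings`.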

Implement Similarity Search

When a new ticket comes in, find the most similar training examples:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def find_similar_tickets(query, k=3):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    
    # Calculate similarities
    similarities = cosine_similarity([query_embedding], train_embeddings)[0]
    
    # Get top-k indices
    top_indices = np.argsort(similarities)[-k:][::-1]
    
    return train_df.iloc[top_indices]
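
To make the ranking step concrete, here is a dependency-free sketch of the same idea on toy two-dimensional vectors: compute the cosine similarity between the query and each stored vector, then keep the indices of the k highest scores. The function names and vectors are illustrative only, and the code assumes no vector is all zeros.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(query, vectors, k=3):
    """Indices of the k vectors most similar to `query`, best match first."""
    scores = [cosine(query, vector) for vector in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k_indices([1.0, 0.0], toy_vectors, k=2))  # [0, 2]
```

The identical vector ranks first, the nearly parallel one second, and the opposite-facing vector would rank last.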

Augment the Prompt

Now, include these similar examples in your prompt:

def classify_with_rag(ticket_text):
    # Find similar tickets
    similar = find_similar_tickets(ticket_text, k=3)
    
    # Build examples string
    examples = ""
    for _, row in similar.iterrows():
        examples += f"Ticket: {row['ticket_text']}\nCategory: {row['category']}\n\n"
    
    prompt = f"""Here are examples of classified tickets:
{examples}
Now classify this ticket:
Ticket: {ticket_text}
Category:"""
    
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        system=SYSTEM_PROMPT,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.content[0].text

This RAG approach typically boosts accuracy to 85-90%.

Step 4: Adding Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting asks Claude to reason step-by-step before giving the final answer. This is particularly useful for ambiguous tickets that could fit multiple categories.

def classify_with_cot(ticket_text):
    similar = find_similar_tickets(ticket_text, k=3)
    
    examples = ""
    for _, row in similar.iterrows():
        examples += f"Ticket: {row['ticket_text']}\nCategory: {row['category']}\n\n"
    
    prompt = f"""Here are examples of classified tickets:
{examples}
Now classify this ticket. First, reason step-by-step about which category fits best. Then provide your final answer on a new line starting with 'Category:'.
Ticket: {ticket_text}
Reasoning:"""
    
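    # SYSTEM_PROMPT ends with an instruction to respond with only the category,
    # which conflicts with the reasoning requested above, so drop that final paragraph.
    cot_system = SYSTEM_PROMPT.rsplit("\n\n", 1)[0]

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        system=cot_system,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.content[0].text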

Combining RAG with chain-of-thought reasoning pushes accuracy to 95%+.

Step 5: Testing and Evaluation

To evaluate your classifier, run it against your test set and compare predictions to ground truth:

from sklearn.metrics import accuracy_score, classification_report
def evaluate_classifier(classify_func, test_df):
    predictions = []
    for _, row in test_df.iterrows():
        pred = classify_func(row['ticket_text'])
        predictions.append(extract_category(pred))  # Helper to parse response
    
    accuracy = accuracy_score(test_df['category'], predictions)
    print(f"Accuracy: {accuracy:.2%}")
    print(classification_report(test_df['category'], predictions))
    return accuracy
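
The `evaluate_classifier` function above relies on an `extract_category` helper that this guide does not define. One possible implementation is sketched below, under two assumptions: the response either consists of the category alone (optionally with a number prefix) or ends with a line starting with 'Category:', and anything that cannot be parsed maps to a sentinel label, here the hypothetical `"Unparsed"`, so that it counts as an error.

```python
# The ten category names listed in this guide.
CATEGORIES = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    "Account Management",
    "Underwriting Questions",
    "Fraud and Compliance",
    "Agent and Broker Support",
    "Product and Service Feedback",
    "General Inquiries",
]

def extract_category(response_text, categories=CATEGORIES, default="Unparsed"):
    """Map a raw model response to one of the known category names."""
    lowered = response_text.strip().lower()

    # If the response contains reasoning, only look after the last 'Category:' marker.
    marker = "category:"
    marker_pos = lowered.rfind(marker)
    candidate = lowered[marker_pos + len(marker):] if marker_pos != -1 else lowered

    # Pick the known category name that appears earliest in the candidate text.
    found = [(candidate.find(name.lower()), name) for name in categories]
    found = [(pos, name) for pos, name in found if pos != -1]
    if found:
        return min(found)[1]
    return default

print(extract_category("1. Billing Inquiries"))
print(extract_category(
    "The ticket mentions a claim, but the question is about an invoice.\n"
    "Category: Billing Inquiries"
))
```

Returning a sentinel string rather than `None` keeps the prediction list uniformly typed for `accuracy_score` and `classification_report`.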

Putting It All Together

Here's the complete pipeline:

def final_classifier(ticket_text):
    """
    High-accuracy classifier combining RAG and chain-of-thought.
    """
    # Step 1: Find similar examples
    similar = find_similar_tickets(ticket_text, k=5)
    
    # Step 2: Build prompt with examples and CoT instructions
    examples = "\n\n".join([
        f"Ticket: {row['ticket_text']}\nCategory: {row['category']}"
        for _, row in similar.iterrows()
    ])
    
    prompt = f"""You are an expert insurance ticket classifier.
Category definitions:
Billing Inquiries: Questions about invoices, charges, fees, and premiums
Policy Administration: Requests for policy changes, updates, or cancellations
Claims Assistance: Questions about the claims process and filing procedures
... (all 10 categories)
Relevant examples:
{examples}
Classify the following ticket. Think step-by-step, then give your final answer on a new line starting with 'Category:'.
Ticket: {ticket_text}
Reasoning:"""
    
    # Step 3: Get response from Claude
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    
    return response.content[0].text

Key Takeaways

Prompt engineering is the foundation: Start with clear category definitions and output formatting instructions. This alone can achieve ~70% accuracy.
RAG dramatically improves accuracy: By retrieving and including similar examples from your training data, you can boost accuracy to 85-90% without retraining.
Chain-of-thought reasoning adds the final edge: Asking Claude to reason step-by-step before outputting the final category pushes accuracy to 95%+ and provides explainable results.
This approach works with limited data: Unlike traditional ML classifiers that require thousands of labeled examples, this method works well with just dozens or hundreds of examples.
Explainability is built-in: Claude can provide natural language justifications for each classification, making it ideal for regulated industries like insurance and finance.