BeClaude
Guide · Beginner · Best Practices · 2026-05-15

Building a High-Accuracy Classification System with Claude: From 70% to 95%+ Accuracy

Learn how to build a production-ready classification system using Claude, prompt engineering, and RAG. This guide walks through improving accuracy from 70% to 95%+ with practical code examples.

Quick Answer

This guide teaches you to build a high-accuracy classification system with Claude by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning. You'll learn to improve accuracy from 70% to 95%+ using practical Python code examples.

Tags: classification, prompt-engineering, RAG, chain-of-thought, Anthropic API

Classification is a cornerstone of many business applications, from routing support tickets to moderating content. Traditional machine learning approaches often struggle with complex business rules, limited training data, and the need for explainable results. Large Language Models (LLMs) like Claude offer a powerful alternative.

In this guide, you'll build a production-ready classification system that categorizes insurance support tickets into 10 categories. You'll learn how to progressively improve classification accuracy from a baseline of 70% to over 95% by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning.

Prerequisites

  • Python 3.11+ and basic familiarity with the language
  • An Anthropic API key (required)
  • A VoyageAI API key (optional — embeddings are pre-computed in the cookbook)
  • Basic understanding of classification problems

Setup

First, install the required packages:

pip install anthropic voyageai pandas matplotlib scikit-learn numpy

Next, set up your API keys and model configuration:

import os
from anthropic import Anthropic

# Load API keys from environment variables
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
client = Anthropic(api_key=anthropic_api_key)

# Set your model
MODEL_NAME = "claude-3-opus-20240229"

Why Use LLMs for Classification?

Traditional machine learning classifiers require large amounts of labeled data, extensive feature engineering, and often produce black-box results. LLMs like Claude excel in scenarios where:

  • Complex business rules need to be interpreted and applied
  • Training data is limited or low-quality
  • Explainability is required — Claude can provide natural language justifications for its decisions
  • Categories evolve frequently and need quick updates

Step 1: Data Preparation

Proper data preparation is crucial. You'll need:

  • Training data: Used to build the classification model (via examples in prompts)
  • Test data: Used to evaluate performance

For this insurance ticket classifier, the data includes 10 categories:
  • Billing Inquiries — Questions about invoices, charges, fees, and premiums
  • Policy Administration — Requests for policy changes, updates, or cancellations
  • Claims Assistance — Questions about the claims process and filing procedures
  • Coverage Explanations — Questions about what is covered under specific policy types
  • Account Management — Login issues, profile updates, and account access
  • Underwriting Questions — Risk assessment, policy issuance, and eligibility
  • Fraud and Compliance — Reporting suspicious activity or compliance concerns
  • Agent and Broker Support — Assistance for agents and brokers
  • Product and Service Feedback — Complaints, suggestions, and testimonials
  • General Inquiries — Miscellaneous questions not covered by other categories

Load your data into a pandas DataFrame:

import pandas as pd

# Load training and test data
train_df = pd.read_csv('insurance_tickets_train.csv')
test_df = pd.read_csv('insurance_tickets_test.csv')

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
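Before building prompts, it can help to confirm that every label in the data is one of the ten categories above, since a misspelled label would silently count as an error during evaluation. The following is a minimal sketch: `summarize_labels` is a hypothetical helper, and the toy list stands in for `train_df['category']` (the column name used by the code later in this guide).

```python
from collections import Counter

# The ten category names listed in this guide.
CATEGORIES = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    "Account Management",
    "Underwriting Questions",
    "Fraud and Compliance",
    "Agent and Broker Support",
    "Product and Service Feedback",
    "General Inquiries",
]


def summarize_labels(labels, allowed=CATEGORIES):
    """Return (count per label, sorted list of labels not in `allowed`)."""
    counts = Counter(str(label).strip() for label in labels)
    unknown = sorted(label for label in counts if label not in allowed)
    return counts, unknown


# Toy labels for illustration; in practice pass train_df['category'].
counts, unknown = summarize_labels(
    ["Billing Inquiries", "Claims Assistance", "Billing Inquiries", "Billing"]
)
print(counts["Billing Inquiries"])  # number of tickets with this label
print(unknown)                      # labels that need to be fixed
```

The counts also show whether any category is badly under-represented, which matters later because retrieval can only surface examples that exist in the training set.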

Step 2: Prompt Engineering

Prompt engineering is the foundation of LLM-based classification. A well-crafted prompt includes:

  • System instructions: Define the task and output format
  • Category definitions: Clear descriptions of each class
  • Examples: Few-shot examples to guide the model
  • User query: The ticket to classify

Here's a basic prompt template:

SYSTEM_PROMPT = """You are an insurance support ticket classifier. Your task is to classify each ticket into exactly one of the following categories:

1. Billing Inquiries
2. Policy Administration
3. Claims Assistance
4. Coverage Explanations
5. Account Management
6. Underwriting Questions
7. Fraud and Compliance
8. Agent and Broker Support
9. Product and Service Feedback
10. General Inquiries

Respond with only the category number and name, nothing else."""

def classify_ticket(ticket_text):
    # Send the ticket to Claude with the classification instructions
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        system=SYSTEM_PROMPT,
        messages=[
            {"role": "user", "content": f"Classify this ticket: {ticket_text}"}
        ]
    )
    # The first content block holds the model's text reply
    return response.content[0].text

This baseline approach typically achieves around 70% accuracy. Let's improve it.
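The baseline prompt lists category names only, while the prompt element list above calls for clear descriptions of each class. One way to keep names and descriptions in sync is to store them in a single mapping and render the prompt text from it. This is a sketch under that assumption: `CATEGORY_DEFINITIONS` and `render_category_definitions` are hypothetical names, and the descriptions are copied from the category list in Step 1.

```python
# Descriptions copied from the category list in Step 1.
CATEGORY_DEFINITIONS = {
    "Billing Inquiries": "Questions about invoices, charges, fees, and premiums",
    "Policy Administration": "Requests for policy changes, updates, or cancellations",
    "Claims Assistance": "Questions about the claims process and filing procedures",
    "Coverage Explanations": "Questions about what is covered under specific policy types",
    "Account Management": "Login issues, profile updates, and account access",
    "Underwriting Questions": "Risk assessment, policy issuance, and eligibility",
    "Fraud and Compliance": "Reporting suspicious activity or compliance concerns",
    "Agent and Broker Support": "Assistance for agents and brokers",
    "Product and Service Feedback": "Complaints, suggestions, and testimonials",
    "General Inquiries": "Miscellaneous questions not covered by other categories",
}


def render_category_definitions(definitions=CATEGORY_DEFINITIONS):
    """Format the categories as a numbered list suitable for a prompt."""
    return "\n".join(
        f"{i}. {name}: {description}"
        for i, (name, description) in enumerate(definitions.items(), start=1)
    )


definitions_block = render_category_definitions()
print(definitions_block)
```

Because the list is generated, adding or renaming a category means editing one dictionary entry rather than several hand-written prompts.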

Step 3: Implementing Retrieval-Augmented Generation (RAG)

RAG dramatically improves accuracy by providing Claude with relevant examples from your training data. The idea is simple: for each new ticket, find the most similar tickets from your training set and include them as few-shot examples in the prompt.

Create a Vector Database

First, generate embeddings for your training data:

import voyageai

vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Generate embeddings for training data
train_texts = train_df['ticket_text'].tolist()
train_embeddings = vo.embed(train_texts, model="voyage-2").embeddings
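Embedding services commonly cap how many texts a single request may contain, so a large training set may need to be embedded in batches. The sketch below shows one way to do that. `batched` and `embed_in_batches` are hypothetical helpers, the default `batch_size=128` is an assumed value that you should check against your provider's documentation, and a stand-in embedder is used so the example runs without an API key.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=128):
    """Embed `texts` batch by batch and return one flat list of vectors.

    `embed_fn` takes a list of strings and returns a list of vectors, e.g.
    lambda batch: vo.embed(batch, model="voyage-2").embeddings
    """
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch):
    """Stand-in embedder: maps each text to a 1-D vector holding its length."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(vectors)
```

The order of the returned vectors matches the order of the input texts, which is what the similarity search relies on when it maps indices back to rows of `train_df`.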

Implement Similarity Search

When a new ticket comes in, find the most similar training examples:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_tickets(query, k=3):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Calculate similarities
    similarities = cosine_similarity([query_embedding], train_embeddings)[0]

    # Get top-k indices, most similar first
    top_indices = np.argsort(similarities)[-k:][::-1]

    return train_df.iloc[top_indices]
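To see what the similarity search computes without calling an embedding API, here is a dependency-free sketch of the same idea on made-up 2-D vectors. `cosine` and `top_k_indices` are illustrative names, not part of any library, and real embeddings have many more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_indices(query_vec, corpus_vecs, k=3):
    """Indices of the k corpus vectors most similar to the query, best first."""
    scores = [cosine(query_vec, vec) for vec in corpus_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-D "embeddings", made up for illustration.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.05]
print(top_k_indices(query, corpus, k=2))  # → [0, 2]
```

Vectors 0 and 2 point in nearly the same direction as the query, so they rank first, while the vector pointing the opposite way ranks last.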

Augment the Prompt

Now, include these similar examples in your prompt:

def classify_with_rag(ticket_text):
    # Find similar tickets
    similar = find_similar_tickets(ticket_text, k=3)
    
    # Build examples string
    examples = ""
    for _, row in similar.iterrows():
        examples += f"Ticket: {row['ticket_text']}\nCategory: {row['category']}\n\n"
    
    prompt = f"""Here are examples of classified tickets:

{examples}

Now classify this ticket:
Ticket: {ticket_text}
Category:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        system=SYSTEM_PROMPT,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.content[0].text

This RAG approach typically boosts accuracy to 85-90%.
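Separating prompt construction from the API call makes it easy to inspect exactly what Claude receives. The sketch below assumes the retrieved examples are passed as `(ticket, category)` pairs. `build_rag_prompt` is a hypothetical helper and the tickets are invented for illustration.

```python
def build_rag_prompt(ticket_text, examples):
    """Build the user message from (ticket, category) example pairs."""
    example_block = "".join(
        f"Ticket: {text}\nCategory: {category}\n\n" for text, category in examples
    )
    return (
        "Here are examples of classified tickets:\n\n"
        f"{example_block}"
        "Now classify this ticket:\n"
        f"Ticket: {ticket_text}\n"
        "Category:"
    )


# Made-up tickets for illustration only.
prompt = build_rag_prompt(
    "Why was I charged twice this month?",
    [
        ("My premium went up without notice.", "Billing Inquiries"),
        ("I can't log in to my account.", "Account Management"),
    ],
)
print(prompt)
```

Ending the message with "Category:" mirrors the format of the examples, which nudges the model to answer with a label in the same style.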

Step 4: Adding Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting asks Claude to reason step-by-step before giving the final answer. This is particularly useful for ambiguous tickets that could fit multiple categories.

def classify_with_cot(ticket_text):
    similar = find_similar_tickets(ticket_text, k=3)
    
    examples = ""
    for _, row in similar.iterrows():
        examples += f"Ticket: {row['ticket_text']}\nCategory: {row['category']}\n\n"
    
    prompt = f"""Here are examples of classified tickets:

{examples}

Now classify this ticket. First, reason step-by-step about which category fits best. Then provide your final answer on a new line starting with 'Category:'.

Ticket: {ticket_text}

Reasoning:"""

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        system=SYSTEM_PROMPT,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.content[0].text

Combining RAG with chain-of-thought reasoning pushes accuracy to 95%+.

Step 5: Testing and Evaluation

To evaluate your classifier, run it against your test set and compare predictions to ground truth:

from sklearn.metrics import accuracy_score, classification_report

def evaluate_classifier(classify_func, test_df):
    predictions = []
    for _, row in test_df.iterrows():
        pred = classify_func(row['ticket_text'])
        # extract_category is a helper that parses the response into a category name
        predictions.append(extract_category(pred))

    accuracy = accuracy_score(test_df['category'], predictions)
    print(f"Accuracy: {accuracy:.2%}")
    print(classification_report(test_df['category'], predictions))
    return accuracy
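The evaluation code calls an `extract_category` helper that this guide does not define. The following is one possible implementation, offered as a sketch rather than the original helper. It assumes responses look like "1. Billing Inquiries", a bare category name, or a chain-of-thought answer whose final label sits on a line starting with "Category:". Anything it cannot match is returned as "Unknown", which then counts as a wrong prediction.

```python
import re

# The ten category names listed in this guide.
CATEGORIES = [
    "Billing Inquiries",
    "Policy Administration",
    "Claims Assistance",
    "Coverage Explanations",
    "Account Management",
    "Underwriting Questions",
    "Fraud and Compliance",
    "Agent and Broker Support",
    "Product and Service Feedback",
    "General Inquiries",
]


def extract_category(response_text):
    """Map a raw model response to one of CATEGORIES, or "Unknown"."""
    lines = [line.strip() for line in response_text.splitlines() if line.strip()]
    if not lines:
        return "Unknown"

    # Prefer the last line that starts with "Category:" (chain-of-thought output);
    # otherwise fall back to the last non-empty line.
    labelled = [line for line in lines if line.lower().startswith("category:")]
    candidate = labelled[-1].split(":", 1)[1] if labelled else lines[-1]

    # Drop a leading list number such as "3." or "3)".
    candidate = re.sub(r"^\s*\d+\s*[.):-]?\s*", "", candidate).strip().lower()

    # Exact match first, then a substring match (handles trailing punctuation).
    for name in CATEGORIES:
        if candidate == name.lower():
            return name
    for name in CATEGORIES:
        if name.lower() in candidate:
            return name
    return "Unknown"


print(extract_category("1. Billing Inquiries"))
print(extract_category("The user cannot log in.\nCategory: Account Management"))
```

Tracking how often "Unknown" appears is a cheap way to tell formatting failures apart from genuine misclassifications.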

Putting It All Together

Here's the complete pipeline:

def final_classifier(ticket_text):
    """
    High-accuracy classifier combining RAG and chain-of-thought.
    """
    # Step 1: Find similar examples
    similar = find_similar_tickets(ticket_text, k=5)
    
    # Step 2: Build prompt with examples and CoT instructions
    examples = "\n\n".join([
        f"Ticket: {row['ticket_text']}\nCategory: {row['category']}"
        for _, row in similar.iterrows()
    ])
    
    prompt = f"""You are an expert insurance ticket classifier.

Category definitions:

  • Billing Inquiries: Questions about invoices, charges, fees, and premiums
  • Policy Administration: Requests for policy changes, updates, or cancellations
  • Claims Assistance: Questions about the claims process and filing procedures
... (all 10 categories)

Relevant examples: {examples}

Classify the following ticket. Think step-by-step:

Ticket: {ticket_text}

Reasoning:"""

    # Step 3: Get response from Claude
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.content[0].text

Key Takeaways

  • Prompt engineering is the foundation: Start with clear category definitions and output formatting instructions. This alone can achieve ~70% accuracy.
  • RAG dramatically improves accuracy: By retrieving and including similar examples from your training data, you can boost accuracy to 85-90% without retraining.
  • Chain-of-thought reasoning adds the final edge: Asking Claude to reason step-by-step before outputting the final category pushes accuracy to 95%+ and provides explainable results.
  • This approach works with limited data: Unlike traditional ML classifiers that require thousands of labeled examples, this method works well with just dozens or hundreds of examples.
  • Explainability is built-in: Claude can provide natural language justifications for each classification, making it ideal for regulated industries like insurance and finance.