Building a High-Accuracy Classification System with Claude: From 70% to 95%+ Accuracy
Build a high-accuracy classification system using Claude by combining prompt engineering, retrieval-augmented generation (RAG), and chain-of-thought reasoning. This guide walks through improving accuracy from 70% to 95%+ using an insurance support ticket classifier example.
Classification is one of the most practical applications of Large Language Models (LLMs) in business. Whether you're routing support tickets, categorizing customer feedback, or flagging compliance issues, getting classification right is critical. Traditional machine learning approaches often struggle with complex business rules, limited training data, and the need for explainable results.
In this guide, you'll learn how to build a production-ready classification system using Claude that achieves 95%+ accuracy. We'll use an insurance support ticket classifier as our example, but the techniques apply broadly to any classification problem.
Why LLMs for Classification?
Traditional ML classifiers require:
- Large amounts of labeled training data
- Extensive feature engineering
- Retraining when business rules change
- Separate explainability tools
LLMs like Claude, by contrast, excel at:
- Working effectively with limited examples (few-shot learning)
- Understanding natural language business rules directly
- Providing built-in explanations for every classification
- Adapting instantly to new categories via prompt changes
Prerequisites
Before diving in, make sure you have:
- Python 3.11+ installed
- An Anthropic API key
- A Voyage AI API key (for the embeddings used in Step 5)
- Basic familiarity with Python and classification concepts
Step 1: Setting Up Your Environment
First, install the required packages:
pip install anthropic voyageai pandas matplotlib scikit-learn numpy
Next, set up your API keys and initialize the Claude client:
import os
from anthropic import Anthropic

# Load API keys from environment variables
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")

# Initialize Claude client
client = Anthropic(api_key=anthropic_api_key)
MODEL_NAME = "claude-3-opus-20240229"
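Because `os.environ.get` returns `None` for a missing variable, a forgotten key may only surface later, when the first request fails. The following is a minimal sketch of a fail-fast check; `require_env` is a hypothetical helper name introduced here for illustration, not part of the Anthropic SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (commented out so the snippet runs without real keys):
# anthropic_api_key = require_env("ANTHROPIC_API_KEY")
```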
Step 2: Understanding the Problem
We'll build a classifier for insurance support tickets with 10 categories:
- Billing Inquiries - Questions about invoices, charges, fees
- Policy Administration - Policy changes, cancellations, renewals
- Claims Assistance - Claims process, documentation, status
- Coverage Explanations - What's covered, limits, exclusions
- Account Management - Login issues, profile updates
- Fraud and Compliance - Suspicious activity, regulatory questions
- Agent and Broker Support - Commission questions, agent tools
- Product and Service Inquiries - New products, quotes, comparisons
- Technical Support - Website/app issues, system errors
- General Inquiries - Miscellaneous questions
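One way to keep these definitions consistent across prompts is to hold them in a single data structure and render them on demand. The sketch below copies the names and descriptions from the list above; `CATEGORIES` and `format_categories` are illustrative names, not something the rest of this guide depends on.

```python
# Category names and descriptions copied from the list above.
CATEGORIES = {
    "Billing Inquiries": "Questions about invoices, charges, fees",
    "Policy Administration": "Policy changes, cancellations, renewals",
    "Claims Assistance": "Claims process, documentation, status",
    "Coverage Explanations": "What's covered, limits, exclusions",
    "Account Management": "Login issues, profile updates",
    "Fraud and Compliance": "Suspicious activity, regulatory questions",
    "Agent and Broker Support": "Commission questions, agent tools",
    "Product and Service Inquiries": "New products, quotes, comparisons",
    "Technical Support": "Website/app issues, system errors",
    "General Inquiries": "Miscellaneous questions",
}


def format_categories(categories: dict) -> str:
    """Render the categories as one '- name: description' bullet per line."""
    return "\n".join(f"- {name}: {desc}" for name, desc in categories.items())


print(format_categories(CATEGORIES))
```

Rendering the block from one dictionary means a renamed or added category changes every prompt at once.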
Step 3: Basic Prompt Engineering (70% Accuracy)
Let's start with a simple approach: asking Claude to classify based on category definitions alone.
def classify_ticket_basic(ticket_text: str) -> str:
    """Classify a ticket using only the list of category names."""
    prompt = f"""You are an insurance support ticket classifier.
Classify the following ticket into exactly one category.

Categories:
- Billing Inquiries
- Policy Administration
- Claims Assistance
- Coverage Explanations
- Account Management
- Fraud and Compliance
- Agent and Broker Support
- Product and Service Inquiries
- Technical Support
- General Inquiries

Ticket: {ticket_text}

Respond with only the category name, exactly as written above."""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    # The first content block holds the model's text reply
    return response.content[0].text.strip()
Result: ~70% accuracy. The model understands the categories but struggles with edge cases and ambiguous tickets.
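The classifier returns raw model text, which can carry list numbering, a label prefix, or trailing punctuation even when the category itself is right. The sketch below shows one way to map a reply onto a known category name before comparing it with a label. `normalize_category` is a hypothetical helper, and its cleanup rules are assumptions about likely reply formats, not observed Claude behavior.

```python
import re
from typing import Optional


def normalize_category(raw: str, categories: list) -> Optional[str]:
    """Map a raw model reply to a known category name, or None if nothing matches."""
    # Drop an optional "Category:" label and list numbering such as "1." or "3)"
    cleaned = re.sub(
        r"^\s*(category\s*:)?\s*(\d+\s*[.)]\s*)?", "", raw, flags=re.IGNORECASE
    )
    # Drop surrounding whitespace, quotes, and a trailing period; compare case-insensitively
    cleaned = cleaned.strip().strip(".\"'").casefold()
    for name in categories:
        if cleaned == name.casefold():
            return name
    return None


names = ["Billing Inquiries", "Policy Administration", "Claims Assistance"]
print(normalize_category("1. Billing Inquiries\n", names))  # → Billing Inquiries
print(normalize_category("I'm not sure", names))            # → None
```

A `None` result is a useful signal in its own right: the reply did not name a valid category and can be routed to human review.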
Step 4: Adding Few-Shot Examples (80% Accuracy)
Providing labeled examples improves performance. Aim for 2-3 examples per category; the snippet below shows three in total for brevity:
def classify_ticket_few_shot(ticket_text: str) -> str:
    """Classify a ticket with a fixed set of hand-picked examples in the prompt."""
    examples = """
Example 1:
Ticket: "Why was I charged $150 for a policy change fee?"
Category: Billing Inquiries

Example 2:
Ticket: "I need to add my spouse to my auto policy"
Category: Policy Administration

Example 3:
Ticket: "How do I file a claim for hail damage?"
Category: Claims Assistance
"""
    prompt = f"""You are an insurance support ticket classifier.

Here are examples of classified tickets:
{examples}
Now classify this ticket:
Ticket: {ticket_text}

Respond with only the category name."""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
Result: ~80% accuracy. Examples help, but we're limited by prompt length and need better example selection.
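Hand-written example blocks drift out of balance as the example pool grows. As a rough sketch, the examples can instead be kept as data and formatted with a per-category cap. `build_few_shot_block` and its `per_category` parameter are hypothetical names, and the sample tickets are taken from this guide.

```python
from collections import defaultdict


def build_few_shot_block(examples: list, per_category: int = 2) -> str:
    """Format up to `per_category` examples from each category as prompt text."""
    taken = defaultdict(int)
    parts = []
    for ex in examples:
        if taken[ex["category"]] >= per_category:
            continue  # this category already has enough examples
        taken[ex["category"]] += 1
        parts.append(f'Ticket: "{ex["text"]}"\nCategory: {ex["category"]}')
    return "\n\n".join(parts)


pool = [
    {"text": "Why was I charged $150 for a policy change fee?", "category": "Billing Inquiries"},
    {"text": "Why was I charged a late fee?", "category": "Billing Inquiries"},
    {"text": "My premium went up 20%, why?", "category": "Billing Inquiries"},
    {"text": "I need to add my spouse to my auto policy", "category": "Policy Administration"},
]
print(build_few_shot_block(pool, per_category=2))
```

With the cap set to 2, the third billing ticket is skipped, so no single category dominates the prompt.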
Step 5: Implementing Retrieval-Augmented Generation (RAG) (90% Accuracy)
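Instead of manually selecting examples, embed your labeled examples and retrieve the most similar ones for each incoming ticket. The snippet below keeps the embeddings in memory; a vector database can fill the same role when the example set grows. This is where RAG shines.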
import voyageai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Initialize VoyageAI for embeddings
vo = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))

# Create embeddings for your training data
def get_embeddings(texts):
    result = vo.embed(texts, model="voyage-2")
    return result.embeddings

# Store training examples with their embeddings
training_data = [
    {"text": "Why was I charged a late fee?", "category": "Billing Inquiries"},
    {"text": "I need to cancel my policy", "category": "Policy Administration"},
    # ... more examples
]

# Pre-compute embeddings
training_embeddings = get_embeddings([ex["text"] for ex in training_data])

def retrieve_similar_examples(query: str, k: int = 3):
    query_embedding = get_embeddings([query])[0]
    similarities = cosine_similarity([query_embedding], training_embeddings)[0]
    # argsort is ascending, so take the last k indices and reverse them
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [training_data[i] for i in top_indices]

def classify_ticket_rag(ticket_text: str) -> str:
    # Retrieve most similar examples
    similar_examples = retrieve_similar_examples(ticket_text)
    # Build prompt with retrieved examples
    examples_text = "\n".join([
        f"Ticket: {ex['text']}\nCategory: {ex['category']}"
        for ex in similar_examples
    ])
    prompt = f"""Classify this insurance support ticket.

Relevant examples:
{examples_text}

Ticket to classify: {ticket_text}
Category:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
Result: ~90% accuracy. RAG ensures you always show the most relevant examples for each query.
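To make the retrieval step concrete, here is a dependency-free sketch of the same idea: score every stored vector against the query by cosine similarity and keep the top k. The 2-D vectors are made up for illustration (real embeddings have far more dimensions), and `cosine` and `top_k` are illustrative names rather than part of any library.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list, vectors: list, k: int = 3) -> list:
    """Indices of the k vectors most similar to the query, best first."""
    scores = [cosine(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


# Toy 2-D "embeddings", invented for this example
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k([1.0, 0.0], vectors, k=2))  # → [0, 2]
```

The query points the same way as vector 0 and nearly the same way as vector 2, so those two are returned; the orthogonal and opposite vectors are ranked last.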
Step 6: Adding Chain-of-Thought Reasoning (95%+ Accuracy)
Finally, ask Claude to reason step-by-step before giving the final classification. This dramatically improves accuracy on ambiguous cases.
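def classify_ticket_cot(ticket_text: str) -> dict:
    """Classify with retrieved examples plus step-by-step reasoning."""
    # Retrieve similar examples
    similar_examples = retrieve_similar_examples(ticket_text)
    # Build the examples block outside the f-string: a backslash or reused quote
    # inside an f-string expression is a syntax error before Python 3.12
    examples_text = "\n".join(
        f"- {ex['text']} -> {ex['category']}" for ex in similar_examples
    )
    prompt = f"""Classify this insurance support ticket. First, reason step-by-step, then provide the final category on its own line as "Category: <name>".

Relevant examples:
{examples_text}

Ticket: {ticket_text}

Let's think step by step:
- What is the main topic of this ticket?
- What specific action or information is being requested?
- Which category best matches this combination?

Reasoning:"""
    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    reply = response.content[0].text
    return {
        "reasoning": reply,
        # extract_category is a helper that parses the final category name out of
        # the reply text; it must be defined before this function is called
        "category": extract_category(reply),
    }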
Result: 95%+ accuracy. Chain-of-thought reasoning helps Claude handle edge cases and provides transparent, auditable classifications.
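The chain-of-thought classifier relies on an `extract_category` helper to pull the final label out of free-form reasoning. One possible implementation is sketched below, under two assumptions: the reply may contain a line of the form "Category: <name>", and, failing that, the category mentioned last in the reply is the intended answer. Neither is guaranteed by the model, so treat a `None` result as a case for human review.

```python
import re
from typing import Optional

CATEGORY_NAMES = (
    "Billing Inquiries", "Policy Administration", "Claims Assistance",
    "Coverage Explanations", "Account Management", "Fraud and Compliance",
    "Agent and Broker Support", "Product and Service Inquiries",
    "Technical Support", "General Inquiries",
)


def extract_category(reply: str, categories=CATEGORY_NAMES) -> Optional[str]:
    """Pull the final category name out of a chain-of-thought reply."""
    # Prefer an explicit "Category: ..." line; use the last one in case the
    # reasoning quotes an earlier example
    labeled = re.findall(
        r"^\s*(?:final\s+)?category\s*:\s*(.+)$", reply,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    candidates = [labeled[-1]] if labeled else []
    candidates.append(reply)  # fall back to scanning the whole reply
    for text in candidates:
        lowered = text.casefold()
        # Pick the known category whose last mention comes latest in the text
        hits = [(lowered.rfind(name.casefold()), name) for name in categories]
        hits = [h for h in hits if h[0] >= 0]
        if hits:
            return max(hits)[1]
    return None


reply = "The ticket disputes a charge, not a Policy Administration change.\nCategory: Billing Inquiries"
print(extract_category(reply))  # → Billing Inquiries
```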
Step 7: Testing and Evaluation
Here's how to evaluate your classifier systematically:
def evaluate_classifier(test_data, classifier_fn):
    """Return the fraction of test items whose predicted category matches the label."""
    correct = 0
    total = len(test_data)
    for item in test_data:
        predicted = classifier_fn(item["text"])
        # Guard against a missing prediction before comparing to the label
        if predicted and predicted.strip() == item["category"]:
            correct += 1
    accuracy = correct / total
    print(f"Accuracy: {accuracy:.2%}")
    return accuracy

# Load test data (synthetic or real)
test_data = [
    {"text": "My premium went up 20%, why?", "category": "Billing Inquiries"},
    {"text": "How do I reinstate my lapsed policy?", "category": "Policy Administration"},
    # ... more test cases
]

# Test each approach
print("Basic:", evaluate_classifier(test_data, classify_ticket_basic))
print("Few-shot:", evaluate_classifier(test_data, classify_ticket_few_shot))
print("RAG:", evaluate_classifier(test_data, classify_ticket_rag))
# classify_ticket_cot returns a dict, so pass only its "category" field
print("CoT:", evaluate_classifier(test_data, lambda t: classify_ticket_cot(t)["category"]))
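A single accuracy figure can hide a category that is consistently misclassified. The sketch below breaks accuracy down by true category using only the standard library; the labels and predictions are made up for illustration, and `per_category_accuracy` is a hypothetical helper name. scikit-learn's `classification_report` gives a fuller per-class summary if you prefer a library routine.

```python
from collections import Counter


def per_category_accuracy(labels: list, predictions: list) -> dict:
    """Fraction of correctly classified items for each true category."""
    totals, correct = Counter(), Counter()
    for truth, pred in zip(labels, predictions):
        totals[truth] += 1
        if pred == truth:
            correct[truth] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}


# Made-up labels and predictions, for illustration only
labels = ["Billing Inquiries", "Billing Inquiries", "Claims Assistance"]
predictions = ["Billing Inquiries", "Policy Administration", "Claims Assistance"]
print(per_category_accuracy(labels, predictions))
# → {'Billing Inquiries': 0.5, 'Claims Assistance': 1.0}
```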
Best Practices for Production
- Start simple, iterate fast - Begin with basic prompting, then add complexity as needed
- Use consistent category definitions - Clear, unambiguous definitions prevent confusion
- Balance your examples - Ensure each category has similar representation
- Monitor confidence - Track when Claude is uncertain and flag those cases for human review
- Version your prompts - Small changes can have big impacts; track everything
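As one way to act on the prompt-versioning advice, a short hash of the prompt template can serve as a version identifier that is logged alongside each prediction. This is a minimal sketch: `prompt_version`, the two templates, and the record layout are all illustrative assumptions, not a prescribed scheme.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Short, stable identifier derived from the prompt template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


TEMPLATE_V1 = "Classify this insurance support ticket.\nTicket: {ticket_text}\nCategory:"
TEMPLATE_V2 = TEMPLATE_V1 + "\nRespond with only the category name."

# Store the version next to each prediction so that accuracy changes
# can be traced back to a specific prompt edit
record = {
    "prompt_version": prompt_version(TEMPLATE_V1),
    "ticket": "Why was I charged a late fee?",
    "category": "Billing Inquiries",
}
print(record["prompt_version"], prompt_version(TEMPLATE_V2))
```

Hashing the unfilled template rather than the final prompt keeps the identifier the same across tickets and changes it only when the wording changes.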
Key Takeaways
- Progressive improvement works - Start with basic prompting (70%), add few-shot examples (80%), implement RAG (90%), and finish with chain-of-thought reasoning (95%+)
- RAG eliminates the need for massive training data - By retrieving relevant examples dynamically, you can achieve high accuracy with limited labeled data
- Chain-of-thought reasoning provides transparency - Claude's step-by-step reasoning makes classifications auditable and helps debug edge cases
- The same techniques apply across domains - Whether classifying insurance tickets, customer feedback, or compliance documents, these methods transfer directly
- Production systems need monitoring - Even at 95% accuracy, you need processes for handling uncertain classifications and tracking performance over time