BeClaude Guide
2026-04-18

Build a Knowledge Graph from Unstructured Documents Using Claude AI

Learn how to extract entities and relationships from documents using Claude's structured outputs, resolve entity variants, and build a queryable knowledge graph for multi-hop reasoning.

Quick Answer

This guide teaches you to transform unstructured documents into a knowledge graph using Claude AI. You'll learn to extract typed entities and relationships with structured outputs, resolve entity variants across documents, assemble a graph, and query it for multi-hop insights—all without training data.

Tags: knowledge-graph, entity-extraction, claude-api, rag, data-pipelines


When you need to answer complex questions that span multiple documents—like "which vendors are connected to this incident?" or "who works with people who worked on project X?"—traditional retrieval-augmented generation (RAG) often falls short. RAG retrieves relevant chunks but doesn't connect facts across documents. The solution is a knowledge graph: entities as nodes and typed relationships as edges, enabling multi-hop reasoning through graph traversal.

Building knowledge graphs traditionally required training named-entity recognizers, relation classifiers, and writing entity-resolution heuristics. With Claude AI, each stage becomes a prompt. This guide shows you how to extract structured knowledge from unstructured text, resolve entity variants, and build a queryable graph—all without labeled training data.

Prerequisites

  • Python 3.11+
  • Anthropic API key (from console.anthropic.com)
  • Basic familiarity with graphs (nodes, edges, traversal)
  • Anthropic Python SDK: pip install anthropic

The Core Idea: From Documents to Graph

The process follows three main stages:

  • Extraction: Pull entities and relationships from each document
  • Resolution: Merge different mentions of the same real-world entity
  • Assembly: Build and query a graph from the cleaned data

We'll use the Apollo program as our test corpus—six Wikipedia summaries that mention NASA, astronauts, and missions with varying terminology.
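Concretely, the three stages compose into a simple pipeline. A shape-only sketch (the stage functions here are placeholders standing in for the implementations developed in the steps below):

```python
from typing import Any, Callable, Dict, List

def run_pipeline(
    documents: List[str],
    extract: Callable[[str], Dict[str, list]],  # Step 1: doc -> entities + relations
    resolve: Callable[[list], list],            # Step 2: all entities -> clusters
    assemble: Callable[[list, list], Any],      # Step 3: extractions + clusters -> graph
) -> Any:
    """Extraction -> resolution -> assembly, as plain function composition."""
    extractions = [extract(doc) for doc in documents]
    all_entities = [e for ex in extractions for e in ex["entities"]]
    clusters = resolve(all_entities)
    return assemble(extractions, clusters)

# Smoke test with trivial stand-ins for each stage:
graph = run_pipeline(
    ["doc one", "doc two"],
    extract=lambda d: {"entities": [d.split()[0]], "relations": []},
    resolve=lambda ents: sorted(set(ents)),
    assemble=lambda ex, cl: {"nodes": cl, "edges": []},
)
print(graph["nodes"])  # -> ['doc']
```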

Step 1: Entity and Relation Extraction with Structured Outputs

Traditional NLP pipelines separate named entity recognition (NER) and relation extraction into different trained models. With Claude, we combine both into a single API call using structured outputs.

First, define your schema using Pydantic models:

from pydantic import BaseModel, Field
from typing import List

class Entity(BaseModel):
    """An entity mentioned in the text"""
    name: str = Field(description="The entity name as mentioned")
    type: str = Field(description="Entity type: PERSON, ORGANIZATION, LOCATION, etc.")
    description: str = Field(description="One-line description from context")

class Relation(BaseModel):
    """A subject-predicate-object triple"""
    subject: str = Field(description="Subject entity name")
    predicate: str = Field(description="Relationship type")
    object: str = Field(description="Object entity name")

class ExtractionResult(BaseModel):
    """Structured extraction from a document"""
    entities: List[Entity]
    relations: List[Relation]
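Because these are ordinary Pydantic models, you can validate a candidate response offline before any API or graph work touches it. A quick self-contained check (the JSON payload is made up for illustration; `Field` descriptions are omitted for brevity):

```python
from typing import List
from pydantic import BaseModel

# Same shapes as the models above, minus the Field descriptions
class Entity(BaseModel):
    name: str
    type: str
    description: str

class Relation(BaseModel):
    subject: str
    predicate: str
    object: str

class ExtractionResult(BaseModel):
    entities: List[Entity]
    relations: List[Relation]

payload = """{
  "entities": [
    {"name": "NASA", "type": "ORGANIZATION", "description": "US space agency"},
    {"name": "Neil Armstrong", "type": "PERSON", "description": "Apollo 11 commander"}
  ],
  "relations": [
    {"subject": "Neil Armstrong", "predicate": "worked_for", "object": "NASA"}
  ]
}"""

# Raises pydantic.ValidationError if the payload doesn't match the schema
result = ExtractionResult.model_validate_json(payload)
print(result.relations[0].predicate)  # -> worked_for
```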

Now extract from a document with a single Claude call:

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

def extract_knowledge(text: str) -> ExtractionResult:
    """Extract entities and relations from text using Claude"""
    prompt = f"""Extract all entities and relationships from the following text.
For each entity, provide its name, type, and a brief description based on context.
For relationships, provide subject-predicate-object triples.

Respond with a single JSON object matching this schema:
{ExtractionResult.model_json_schema()}

Text: {text}"""
    message = client.messages.create(
        model="claude-3-haiku-20240307",  # Fast and cost-effective for extraction
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    )
    # Validate the JSON response against our Pydantic schema;
    # a ValidationError here is your signal to retry the call
    return ExtractionResult.model_validate_json(message.content[0].text)
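In practice the model sometimes wraps its JSON in markdown fences or adds surrounding prose, which breaks naive parsing. A defensive helper for pulling the JSON object out first (a sketch; `extract_json_block` is not part of any SDK):

```python
import json
import re

def extract_json_block(text: str) -> dict:
    """Pull the first JSON object out of a model response that may
    wrap it in ```json fences or surrounding prose."""
    # Prefer the contents of a fenced code block if one is present
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Fall back to the outermost pair of braces
    start, end = candidate.find("{"), candidate.rfind("}")
    return json.loads(candidate[start:end + 1])

raw = 'Here you go:\n```json\n{"entities": [], "relations": []}\n```'
print(extract_json_block(raw))  # -> {'entities': [], 'relations': []}
```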

Why this works:

  • Schema-guided prompting plus Pydantic validation yields JSON that matches your schema—invalid responses fail fast and can be retried
  • No training data needed—Claude understands your domain from the prompt
  • Single pass extraction captures both entities and their relationships

Step 2: Entity Resolution with Claude

Raw extraction gives you overlapping mentions: "NASA" and "National Aeronautics and Space Administration," "Neil Armstrong" and "Armstrong." Building a graph directly creates a fractured mess where the same entity appears as multiple nodes.

Traditional resolution uses string similarity (edit distance, Jaccard similarity), which fails on cases like "Edwin Aldrin" vs "Buzz Aldrin"—different names for the same person.
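You can see the failure with Python's built-in difflib: the two names for Aldrin score lower than two names that denote genuinely different missions (a toy demonstration, not a production resolver):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Simple character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Same person, but the string score is low...
print(similarity("Edwin Aldrin", "Buzz Aldrin"))
# ...while two distinct missions score higher
print(similarity("Gemini 11", "Gemini 12"))
```

Any threshold low enough to merge the Aldrin pair would also wrongly merge the Gemini pair, which is exactly why context-aware resolution is needed.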

Instead, ask Claude to cluster entities by type, using descriptions as context:

class EntityCluster(BaseModel):
    """A cluster of entity mentions that refer to the same real-world entity"""
    canonical_name: str = Field(description="The canonical name for this entity")
    entity_type: str = Field(description="PERSON, ORGANIZATION, etc.")
    mentions: List[str] = Field(description="All surface forms that refer to this entity")
    description: str = Field(description="Consolidated description")

class ResolutionResult(BaseModel):
    """Result of entity resolution across documents"""
    clusters: List[EntityCluster]

def resolve_entities(all_entities: List[Entity]) -> ResolutionResult:
    """Cluster entity mentions into canonical entities"""
    # Prepare context for Claude
    entity_list = "\n".join(
        f"- {e.name} ({e.type}): {e.description}" for e in all_entities
    )
    prompt = f"""Cluster these entities so that each cluster contains different
mentions of the same real-world entity. Use the descriptions to disambiguate.

Entities:
{entity_list}

Respond with a single JSON object matching this schema:
{ResolutionResult.model_json_schema()}"""
    message = client.messages.create(
        model="claude-3-sonnet-20240229",  # Use Sonnet for nuanced reasoning
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    # Validate the JSON response against the resolution schema
    return ResolutionResult.model_validate_json(message.content[0].text)

Watch for these failure modes:

  • Under-merging: Claude leaves some mentions out of clusters, causing data loss
  • Over-merging: Specific entities ("Gemini 12") get folded into broader ones ("Project Gemini")

Production systems should include a fallback: any mention not clustered becomes its own single-element cluster.
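That fallback is a few lines of post-processing; a sketch using plain dicts in place of the `EntityCluster` model:

```python
from typing import Dict, List

def add_singleton_fallback(
    clusters: List[Dict], all_mentions: List[str]
) -> List[Dict]:
    """Ensure every extracted mention ends up in some cluster.
    Any mention the resolver left out becomes its own one-element cluster."""
    covered = {m for c in clusters for m in c["mentions"]}
    for mention in all_mentions:
        if mention not in covered:
            clusters.append({"canonical_name": mention, "mentions": [mention]})
            covered.add(mention)
    return clusters

clusters = [{
    "canonical_name": "NASA",
    "mentions": ["NASA", "National Aeronautics and Space Administration"],
}]
# "Apollo 11" was extracted but never clustered -> becomes a singleton
clusters = add_singleton_fallback(clusters, ["NASA", "Apollo 11"])
print([c["canonical_name"] for c in clusters])  # -> ['NASA', 'Apollo 11']
```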

Step 3: Assembling and Querying the Graph

With resolved entities, create a mapping from surface forms to canonical names, then rewrite all relationship endpoints:

import networkx as nx

def build_knowledge_graph(
    extractions: List[ExtractionResult],
    resolution: ResolutionResult,
) -> nx.MultiDiGraph:
    """Build a directed multigraph from extracted and resolved data"""
    # Create alias mapping
    alias_to_canonical = {}
    for cluster in resolution.clusters:
        for mention in cluster.mentions:
            alias_to_canonical[mention] = cluster.canonical_name

    # Initialize graph
    G = nx.MultiDiGraph()

    # Add nodes with attributes
    for cluster in resolution.clusters:
        G.add_node(
            cluster.canonical_name,
            type=cluster.entity_type,
            description=cluster.description,
        )

    # Add edges (relationships); tag each edge with the index of its source
    # extraction, since ExtractionResult carries no document metadata
    for doc_idx, extraction in enumerate(extractions):
        for rel in extraction.relations:
            # Map subject and object to canonical names
            subj_canonical = alias_to_canonical.get(rel.subject, rel.subject)
            obj_canonical = alias_to_canonical.get(rel.object, rel.object)
            # Add edge only if both endpoints exist in the graph
            if subj_canonical in G and obj_canonical in G:
                G.add_edge(
                    subj_canonical,
                    obj_canonical,
                    predicate=rel.predicate,
                    source_doc=doc_idx,
                )

    return G

We use a MultiDiGraph because:

  • Two entities can have multiple relationship types
  • Direction matters ("A commands B" ≠ "B commands A")
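A minimal demonstration of both properties on a toy graph:

```python
import networkx as nx

G = nx.MultiDiGraph()
G.add_edge("Neil Armstrong", "Apollo 11", predicate="commanded")
G.add_edge("Neil Armstrong", "Apollo 11", predicate="flew_on")  # parallel edge
G.add_edge("Apollo 11", "Moon", predicate="landed_on")

# Both relationships between the same pair of nodes survive:
print(G.number_of_edges("Neil Armstrong", "Apollo 11"))  # -> 2
# Direction matters: there is no edge from the mission back to the commander
print(G.has_edge("Apollo 11", "Neil Armstrong"))  # -> False
```

A plain `DiGraph` would have silently overwritten the first edge when the second was added, losing one of the two relationships.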

Step 4: Multi-Hop Querying

Simple graph queries can be answered with NetworkX traversals. Complex questions require Claude's reasoning:

def answer_graph_question(G: nx.MultiDiGraph, question: str) -> str:
    """Answer a question by extracting relevant subgraph and querying Claude"""
    
    # First, extract entities mentioned in the question
    question_entities = extract_knowledge(question).entities
    entity_names = [e.name for e in question_entities]
    
    # Find relevant subgraph (2-hop neighborhood)
    relevant_nodes = set()
    for entity in entity_names:
        if entity in G:
            # Add entity and its neighbors within 2 hops
            relevant_nodes.add(entity)
            relevant_nodes.update(G.neighbors(entity))
            for neighbor in G.neighbors(entity):
                relevant_nodes.update(G.neighbors(neighbor))
    
    # Extract subgraph and serialize to text
    subgraph = G.subgraph(relevant_nodes)
    
    # Convert to readable format
    graph_text = "Knowledge Graph Subset:\n"
    for node in subgraph.nodes():
        graph_text += f"\n{node} ({subgraph.nodes[node]['type']}): {subgraph.nodes[node].get('description', '')}"
        
    for u, v, data in subgraph.edges(data=True):
        graph_text += f"\n  - {u} {data['predicate']} → {v}"
    
    # Ask Claude to answer using the subgraph
    prompt = f"""Using the following knowledge graph, answer this question: {question}
    
    {graph_text}
    
    Answer concisely based only on the graph above."""
    
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text
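For comparison, the "simple queries" mentioned above reduce to standard traversals: "how is Buzz Aldrin connected to NASA?" is just a shortest-path problem, no LLM call required (toy graph below):

```python
import networkx as nx

G = nx.MultiDiGraph()
G.add_edge("Buzz Aldrin", "Apollo 11", predicate="flew_on")
G.add_edge("Apollo 11", "NASA", predicate="operated_by")
G.add_edge("Neil Armstrong", "Apollo 11", predicate="commanded")

# Multi-hop connection = path finding over the graph
path = nx.shortest_path(G, "Buzz Aldrin", "NASA")
print(path)  # -> ['Buzz Aldrin', 'Apollo 11', 'NASA']
```

Reserve the Claude-backed `answer_graph_question` for questions that need aggregation or interpretation beyond a single path.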

Model Selection: Haiku vs Sonnet

Choose models based on task requirements:

  • Haiku (claude-3-haiku-20240307): Use for high-volume extraction where speed and cost matter. It handles schema-constrained extraction well.
  • Sonnet (claude-3-sonnet-20240229): Use for entity resolution and complex reasoning where nuance matters. It weighs conflicting evidence better.

Cost-quality tradeoff: For a 1,000-token document, Haiku extraction costs ~$0.00025 vs Sonnet's ~$0.003. Resolution across 100 entities costs ~$0.015 with Sonnet. Balance volume tasks with Haiku and reasoning tasks with Sonnet.
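The per-document figures follow directly from per-input-token pricing; a small estimator makes the tradeoff explicit (the rates below are the ones implied by the figures above and should be checked against current Anthropic pricing):

```python
# Input-token prices per million tokens (assumed from the figures above;
# verify against current pricing before budgeting on them)
PRICE_PER_MTOK = {"haiku": 0.25, "sonnet": 3.00}

def extraction_cost(tokens: int, model: str) -> float:
    """Estimated input cost in dollars for one extraction call."""
    return tokens / 1_000_000 * PRICE_PER_MTOK[model]

print(extraction_cost(1_000, "haiku"))   # -> 0.00025
print(extraction_cost(1_000, "sonnet"))  # -> 0.003
```

At a 12x price gap, routing only the resolution and query steps to Sonnet keeps the bulk of a large corpus on the cheap path.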

Production Considerations

  • Scalability: This in-memory approach works for thousands of documents. For larger datasets, export to Neo4j, AWS Neptune, or Postgres with adjacency tables.
  • Incremental updates: When new documents arrive, extract and resolve entities against existing canonical names.
  • Validation: Create a gold set of 50-100 entity pairs and relations to measure precision/recall. Spot-check resolution clusters.
  • Error handling: Implement fallbacks for failed extractions (retry with different prompt) and unclustered entities.
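For the error-handling bullet, a small retry wrapper with exponential backoff is often enough (a generic sketch, not tied to the Anthropic SDK):

```python
import time

def with_retries(fn, attempts: int = 3, backoff: float = 0.1):
    """Call fn(), retrying with exponential backoff on failure.
    Useful around extraction calls that occasionally return unparseable JSON."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(backoff * 2 ** i)

calls = {"n": 0}
def flaky():
    """Fails on the first call, succeeds afterward."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise ValueError("bad JSON")
    return "ok"

print(with_retries(flaky))  # -> ok
```

In a real pipeline, a retry would typically also tighten the prompt (e.g. re-stating the schema) rather than repeating it verbatim.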

Key Takeaways

  • Structured outputs eliminate training data needs: Define Pydantic models for entities and relations, and Claude extracts them from any domain without labeled examples.
  • Claude-driven entity resolution beats string matching: By using contextual descriptions, Claude correctly clusters "Edwin Aldrin" and "Buzz Aldrin" while keeping different "Armstrongs" separate.
  • Multi-hop reasoning becomes graph traversal: Once you have a clean knowledge graph, answering complex questions across documents reduces to finding paths between nodes.
  • Model choice affects cost and quality: Use Haiku for high-volume extraction and Sonnet for nuanced resolution and reasoning, optimizing the cost-quality tradeoff.
  • The pipeline transfers to production graph databases: Start with in-memory NetworkX for prototyping, then migrate to Neo4j, Neptune, or Postgres when you need scalability and persistence.

This approach transforms document collections into connected knowledge that answers questions no single document could address—all powered by Claude's understanding and reasoning capabilities.