markitdown-reader

Category: General. Author: kiryafn.

Use when asked to read, analyze, or reference a binary or token-heavy file (PDF, DOCX, PPTX, XLSX, XLS, HTML, ZIP, audio, Outlook MSG, YouTube transcript). Convert to markdown first with the markitdown CLI to save tokens and improve parsing quality.

Overview

Markitdown Reader

Convert binary and token-heavy files to markdown before reading. Uses the markitdown CLI to produce clean .md output that is cheaper and easier to parse than raw binary formats.

Core principle: Never read a binary or large-format file directly when a lightweight markdown conversion exists.

When to Use

  • User provides or references a PDF, DOCX, PPTX, XLSX, XLS, HTML, ZIP, MSG, audio file, or YouTube link
  • File is too large or token-expensive to read natively
  • Format cannot be read natively (PPTX, DOCX, XLS, MSG)
  • User asks to read/summarize/analyze a document in a supported format

Do NOT use when:

  • File is already plain text, markdown, or source code
  • File is a small image (use native image reading)
  • User explicitly asks for raw/native reading
  • PDF is under 5 pages (native PDF reading is fine for small files)

Supported Formats

| Format     | Extension    | Notes                             |
|------------|--------------|-----------------------------------|
| PDF        | .pdf         | Best for large/complex PDFs       |
| Word       | .docx        |                                   |
| PowerPoint | .pptx        | Extracts slide text and notes     |
| Excel      | .xlsx, .xls  | Converts tables to markdown       |
| HTML       | .html, .htm  | Strips tags, keeps structure      |
| ZIP        | .zip         | Extracts and converts contents    |
| Outlook    | .msg         | Headers + body + attachment names |
| Audio      | .wav, .mp3   | Requires ffmpeg                   |
| YouTube    | URL          | Requires transcript availability  |
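
The format list above can be turned into a quick pre-check. The following is a minimal sketch, not part of the skill itself: the function name needs_conversion is hypothetical, and it only looks at the file extension or a youtube.com URL, so it cannot apply the page-count rule for small PDFs.

```shell
# Hypothetical helper: succeeds (exit 0) when the input matches a format
# from the Supported Formats table, fails (exit 1) otherwise.
needs_conversion() {
  input=$1

  # YouTube links are passed to markitdown as URLs, not as files.
  case "$input" in
    http://*youtube.com/*|https://*youtube.com/*) return 0 ;;
  esac

  # Lowercase the extension so REPORT.PDF matches too.
  ext=$(printf '%s' "${input##*.}" | tr '[:upper:]' '[:lower:]')

  case "$ext" in
    pdf|docx|pptx|xlsx|xls|html|htm|zip|msg|wav|mp3) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: needs_conversion "slides.pptx" && echo "convert first". Plain text, markdown, and source files fall through to the failure branch, which matches the "Do NOT use when" list.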

Procedure

code
1. Check markitdown installed  →  command -v markitdown || pip install 'markitdown[all]'
2. Convert to markdown          →  markitdown "<path>" > /tmp/<name>.md 2>/tmp/<name>_err.log
3. Validate output              →  wc -l /tmp/<name>.md (must be > 0 lines)
4. Check output size            →  wc -c /tmp/<name>.md (if > 500KB, read with offset/limit)
5. Read converted file          →  Read tool on /tmp/<name>.md
6. On failure                   →  check /tmp/<name>_err.log, see Error Handling below

Conversion Command

bash
# Always redirect stderr to catch errors
markitdown "/path/to/file.pdf" > /tmp/file.md 2>/tmp/file_err.log

# Check conversion succeeded
if [ ! -s /tmp/file.md ]; then
  cat /tmp/file_err.log
fi
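
The procedure and conversion command can be wrapped in one function. This is a sketch under assumptions: convert_to_md is a hypothetical name, it expects a local file path rather than a URL, and it treats either a non-zero exit status or an empty output file as a failure.

```shell
# Hypothetical wrapper around steps 1-3 of the procedure.
# Prints the path of the converted markdown file on success.
convert_to_md() {
  src=$1
  base=$(basename "$src")
  name=${base%.*}                 # file name without its last extension
  out="/tmp/${name}.md"
  err="/tmp/${name}_err.log"

  # Step 1: make sure the CLI is available.
  if ! command -v markitdown >/dev/null 2>&1; then
    echo "markitdown is not installed; try: pip install 'markitdown[all]'" >&2
    return 127
  fi

  # Steps 2-3: convert, capture stderr, and reject empty output.
  if ! markitdown "$src" > "$out" 2> "$err" || [ ! -s "$out" ]; then
    cat "$err" >&2
    return 1
  fi

  printf '%s\n' "$out"
}
```

Usage: md=$(convert_to_md "/path/to/file.pdf") && wc -l "$md". On failure the error log stays at /tmp/<name>_err.log for the Error Handling step.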

Large Output Handling

If converted markdown exceeds ~500KB or ~10,000 lines:

  • Use Read with offset and limit to read in chunks
  • Summarize sections as you go; don't load everything into context
  • Tell the user about the file size before reading: "Document converted to X lines of markdown. Reading relevant sections."
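
The size check can be scripted before anything is read. A minimal sketch, assuming "~500KB" means 512000 bytes; is_large_output and read_chunk are hypothetical helper names, and read_chunk is only a shell stand-in for Read with offset and limit.

```shell
# Hypothetical helper: succeeds when the converted file crosses either threshold.
is_large_output() {
  md=$1
  # Arithmetic expansion strips the padding some wc implementations add.
  bytes=$(( $(wc -c < "$md") ))
  lines=$(( $(wc -l < "$md") ))
  [ "$bytes" -gt 512000 ] || [ "$lines" -gt 10000 ]
}

# Hypothetical helper: print lines START..END of a file.
read_chunk() {
  sed -n "${2},${3}p" "$1"
}
```

Usage: is_large_output /tmp/doc.md && read_chunk /tmp/doc.md 1 500. Checking both bytes and lines catches files with a few very long lines as well as files with many short ones.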

Quick Reference

| Scenario     | Command                                                                          |
|--------------|----------------------------------------------------------------------------------|
| PDF → MD     | markitdown doc.pdf > /tmp/doc.md 2>/tmp/doc_err.log                              |
| DOCX → MD    | markitdown doc.docx > /tmp/doc.md 2>/tmp/doc_err.log                             |
| PPTX → MD    | markitdown slides.pptx > /tmp/slides.md 2>/tmp/slides_err.log                    |
| XLSX → MD    | markitdown data.xlsx > /tmp/data.md 2>/tmp/data_err.log                          |
| Audio → MD   | markitdown audio.mp3 > /tmp/audio.md 2>/tmp/audio_err.log                        |
| YouTube → MD | markitdown "https://youtube.com/watch?v=ID" > /tmp/yt.md 2>/tmp/yt_err.log       |
| HTML → MD    | markitdown page.html > /tmp/page.md 2>/tmp/page_err.log                          |

Error Handling

| Error                | Cause                                 | Fix |
|----------------------|---------------------------------------|-----|
| command not found    | Not installed                         | pip install 'markitdown[all]' |
| ModuleNotFoundError  | Missing format extra                  | pip install 'markitdown[pdf]' (match the extra to the format) |
| Empty output file    | Corrupted/encrypted file, scanned PDF | Tell the user the file cannot be parsed; suggest OCR for scanned PDFs |
| UnicodeDecodeError   | Encoding mismatch                     | Try markitdown --charset utf-8 <file>, or convert the encoding first |
| ffmpeg not found     | Audio without ffmpeg                  | brew install ffmpeg (macOS) / apt install ffmpeg (Linux) |
| Timeout / hang       | Huge file (>100MB)                    | Split the file first, or extract specific pages: pdftk input.pdf cat 1-20 output chunk.pdf |
| YouTube fails        | No transcript available               | Inform the user: not all videos have transcripts, and no API key fixes this |
| Garbled table output | Complex merged cells in Excel         | Fall back to reading with openpyxl directly in Python |
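
Part of the table above can be automated by scanning the error log. This is a sketch only: suggest_fix is a hypothetical name, and the grep patterns assume the log contains the error names exactly as listed in the table, which may not match what a given markitdown version prints.

```shell
# Hypothetical helper: map a captured error log to a suggestion from the table.
suggest_fix() {
  log=$1
  if grep -q "ModuleNotFoundError" "$log" 2>/dev/null; then
    echo "Install the missing extra, e.g. pip install 'markitdown[pdf]'"
  elif grep -q "UnicodeDecodeError" "$log" 2>/dev/null; then
    echo "Retry with: markitdown --charset utf-8 <file>"
  elif grep -qi "ffmpeg" "$log" 2>/dev/null; then
    echo "Install ffmpeg (brew install ffmpeg / apt install ffmpeg)"
  elif [ ! -s "$log" ]; then
    # Nothing was logged: the empty-output row of the table applies.
    echo "No error logged; the file may be corrupted, encrypted, or a scanned PDF"
  else
    echo "Unrecognized error; show the log to the user"
  fi
}
```

Usage: suggest_fix /tmp/file_err.log. Anything the patterns do not recognize is passed back to the user rather than guessed at.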

Fallback Strategy

If markitdown fails and no fix works:

  1. PDF → try native Read tool (works for small PDFs up to ~10 pages)
  2. XLSX → try python3 -c "import openpyxl; ..." to extract specific sheets
  3. DOCX → try python3 -c "import docx; ..." to extract raw text
  4. Audio → suggest user provide a transcript manually
  5. Always inform user of the failure and what you tried

Common Mistakes

| Mistake                                 | Fix |
|-----------------------------------------|-----|
| Reading large PDF natively              | Convert first; markdown is cheaper |
| Ignoring stderr                         | Always capture 2>err.log; silent failures waste time |
| Writing temp files to project dir       | Use /tmp/ to avoid polluting the workspace |
| Not validating output                   | Check wc -l; empty output means a failed conversion |
| Using on plain text / source code       | Skip; these are already token-efficient |
| Loading entire huge output into context | Use offset/limit on Read for files >500KB |
| Assuming YouTube always works           | Transcripts depend on video settings, not on your setup |

Installation

bash
# All formats (recommended); the quotes stop shells such as zsh from globbing the brackets
pip install 'markitdown[all]'

# Audio also needs ffmpeg
brew install ffmpeg  # macOS
apt install ffmpeg   # Linux

Install & Usage

  1. Create the skills directory
     mkdir -p .claude/skills
  2. Download the skill file
     mkdir -p .claude/skills && curl -o .claude/skills/markitdown-reader.md https://raw.githubusercontent.com/kiryafn/markitdown-skill/main/SKILL.md
  3. Invoke in Claude Code
     /markitdown-reader

Frequently Asked Questions

How do I install markitdown-reader?

To install markitdown-reader, create the .claude/skills directory in your project, then run the curl command to download the skill file. Once installed, invoke it in Claude Code with /markitdown-reader.

What is markitdown-reader best for?

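markitdown-reader is a community plugin categorized under General, created by kiryafn. It is best for reading binary or token-heavy files (PDF, DOCX, PPTX, XLSX, and similar) by converting them to markdown first.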