Skip to content
BeClaude

watch

New
GitHub TrendingGeneralby alexander-rehm

Watch a video (URL or local path) like an editor. Extracts scene-change frames, pacing metrics (cuts/min, shot length), and a dense 0-10s hook microscope; pulls the transcript from native captions (captions-only by default, so no audio leaves the machine without consent). Produces a structured `report.md` and, after answering the user, saves a finished note (report.md plus hero frames) into your Obsidian vault under `Video notes/` (vault configurable via `$WATCH_VAULT_DIR`) - tied to *why* the user watched it.

Summary

This skill equips Claude with the ability to watch a video by downloading it, extracting frames at scene changes, capturing native captions (or Whisper transcript with consent), and computing editorial pacing metrics like cuts per minute and mean shot length.

  • It produces a structured report with a TL;DR, hook analysis, key moments, and quotable quotes, then saves the final note into your Obsidian vault for easy reference.

Overview

/watch - Claude watches a video

You don't have a video input; this skill gives you one. A Python script downloads the video, extracts frames as JPEGs (one per detected shot via scene-change), gets a timestamped transcript (native captions by default; Whisper only on explicit consent), runs editorial pacing metrics, and microscopes the first 10 seconds at higher density. You then Read each frame path to see the images, combine them with the transcript to answer the user, fill the structured report.md, and save the finished note into the user's Obsidian vault.

What v2 does differently

  • Scene-change frame sampling - one frame per detected shot instead of uniform ticks. Cuts the frame budget on long videos while capturing every transition.
  • Editorial pacing metrics - cuts/min, mean shot length, motion (when available). Lets you reason about pacing the way an editor does.
  • Hook microscope - first 10s auto-runs at 2 fps + word-level Whisper. The single most leveraged 10 seconds of any video deserves dense treatment.
  • Structured `report.md` - every watch emits a structured report at <workdir>/report.md with TL;DR, key moments, hook breakdown, editorial profile, quotable moments, entities, concepts, and transcript. Narrative sections are emitted as <!-- pending Claude fill: ... --> markers - you fill them in before saving.
  • Step 4.4 - Save to vault - after answering the user, save the finished note (report.md plus hero frames, with inline ![[ ]] embeds) into the user's Obsidian vault under Video notes/<slug>/. Report-only: the downloaded video and full frame set stay in temp and are deleted.

None of the above add new dependencies - pure ffmpeg + stdlib + the existing Whisper backend.

Configuration - where to save notes

Step 4.4 saves the finished note inside an Obsidian vault (or any folder the user picks). Resolve the save directory in this order, first hit wins; the result is what $VAULT_DIR refers to everywhere below. There is NO hardcoded path: the location is always the user's own.

  1. `$WATCH_VAULT_DIR` env var - if set and the path exists, use it. The user-controlled override.
  2. `WATCH_VAULT_DIR=...` line in `~/.config/watch/.env` - if present and the path exists. This is where a first-run choice (below) is persisted.
  3. Auto-detect a common Obsidian vault - first of ~/Documents/Obsidian/, ~/Obsidian/, ~/Second brain/ that exists.
  4. Nothing configured yet (first run): do NOT guess and do NOT skip silently. At the save step, ask the user once via AskUserQuestion where to save: "Where should I save video notes? Give an Obsidian vault (or any folder) path, or choose to skip saving." If they give a path, persist it by appending WATCH_VAULT_DIR=<path> to ~/.config/watch/.env (create the file if needed) so you never ask again, then use it. If they choose skip, print Report (not saved): <workdir>/report.md in chat and move on.

Finished notes are saved under <vault>/Video notes/<slug>/: one self-contained folder per video, holding the filled note plus its hero-frame JPEGs.

Step 0 - Setup preflight (runs every /watch invocation, silent on success)

Python interpreter: every python3 ... command in this skill is for macOS/Linux. On Windows, substitute python - the python3 command on Windows is the Microsoft Store stub and will not run the script.

Before every /watch run, verify that dependencies and an API key are in place:

bash
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --check

This is a <100ms lookup. On exit 0, the script emits nothing - proceed to Step 1 without comment. Do NOT announce "setup is complete" to the user - they don't need a status message on every turn. The only acceptable user-visible output from Step 0 is when remediation is required.

On non-zero exit, follow the table:

ExitMeaningAction
2Missing binaries (ffmpeg / ffprobe / yt-dlp)Run installer
3No Whisper API keyRun installer to scaffold .env, then ask user for a key
4Both missingRun installer, then ask for a key

The installer is idempotent - safe to re-run:

bash
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py"

On macOS with Homebrew, it auto-installs ffmpeg and yt-dlp. On Linux/Windows, it prints the exact install commands for the user to run. It scaffolds ~/.config/watch/.env with commented placeholders at 0600 perms, and writes SETUP_COMPLETE=true once deps + a key are in place so the next session knows this user has already been through the wizard.

If an API key is missing: that is fine and expected - the default posture is --captions-only, which needs no Whisper key. Do NOT prompt for a key on a normal run. Only the binary check matters: exit 2/4 (ffmpeg / yt-dlp missing) means run the installer; treat a missing key (exit 3) as ready and proceed captions-only. Only when the user explicitly consents to transcribing a caption-less video (the consent gate) should you offer, via AskUserQuestion, to add a Groq (preferred) or OpenAI key to ~/.config/watch/.env.

Within a single session, you can skip Step 0 on follow-up /watch calls - once --check returned 0, nothing about the environment changes between turns.

When to use

  • User pastes a video URL (YouTube, Vimeo, X, TikTok, Twitch clip, most yt-dlp-supported sites) and asks about it.
  • User points at a local video file (.mp4, .mov, .mkv, .webm, etc.) and asks about it.
  • User types /watch <url-or-path> [question].

Recommended limits

  • Default budget: lean (`--max-frames 24`). Reading frames is by far the biggest cost in time and tokens, and the saved note keeps only ~5 hero frames regardless of how many were read. So the default scan is capped at 24 frames. For a talking-head or screen-recording video that is plenty; reading 80 frames of a 17-minute tutorial is slow for almost no extra signal.
  • Going denser when a video needs it: raise --max-frames (e.g. --max-frames 60) for fast-cut or visually dense videos where on-screen detail matters, or focus a section with --start/--end (denser per-second budget). Without an explicit cap the per-duration ceiling is roughly ≤30s up to 30, 30-60s ~40, 1-3min ~60, 3-10min ~80, >10min 100 sparse - all clamped to 2 fps and 100 frames.
  • Best accuracy: videos under 10 minutes. Frame coverage scales inversely with duration.
  • For a long video, prefer focusing a --start/--end range over a sparse full-video scan.

How to invoke

Step 1 - parse the user input. Separate the video source from any question the user asked. The question (or the user's prior stated interest) IS the intent - pass it to the script via --intent. Example: /watch https://youtu.be/abc what's the hook pattern? → source = https://youtu.be/abc, intent = what's the hook pattern?. If no question is given, use a brief inferred intent ("general summary") so the report's TL;DR has a lens. The intent shapes how the report's TL;DR and entity/concept sections get filled at Step 4 - same video with intent "pricing tactics" vs "editing style" produces different reports.

Step 2 - run the watch script. Pass the source verbatim. Do not shell-escape it yourself beyond normal quoting. Default to `--captions-only` (the privacy posture: no audio leaves the machine) and, on Windows, use python not python3:

bash
python "${CLAUDE_SKILL_DIR}/scripts/watch.py" "<source>" --captions-only --max-frames 24 --intent "<intent string>"

Keep --max-frames 24 by default (see Recommended limits - it is the main speed lever). Only raise it, or switch to a focused --start/--end range, when the user wants on-screen detail or the video is genuinely fast-cut. If the video has no captions, the script prints NO_CAPTIONS_CONSENT_REQUIRED and returns frames-only; follow the consent gate (see "Privacy posture and the consent gate" above) before re-running without --captions-only. Pass --intent whenever you have any signal from the user about why they want this video - the question they asked, a stated goal, or a brief inferred summary. Empty --intent works but produces less-targeted report sections.

Optional flags:

  • --start T / --end T - focus on a section. Accepts SS, MM:SS, or HH:MM:SS. When either is set, fps auto-scales denser (see "Focusing on a section" below).
  • --max-frames N - lower the cap for tighter token budget (e.g. --max-frames 40)
  • --resolution W - change frame width in px (default 512; bump to 1024 only if the user needs to read on-screen text)
  • --fps F - override auto-fps (clamped to 2 fps max). Setting --fps disables scene-change sampling.
  • --out-dir DIR - keep working files somewhere specific (default: an auto-generated tmp dir)
  • --whisper groq|openai - force a specific Whisper backend (default: prefer Groq if both keys exist)
  • --no-whisper - disable the Whisper fallback entirely (frames-only if no captions)
  • --captions-only - privacy posture: never send audio off-machine. Uses native captions if present; if none, emits a NO_CAPTIONS_CONSENT_REQUIRED sentinel and proceeds frames-only. Implies --no-whisper and also blocks the hook microscope's audio upload (which plain --no-whisper historically did not).
  • --no-scene-change - force uniform frame sampling (debug only; usually leave on)
  • --no-hook-microscope - skip the 0-10s dense pass (saves ~1 Whisper call)

Privacy posture and the consent gate (default --captions-only)

By default, run with --captions-only so no audio ever leaves the machine without the user's say-so. Two outcomes:

  1. The video has native captions: you get a full transcript for free, nothing is uploaded, proceed to Step 3 as normal.
  2. The video has no captions: the script prints NO_CAPTIONS_CONSENT_REQUIRED (to stderr and inside the report's Transcript section) and returns frames-only. Before doing anything else, use AskUserQuestion to ask the user: "This video has no native captions. Transcribing it means sending its audio to a third-party Whisper API (Groq or OpenAI). Proceed with transcription, or analyze frames-only?" Only if the user consents, re-run the same command without --captions-only (optionally adding --whisper groq|openai). If they decline, continue frames-only.

Never drop the --captions-only flag on the user's behalf. The whole point is that the audio upload is an explicit, consented step, not a silent fallback. (Note: re-running re-downloads the video; a future --list-subs probe in download.py could avoid that, but it is not required for correctness.)

Focusing on a section (higher frame rate)

When the user asks about a specific moment - "what happens at the 2 minute mark?", "zoom into 0:45 to 1:00", "the first 10 seconds" - pass --start and/or --end. The script switches to focused-mode budgets, which are denser than full-video budgets (still capped at 2 fps):

  • ≤5s → 2 fps (up to 10 frames)
  • 5-15s → 2 fps (up to 30 frames)
  • 15-30s → ~2 fps (up to 60 frames)
  • 30-60s → ~1.3 fps (up to 80 frames)
  • 60-180s → ~0.6 fps (100 frames, capped)

Focused mode is the right call for:

  • Any moment/range the user names explicitly ("around 2:30", "the intro", "the last 30 seconds").
  • Any video longer than ~10 minutes where the user's question is about a specific part - running focused on the relevant section is far more useful than a sparse scan of the whole thing.
  • Re-runs after a full scan didn't have enough detail in some region.

Transcript is auto-filtered to the same range. Frame timestamps are absolute (real video timeline, not offset-from-start).

Examples:

bash
# Last 10 seconds of a 1 minute video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" video.mp4 --start 50 --end 60

# Zoom into 2:15 → 2:45 at 3 fps (90 frames)
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 2:15 --end 2:45 --fps 3

# From 1h12m to the end of the video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 1:12:00

Step 3 - Read every frame path the script lists. The Read tool renders JPEGs directly as images for you. Read all frames in a single message (parallel tool calls) so you see them together. The frames are in chronological order with a t=MM:SS timestamp so you can align them to the transcript.

Step 4 - answer the user, then fill the report. You now have three streams of evidence:

  • Frames - what's on screen at each timestamp
  • Transcript - what's said at each timestamp
  • `report.md` - structured artifact at <workdir>/report.md with <!-- pending Claude fill: ... --> markers

First, answer the user's question in chat citing timestamps.

Then, fill in the pending markers in `report.md` using the Edit tool. Walk every <!-- pending Claude fill: ... --> in order:

  • TL;DR - 3-5 bullets through the lens of the user's intent (read from the frontmatter)
  • Key moments - 5-10 timestamped bullets
  • Hook microscope interpretation - frame-by-frame: visual change × what's said; identify the hook pattern (question, contrarian claim, in-medias-res, demo-first, etc.)
  • Editorial profile fingerprint - one-line style summary inferred from pacing numbers + hero frames
  • Quotable moments - top 3-5 punchy, standalone lines from the transcript
  • Entities mentioned - people, companies, tools, places - formatted to match wiki/entities/ slugs (kebab-case, lowercase). Use [[wikilink]] style.
  • Concepts surfaced - frameworks, mental models, named patterns - short gist each

The fully-filled report.md is what gets saved to the vault at Step 4.4. Do not skip the fill - an empty-marker skeleton is not a usable note.

Step 4.4 - Save the finished note to the vault. After filling every marker, resolve $VAULT_DIR per the Configuration section, including the first-run ask when nothing is configured yet. Only if the user explicitly chooses to skip saving do you skip this step (emit Report (not saved): <workdir>/report.md in chat instead).

When $VAULT_DIR resolves:

  1. Derive two names. Slugify the video title from report.md frontmatter (lowercase, ASCII-only, hyphens, max 60 chars). Call that <name>. The dated folder is <name>-YYYY-MM-DD. Example: <name> = andrej-karpathy-just-10xd-everyones-claude-code, folder = andrej-karpathy-just-10xd-everyones-claude-code-2026-06-30.
  2. Create the note folder: $VAULT_DIR/Video notes/<name>-YYYY-MM-DD/.
  3. Save the note named after the video. Save the filled report as <name>.md (NOT report.md) inside that folder, and copy every hero frame (the filenames in the report frontmatter under hero_frames:) alongside it. Do NOT copy the downloaded video or the full frame set; those stay in the temp work dir and get deleted at Step 5. Naming the note after the video keeps each one a distinct node in Obsidian's graph; a vault full of report.md files all collapse into one ambiguous "report" node.
  4. Embed the hero frames inline so they render in Obsidian. Ensure <name>.md has a ## Hero frames section near the end with one Obsidian embed per hero frame, ![[frame_XXXX.jpg]], each with a one-line t=MM:SS caption. The frames sit in the same folder, so Obsidian resolves them by filename.
  5. Echo the vault-relative path in chat on its own line: Report saved: Video notes/<name>-YYYY-MM-DD/<name>.md.

This is the report-only posture: the keepers are the filled note (<name>.md) and its hero frames. No Obsidian wiki ingest, no log.md, no obsidian:// auto-open; the user opens the note in Obsidian themselves. (Edge case: re-watching the exact same video on another day produces the same <name>.md in a different dated folder, which Obsidian treats as a name collision. If that ever matters, append the date to the filename too.)

Step 5 - clean up. The script prints a working directory at the end. Once Step 4.4 has copied report.md and the hero frames into the vault, the temp work dir (downloaded video, full frame set, any audio) is no longer needed - rm -rf it. If the user might ask follow-up questions about the video, leave it until they are done, then delete it.

Transcription

The script gets a timestamped transcript in one of two ways:

  1. Native captions (free, preferred). yt-dlp pulls manual or auto-generated subtitles from the source platform if available.
  2. Whisper API fallback. If no captions came back (or the source is a local file), the script extracts audio (ffmpeg -vn -ac 1 -ar 16000 -b:a 64k, ~0.5 MB/min) and uploads it to whichever Whisper API has a key configured:

- Groq - whisper-large-v3. Preferred default: cheaper, faster. Get a key at console.groq.com/keys. - OpenAI - whisper-1. Fallback. Get a key at platform.openai.com/api-keys.

Both keys live in ~/.config/watch/.env. The script prefers Groq when both are set; override with --whisper openai to force OpenAI. Use --no-whisper to skip the fallback entirely.

Failure modes and handling

  • Setup preflight failed → run python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" (auto-installs ffmpeg/yt-dlp via brew on macOS, scaffolds the .env). For API key, ask the user via AskUserQuestion and write it to ~/.config/watch/.env.
  • No transcript available → captions missing AND (no Whisper key OR Whisper API failed). Script prints a hint pointing to setup. Proceed frames-only and tell the user.
  • Long video warning printed → acknowledge it in your answer. Offer to re-run focused on a specific section via --start/--end rather than a sparse full-video scan.
  • Download fails → yt-dlp's error goes to stderr. If it's a login-required or region-locked video, tell the user plainly; do not keep retrying.
  • Download fails with HTTP 403 Forbidden (captions may still download, but the video file does not) → yt-dlp may need browser impersonation. First check whether it is already installed before touching an installer:

- Check: python -c "import curl_cffi" (python, not python3, on Windows). Exit 0 means it is already present. - If `curl_cffi` is already present: impersonation is NOT the problem. The 403 is most likely transient (YouTube throttling) or a login/region-locked video. Retry the same command once. If it still 403s, tell the user plainly and stop. Do NOT install anything. - If `curl_cffi` is missing: tell the user this video needs browser impersonation and ask, via AskUserQuestion, whether to install curl_cffi (a small Python package, pip install curl_cffi). Only on yes, run the installer (python "${CLAUDE_SKILL_DIR}/scripts/setup.py") and retry. On no, report that this video cannot be downloaded without it and stop. Never install software without consent.

  • Whisper request fails → the error is printed to stderr (likely: invalid key, rate limit, or 25 MB upload limit on a very long video). The report will say "none available" for transcript. You can retry with --whisper openai if Groq failed (or vice versa).
  • Report has unfilled `<!-- pending Claude fill: ... -->` markers → you skipped Step 4. Go back, read the report, fill every marker via Edit, then save to the vault. Never save a half-filled report.
  • Vault save fails (path missing or permissions) → tell the user; the filled report.md still lives in the temp work dir, so nothing is lost. Re-run the copy once the path is fixed, or hand them the work-dir path.

Token efficiency

This skill burns tokens primarily on frames. Order of magnitude:

  • 80 frames at 512px wide is roughly 50-80k image tokens depending on aspect ratio.
  • The transcript is cheap (a few thousand tokens at most for a 10-minute video).
  • Bumping --resolution to 1024 roughly quadruples the image tokens per frame. Only do it when necessary.

If you already watched a video this session and the user asks a follow-up, do not re-run the script - you already have the frames and transcript in context. Just answer from what you have.

Security & Permissions

What this skill does:

  • Runs yt-dlp locally to download the video and pull native captions when the source supports them (public data; the request goes directly to whatever host the URL points at)
  • Runs ffmpeg / ffprobe locally to extract frames as JPEGs and, when Whisper is needed, a mono 16 kHz audio clip
  • Sends the extracted audio clip to Groq's Whisper API (api.groq.com/openai/v1/audio/transcriptions) when GROQ_API_KEY is set (preferred - cheaper, faster)
  • Sends the extracted audio clip to OpenAI's audio transcription API (api.openai.com/v1/audio/transcriptions) when OPENAI_API_KEY is set and Groq is not, or when --whisper openai is forced
  • Writes the downloaded video, frames, audio, and an intermediate transcript to a working directory under the system temp dir (or --out-dir if specified) so Claude can Read them
  • Reads / creates ~/.config/watch/.env (mode 0600) to store the Whisper API key(s) and a SETUP_COMPLETE marker. As a fallback, also reads .env in the current working directory
  • Writes a structured report.md plus copies of hero frames into $VAULT_DIR/Video notes/<slug>/ when a vault is detected at Step 4.4

What this skill does NOT do:

  • Does not upload the video itself to any API - only the extracted audio goes out, and only when native captions are missing AND Whisper is not disabled with --no-whisper
  • Does not access any platform account (no login, no session cookies, no posting)
  • Does not share API keys between providers (Groq key only goes to api.groq.com, OpenAI key only goes to api.openai.com)
  • Does not log, cache, or write API keys to stdout, stderr, or output files
  • Does not persist anything outside the working directory and ~/.config/watch/.env (and the chosen Obsidian vault when one is detected) - clean up the working directory when you're done (Step 5)
  • Does not write anything into the vault beyond the report-only note (report.md + hero frames) under Video notes/<slug>/; no wiki pages, no log.md, no obsidian:// auto-open

Bundled scripts: scripts/watch.py (entry point), scripts/download.py (yt-dlp wrapper), scripts/frames.py (ffmpeg uniform + scene-change extraction + hero selection), scripts/pacing.py (editorial metrics), scripts/hook.py (0-10s microscope), scripts/report.py (structured report emitter), scripts/transcribe.py (caption selection + Whisper orchestration), scripts/whisper.py (Groq / OpenAI clients, supports word-level timestamps), scripts/setup.py (preflight + installer)

Review scripts before first use to verify behavior.

Install & Usage

1
Create the skills directory
mkdir -p .claude/skills
2
Download the skill file

Add the configuration to .claude/skills/watch.md

3
Invoke in Claude Code
/watch

Use Cases

Analyze a lecture recording to extract key slides and spoken highlights for study notes.
Review a product demo video to identify pacing issues and improve future content.
Extract quotes and timestamps from a conference talk for a blog post summary.
Evaluate the hook of a YouTube video by examining the first 10 seconds in detail.
Create an Obsidian note from a tutorial video with scene frames and transcript for later reference.
Compare editorial pacing across multiple videos to understand effective storytelling techniques.

Usage Examples

1

/watch https://youtube.com/watch?v=example

2

/watch /path/to/local/video.mp4

3

Watch this video and tell me the key moments and pacing: https://vimeo.com/example

View source on GitHub

Security Audits

LicenseUnknownSourceWarnRepositoryPass

Frequently Asked Questions

What is watch?

This skill equips Claude with the ability to watch a video by downloading it, extracting frames at scene changes, capturing native captions (or Whisper transcript with consent), and computing editorial pacing metrics like cuts per minute and mean shot length. It produces a structured report with a TL;DR, hook analysis, key moments, and quotable quotes, then saves the final note into your Obsidian vault for easy reference.

How to install watch?

To install watch: create the skills directory (mkdir -p .claude/skills), then add the config to .claude/skills/watch.md. Finally, /watch in Claude Code.

What is watch best for?

watch is a other categorized under General. Created by alexander-rehm.

What can I use watch for?

watch is useful for: Analyze a lecture recording to extract key slides and spoken highlights for study notes.; Review a product demo video to identify pacing issues and improve future content.; Extract quotes and timestamps from a conference talk for a blog post summary.; Evaluate the hook of a YouTube video by examining the first 10 seconds in detail.; Create an Obsidian note from a tutorial video with scene frames and transcript for later reference.; Compare editorial pacing across multiple videos to understand effective storytelling techniques..