BeClaude

SoMatic

GitHub Trending · General · by Smyan1909

Agent-first CLI for native UI automation with Set-of-Marks screenshots. MCP server + headless Xvfb support included.

First seen 5/22/2026

Overview

SoMatic

Use SoMatic when you need to operate a native desktop UI with screenshots, mouse, and keyboard.

How images reach you

MCP path: annotated screenshots arrive inline as image content in the tool response — you see the image directly, no extra step needed.

CLI path: somatic screenshot --annotate writes the annotated PNG to disk and returns its path in screenshot.annotated_path. The JSON output is text only, so use the Read tool on that path to see the image.
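A minimal sketch of the CLI path, assuming the command prints a single JSON object with the path nested at screenshot.annotated_path as described above. The helper name `annotated_path_from` and the sample payload are hypothetical; real output will likely carry more fields.

```python
import json


def annotated_path_from(stdout_text: str) -> str:
    """Return screenshot.annotated_path from the CLI's JSON output."""
    payload = json.loads(stdout_text)
    try:
        return payload["screenshot"]["annotated_path"]
    except (KeyError, TypeError) as exc:
        raise ValueError("no screenshot.annotated_path in output") from exc


# Hypothetical sample output, for illustration only.
sample = '{"screenshot": {"annotated_path": "/tmp/somatic/shot_annotated.png"}}'
print(annotated_path_from(sample))
```

The returned string is what you would then hand to the Read tool.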

Operating Loop

  1. At the start of a session, run `vision_init` (MCP) or `somatic vision init` (CLI). This loads the YOLO ONNX model into a background daemon. First-ever run may take 1–3 minutes; subsequent runs are near-instant. Require "started": true (or "already_running": true) before continuing.
  1. Always begin a task by taking an annotated screenshot and visually inspecting it.

- MCP: call screenshot_annotated; the image arrives inline.
- CLI: run somatic screenshot --annotate, then Read the path at screenshot.annotated_path.

  1. Scan the image before acting. Check every region — taskbar, dock, desktop icons, system tray, open windows — and identify which numbered mark corresponds to the element you want.
  1. Prefer clicks on visible elements over keyboard navigation. If the target is already present in the annotated screenshot (a taskbar icon, a tab, a button, a link), click its mark id. Do not open Start menu / Run / search when the thing you want is already on screen.
  1. Use the keyboard for what keyboards are for, not as a shortcut around looking. Keyboard is the right tool for:

- typing free-form text (type_text "hello")
- key chords that aren't UI elements (hotkey ctrl s, press enter, hotkey alt tab)
- launching something that genuinely isn't visible anywhere (then use Win+S, type, screenshot the results, and click the best match by mark id; don't blind-press Enter)

  1. Inspect the JSON returned by the screenshot tool: marks contains id, bbox, center, and confidence. There are no captions — refer to elements by id and verify visually.
  1. Click by mark id. `click <id>` automatically clicks the center of that mark's bounding box — you never need to calculate or look up any pixel coordinates.

- click 4 → clicks the center of mark 4's bbox automatically
- move 7 → moves the cursor to the center of mark 7's bbox automatically
- scroll -5 --target 2 → scrolls near mark 2 automatically
- NEVER extract `center` or `bbox` from the JSON and pass raw pixel coordinates like `click 540,320`. That defeats the purpose of mark ids. `click <id>` does it for you.

  1. When YOLO doesn't annotate the exact target — empty text inputs, fields that follow a labelled icon, gaps between buttons — use click_near with a dx/dy offset from the nearest visible mark:

- click_near 12 --dx 300 --dy 0 (300 px to the right of mark 12's center, automatically)

  1. Use raw coordinates only as a last resort when no mark and no nearby anchor exists at all:

- click 640,420

  1. Re-screenshot after every consequential action. Mark IDs are reassigned per screenshot — never apply an id from one screenshot to another screenshot's state.
  1. At the end of the session, run `vision_stop` to free the model's memory.
  1. If something goes wrong, run doctor and vision_status.
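Two rules in the loop above lend themselves to a small sketch: gating on the vision init result, and refusing mark ids that belong to an older screenshot. The field names `started`, `already_running`, and `id` come from the text; `vision_ready`, `MarkTracker`, and the value shapes in the sample mark are hypothetical illustrations, not part of SoMatic.

```python
from dataclasses import dataclass, field


def vision_ready(result: dict) -> bool:
    """True once vision init reports started or already_running."""
    return bool(result.get("started") or result.get("already_running"))


@dataclass
class MarkTracker:
    """Remembers which mark ids belong to the latest screenshot only."""
    valid_ids: set = field(default_factory=set)

    def on_screenshot(self, marks: list) -> None:
        # Ids are reassigned per screenshot, so replace rather than merge.
        self.valid_ids = {m["id"] for m in marks}

    def on_action(self) -> None:
        # After a consequential action the old ids no longer describe the screen.
        self.valid_ids = set()

    def require(self, mark_id: int) -> int:
        if mark_id not in self.valid_ids:
            raise LookupError(f"mark {mark_id} is stale or unknown; re-screenshot first")
        return mark_id


tracker = MarkTracker()
# Hypothetical mark; the bbox and center layouts are assumptions.
tracker.on_screenshot([{"id": 4, "bbox": [500, 300, 580, 340],
                        "center": [540, 320], "confidence": 0.91}])
print(tracker.require(4))   # id is valid for the current screenshot
tracker.on_action()         # any further require(4) now raises LookupError
```

The tracker never reads center or bbox; it only decides whether an id may still be passed to click.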

Decision Rule: Click, Click-Near, or Type?

When choosing how to advance the task, ask: what does the latest annotated screenshot show?

  • Target visible as a mark → click <id>. SoMatic resolves the id to the bbox center for you. Don't calculate coordinates. Don't open a launcher.
  • Target NOT visible as a mark but adjacent to one → click_near <id> --dx ... --dy .... SoMatic resolves the anchor id to its center and applies your offset. (Common for text inputs that sit next to a + or send button.)
  • Target's container visible but not the target itself → click into the container first, re-screenshot, then act on the new marks.
  • Target genuinely invisible → keyboard shortcut (Win+S to search, Ctrl+L to focus URL bar, etc.). After the keypress, screenshot again before doing anything else.
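The four cases above can be read as an ordered fall-through. The sketch below encodes that ordering; `choose_action` and its parameters are hypothetical, and the returned strings are descriptions for the agent rather than literal commands.

```python
from typing import Optional


def choose_action(target_mark: Optional[int],
                  adjacent_mark: Optional[int],
                  container_mark: Optional[int]) -> str:
    """Map what the latest annotated screenshot shows to the next step."""
    if target_mark is not None:
        return f"click {target_mark}"
    if adjacent_mark is not None:
        # dx/dy are left for the caller to estimate from the image.
        return f"click_near {adjacent_mark} --dx <dx> --dy <dy>"
    if container_mark is not None:
        return f"click {container_mark}, then re-screenshot"
    return "keyboard shortcut, then re-screenshot"


print(choose_action(4, None, None))
print(choose_action(None, 12, None))
print(choose_action(None, None, 3))
print(choose_action(None, None, None))
```

Comparing against None rather than truthiness keeps a mark id of 0 usable, in case ids start at zero.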

Anti-patterns — never do these:

  • ❌ click 540,320 when you could use click 4: marks are already positioned at the right place
  • ❌ Reading center from the marks JSON and passing those pixels to click; click <id> does this automatically
  • ❌ Calculating coordinates by eyeballing the annotated image — click <id> is always more accurate

Command Rules

  • Treat command output as JSON, not prose.
  • Use --dry-run before risky pointer or keyboard actions when planning a move.
  • If a screenshot tool returns vision_unavailable, call vision_init and retry.
  • Don't pre-emptively press Escape or click empty space to "clear state" — trust what the last screenshot shows.
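The vision_unavailable rule can be sketched as a single retry around the screenshot call. How the condition appears in the JSON is not specified here, so the `{"error": "vision_unavailable"}` shape below is an assumption, and `screenshot_with_retry` plus the two callables are hypothetical stand-ins for whatever invokes the real tools.

```python
from typing import Callable


def screenshot_with_retry(take_screenshot: Callable[[], dict],
                          init_vision: Callable[[], dict]) -> dict:
    """Take a screenshot; on vision_unavailable, init vision and retry once."""
    result = take_screenshot()
    # Assumed error shape; adjust to the actual JSON the tool returns.
    if result.get("error") != "vision_unavailable":
        return result
    init = init_vision()
    if not (init.get("started") or init.get("already_running")):
        raise RuntimeError("vision init did not report started or already_running")
    return take_screenshot()


# Stubs standing in for the real tool calls.
attempts = []


def fake_screenshot() -> dict:
    attempts.append("screenshot")
    if len(attempts) == 1:
        return {"error": "vision_unavailable"}
    return {"marks": []}


print(screenshot_with_retry(fake_screenshot, lambda: {"started": True}))
```

Passing the tool calls in as functions keeps the retry logic independent of whether the MCP or CLI path is in use.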

Common Commands

CLI form:

sh
somatic doctor
somatic vision init
somatic vision status
somatic screenshot --annotate              # returns annotated_path; use Read to view
somatic click <id>
somatic click <x,y>
somatic click-near <id> --dx 100 --dy 0
somatic type "text"
somatic hotkey ctrl s
somatic press enter
somatic scroll -4
somatic wait 1
somatic vision stop
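If you drive the CLI from a script, a thin wrapper along these lines may help. It is a sketch: `somatic_argv` and `run_json` are hypothetical helpers, appending --dry-run at the end of the argument list is an assumption about flag placement, and `run_json` assumes a zero exit status with JSON on stdout.

```python
import json
import subprocess
from typing import List, Sequence


def somatic_argv(*parts: object, dry_run: bool = False) -> List[str]:
    """Build an argument list such as ['somatic', 'click', '4']."""
    argv = ["somatic", *(str(p) for p in parts)]
    if dry_run:
        argv.append("--dry-run")  # placement at the end is an assumption
    return argv


def run_json(argv: Sequence[str]) -> dict:
    """Run the command and parse stdout as JSON (needs somatic on PATH)."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


print(somatic_argv("click", 4))
print(somatic_argv("click-near", 12, "--dx", 300, "--dy", 0))
print(somatic_argv("hotkey", "ctrl", "s", dry_run=True))
```

Building a list instead of a shell string avoids quoting problems with arguments such as typed text.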

MCP form (same names with underscores): vision_init, screenshot_annotated, click, click_near, type_text, hotkey, press, scroll, wait, vision_stop.

Headless mode

If headless_status (or somatic headless status) reports active: true, you are operating against a virtual desktop spun up via Xvfb. Everything in the operating loop above applies unchanged — clicks and screenshots simply target the virtual display instead of the real one.

Safety

SoMatic controls the real desktop session. Do not assume the active window is correct. Verify visible state with screenshots before typing, clicking destructive controls, submitting forms, deleting files, or changing system settings.

Install & Usage

1. Open your MCP config: ~/.claude.json
2. Add the server config: add the configuration to "mcpServers": { "somatic": { "command": "...", "args": [] } }
3. Restart Claude Code and run /mcp
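If you prefer to edit the config programmatically, the merge might look like the sketch below. `add_mcp_server` is a hypothetical helper and the `other` entry is made up for illustration; the "..." command is the listing's own placeholder and must be replaced with the real launch command, which this page does not state.

```python
import json


def add_mcp_server(config: dict, name: str, entry: dict) -> dict:
    """Return a copy of config with entry registered under mcpServers[name]."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    servers[name] = entry
    updated["mcpServers"] = servers
    return updated


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
# "..." is kept as a placeholder, exactly as in the listing.
merged = add_mcp_server(existing, "somatic", {"command": "...", "args": []})
print(json.dumps(merged, indent=2))
```

Copying the mcpServers mapping before writing to it leaves any servers already configured untouched.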

Security Audits

License: Unknown · Source: Warn · Repository: Pass

Frequently Asked Questions

What is SoMatic?

Agent-first CLI for native UI automation with Set-of-Marks screenshots. MCP server + headless Xvfb support included.

How to install SoMatic?

To install SoMatic, open your MCP config (~/.claude.json) and add the config to "mcpServers": { "somatic": { "command": "...", "args": [] } }. Finally, restart Claude Code and run /mcp.

What is SoMatic best for?

SoMatic is an MCP server categorized under General. It is tagged mcp and agent, and was created by Smyan1909.