Computer Use Best Practices — Anthropic Official Guide
What it is
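Anthropic's official production deployment guide for Claude Computer Use / Browser Use, covering resolution scaling, click accuracy, thinking-effort tuning, prompt injection defense, context management, and experimental tool configuration.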
Why it matters
Computer use agents interact with untrusted content by design. Every screenshot and webpage could contain adversarial instructions. Without deliberate engineering around resolution scaling, context management, and safety, production deployments fail on basic click accuracy or security gaps.
Key points
Resolution and click accuracy
- Pre-downscale screenshots before sending to API. The most common cause of poor click accuracy is sending native-resolution screenshots that exceed API limits and get silently downscaled.
- Claude 4.6 limits: max long edge 1568px, max total pixels 1.15MP.
- Opus 4.7 limits: max long edge 2576px, max total pixels 3.75MP.
- Recommended default: 1280x720 for 4.6 family; 1080p for Opus 4.7.
- Coordinate scaling is critical: scale API-returned coordinates back to native screen resolution before executing clicks.
- Content ordering: place text instruction before the image in the messages array.
- Model selection: Sonnet 4.6 tends to be more mechanically precise at clicking; Opus 4.7 narrows this gap with higher resolution budget.
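The pre-downscaling and coordinate-scaling steps above can be sketched as follows. This is a minimal illustration, not the guide's own code: the family labels `claude-4.6` and `opus-4.7` and both function names are hypothetical, and "1.15MP" / "3.75MP" are assumed to mean 1,150,000 and 3,750,000 pixels.

```python
import math

# Limits quoted in the notes above: (max long edge in px, max total pixels).
# The dictionary keys are hypothetical labels, not API model IDs.
LIMITS = {
    "claude-4.6": (1568, 1_150_000),
    "opus-4.7": (2576, 3_750_000),
}


def target_size(width, height, family):
    """Return the (width, height) to downscale to so both limits are met."""
    max_edge, max_pixels = LIMITS[family]
    scale = min(
        1.0,                                       # never upscale
        max_edge / max(width, height),             # long-edge limit
        math.sqrt(max_pixels / (width * height)),  # total-pixel limit
    )
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def to_native(x, y, sent_size, native_size):
    """Map a coordinate from the downscaled image back to the native screen."""
    sx = native_size[0] / sent_size[0]
    sy = native_size[1] / sent_size[1]
    return round(x * sx), round(y * sy)
```

A caller would resize the screenshot to `target_size(...)` with whatever imaging library it already uses, send that image, and pass every coordinate the model returns through `to_native(...)` before clicking.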
Thinking effort tuning
- Claude 4.6: `medium` is the sweet spot, with close to the highest success rate at roughly half the tokens of `high`. `low` is surprisingly strong for cost-sensitive workloads.
- Claude Opus 4.7: `high` achieves near-maximum success rate while using roughly half the tokens of `max`.
- Avoid `max` effort for computer use on 4.6 models: it gives no accuracy benefit over `high` while increasing cost. UI tasks are perceptual, not deeply logical.
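One way to encode this guidance is a small lookup helper. The function name and the family labels are hypothetical, and the helper only returns the effort string; how that value is passed to the API is not covered in these notes and is left out here.

```python
def pick_effort(family: str, cost_sensitive: bool = False) -> str:
    """Return a thinking-effort level following the guidance above."""
    if family == "opus-4.7":
        # Near-maximum success rate at roughly half the tokens of max.
        return "high"
    if family == "claude-4.6":
        # medium is the default; low is the cost-sensitive option.
        return "low" if cost_sensitive else "medium"
    raise ValueError(f"no guidance recorded for {family!r}")
```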
Prompt injection defense
- Training-time robustness: RL builds resistance directly into Claude's capabilities.
- Real-time classifiers: scan content entering context window and flag potential injection attempts.
- Built-in classifiers run automatically when using the official `computer_20251124` tool type, with zero additional latency and no extra cost.
- Best practices regardless of classifier use: human-in-the-loop for high-stakes actions, scope agent permissions narrowly, monitor and log all actions, treat all web content as untrusted.
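The human-in-the-loop and logging practices can be sketched as a thin wrapper around whatever function actually executes an action. Everything named here is an assumption for illustration: the `HIGH_STAKES` set, the action dictionary shape, and the `execute` / `approve` callbacks are hypothetical and would be replaced by the deployment's own definitions.

```python
import logging
from typing import Callable

logger = logging.getLogger("agent.actions")

# Hypothetical set of action names treated as high-stakes in this sketch.
HIGH_STAKES = {"submit_payment", "delete_file", "send_email"}


def guarded_execute(
    action: dict,
    execute: Callable[[dict], str],
    approve: Callable[[dict], bool],
) -> str:
    """Log every action; require human approval before high-stakes ones."""
    logger.info("requested action: %s", action)
    if action["name"] in HIGH_STAKES and not approve(action):
        logger.warning("action denied by reviewer: %s", action["name"])
        return "denied"
    result = execute(action)
    logger.info("action result: %s", result)
    return result
```

Keeping the gate outside the model loop means an injected instruction cannot talk its way past it: the approval decision never depends on text the agent has read.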
Context management for long-running agents
Screenshots accumulate fast: each consumes roughly 1,000–1,800 tokens. A 200k context window fills in well under 100 screenshots.
Three layers that compose cleanly:
- Cache breakpoints: one on stable prefix (system prompt), up to three on most recent tool results. Spreading breakpoints across recent positions gives graceful degradation.
- Cache-aware rolling buffer: keep most recent `keep_n=3` screenshots; when total exceeds `keep_n + interval=25`, replace oldest `interval` screenshots with placeholders in a single pass. Prefix stays byte-identical between prunes.
- LLM-based compaction: summarize conversation history before discarding. Critical sections: user instructions (verbatim), task template, constraints, actions taken, errors and fixes, progress tracking, current state, next step.
Server-side compaction (beta): pass a custom summarization prompt as the `instructions` parameter in `context_management`. Set `pause_after_compaction` to attach the most recent messages across compaction events. Mirror the server's truncation on the client to keep both views aligned.
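A custom summarization prompt covering the critical sections listed above could be assembled like this. The wording of the prompt and the function name are illustrative assumptions; the resulting string is the kind of value that would be supplied as `instructions`, and the surrounding request shape is deliberately not shown because these notes do not specify it.

```python
# Critical sections named in the notes above, in the order listed there.
SECTIONS = [
    "User instructions (verbatim)",
    "Task template",
    "Constraints",
    "Actions taken",
    "Errors and fixes",
    "Progress tracking",
    "Current state",
    "Next step",
]


def build_compaction_instructions(sections=SECTIONS):
    """Return a summarization prompt that asks for each section in order."""
    lines = [
        "Summarize the conversation so far so the task can continue without it.",
        "Use exactly these headed sections, in this order:",
    ]
    lines += [f"{i}. {name}" for i, name in enumerate(sections, start=1)]
    lines.append("Quote the user's instructions verbatim; do not paraphrase them.")
    return "\n".join(lines)
```

The same string can drive client-side LLM-based compaction, which keeps the summary format identical whichever side performs it.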
Experimental settings
- Batch tools (`computer_batch`, `browser_batch`): execute multiple sub-actions in a single tool call. Use when sub-actions are self-contained and don't depend on each other's visual outcomes. Avoid in exploratory navigation or error-recovery sequences.
- Advisor tool: pairs executor model with higher-intelligence advisor model for strategic guidance mid-generation. Useful when most turns are mechanical but occasional planning moments need Opus-level reasoning. Clean up orphaned advisor blocks when disabling the tool.
- Teach Mode: record human performing a workflow (screenshots, actions, optional voice), then replay as context. Not strict replay — Claude adapts to UI changes. Supports strict, adaptive, and goal-oriented playback modes.
Open questions
- How do batch tools affect error recovery when one sub-action in a batch misses its target?
- Does the advisor tool pattern generalize beyond computer use to other long-horizon agent tasks?
- What is the cost breakpoint where server-side compaction pays for itself vs rolling buffer alone?
Prompts for witness
- What is the most frequent failure mode in your computer use integration, and which layer (resolution, thinking effort, context management, safety) would fix it?
- If you had to set a single "taste standard" for agent-generated UI interactions, what would it be?
Related
- harness-engineering/browser-harness — Self-healing CDP browser agent architecture
- harness-engineering/agent-memory-vs-context-substrate — Context management paradigms
- claude-code/overview — Claude Code ecosystem