Computer Use Best Practices — Anthropic Official Guide

What it is

Anthropic 官方发布的 Claude Computer Use / Browser Use 生产部署指南，涵盖分辨率缩放、点击精度、思考力度调优、提示注入防御、上下文管理和实验性工具配置。

Why it matters

Computer use agents interact with untrusted content by design. Every screenshot and webpage could contain adversarial instructions. Without deliberate engineering around resolution scaling, context management, and safety, production deployments fail on basic click accuracy or security gaps.

Key points

Resolution and click accuracy

Pre-downscale screenshots before sending to API. The most common cause of poor click accuracy is sending native-resolution screenshots that exceed API limits and get silently downscaled.
Claude 4.6 limits: max long edge 1568px, max total pixels 1.15MP.
Opus 4.7 limits: max long edge 2576px, max total pixels 3.75MP.
Recommended default: 1280x720 for 4.6 family; 1080p for Opus 4.7.
Coordinate scaling is critical: scale API-returned coordinates back to native screen resolution before executing clicks.
Content ordering: place text instruction before the image in the messages array.
Model selection: Sonnet 4.6 tends to be more mechanically precise at clicking; Opus 4.7 narrows this gap with higher resolution budget.

Thinking effort tuning

Claude 4.6: medium is the sweet spot — close to highest success rate at roughly half the tokens of high. low is surprisingly strong for cost-sensitive workloads.
Claude Opus 4.7: high achieves near-maximum success rate while using roughly half the tokens of max.
Avoid max effort for computer use on 4.6 models — no accuracy benefit over high while increasing cost. UI tasks are perceptual, not deeply logical.

Prompt injection defense

Training-time robustness: RL builds resistance directly into Claude's capabilities.
Real-time classifiers: scan content entering context window and flag potential injection attempts.
Built-in classifiers run automatically when using official computer_20251124 tool type — zero additional latency, no extra cost.
Best practices regardless of classifier use: human-in-the-loop for high-stakes actions, scope agent permissions narrowly, monitor and log all actions, treat all web content as untrusted.

Context management for long-running agents

Screenshots accumulate fast: each consumes roughly 1,000–1,800 tokens. A 200k context window fills in well under 100 screenshots.

Three layers that compose cleanly:

Cache breakpoints: one on stable prefix (system prompt), up to three on most recent tool results. Spreading breakpoints across recent positions gives graceful degradation.
Cache-aware rolling buffer: keep most recent keep_n=3 screenshots; when total exceeds keep_n + interval=25, replace oldest interval screenshots with placeholders in a single pass. Prefix stays byte-identical between prunes.
LLM-based compaction: summarize conversation history before discarding. Critical sections: user instructions (verbatim), task template, constraints, actions taken, errors and fixes, progress tracking, current state, next step.

Server-side compaction (beta): pass custom summarization prompt as instructions parameter in context_management. Set pause_after_compaction to attach most recent messages across events. Mirror server truncation on client to keep views aligned.

Experimental settings

Batch tools (computer_batch, browser_batch): execute multiple sub-actions in single tool call. Use when sub-actions are self-contained and don't depend on each other's visual outcomes. Avoid in exploratory navigation or error-recovery sequences.
Advisor tool: pairs executor model with higher-intelligence advisor model for strategic guidance mid-generation. Useful when most turns are mechanical but occasional planning moments need Opus-level reasoning. Cleanup orphaned advisor blocks when disabling the tool.
Teach Mode: record human performing a workflow (screenshots, actions, optional voice), then replay as context. Not strict replay — Claude adapts to UI changes. Supports strict, adaptive, and goal-oriented playback modes.

Open questions

How do batch tools affect error recovery when one sub-action in a batch misses its target?
Does the advisor tool pattern generalize beyond computer use to other long-horizon agent tasks?
What is the cost breakpoint where server-side compaction pays for itself vs rolling buffer alone?

Prompts for witness

What is the most frequent failure mode in your computer use integration, and which layer (resolution, thinking effort, context management, safety) would fix it?
If you had to set a single "taste standard" for agent-generated UI interactions, what would it be?

harness-engineering/browser-harness — Self-healing CDP browser agent architecture
harness-engineering/agent-memory-vs-context-substrate — Context management paradigms
claude-code/overview — Claude Code ecosystem

Computer Use Best Practices — Anthropic Official Guide

Computer Use Best Practices — Anthropic Official Guide

What it is

Why it matters

Key points

Resolution and click accuracy

Thinking effort tuning

Prompt injection defense

Context management for long-running agents

Experimental settings

Open questions

Prompts for witness

Sources

Evolution

Derived from source material

Linked from

Computer Use Best Practices — Anthropic Official Guide

What it is

Why it matters

Key points

Resolution and click accuracy

Thinking effort tuning

Prompt injection defense

Context management for long-running agents

Experimental settings

Open questions

Prompts for witness

Related

Sources

Evolution

Derived from source material

Linked from