By the end of this tutorial, you’ll have a working multi-step web automation workflow powered by Claude computer use automation — one that can navigate real browser UIs, fill forms, extract structured data across pages, and recover gracefully when things break. We’re using Claude’s vision capabilities plus Playwright to drive a real browser, not a headless scraper that collapses the moment a site adds a cookie banner.
- Install dependencies — Set up Anthropic SDK, Playwright, and screenshot tooling
- Capture and send screenshots — Build the vision loop that feeds browser state to Claude
- Parse Claude’s action output — Translate model responses into Playwright commands
- Build the multi-step loop — Chain actions with state tracking and retries
- Add error recovery and fallback logic — Handle popups, CAPTCHAs, and stale elements
- Optimize cost per run — Screenshot resolution tuning and action batching
Why Claude’s Computer Use Beats Traditional Scraping for Complex Flows
If you can write XPath selectors that survive a site redesign, do that. It’s faster, cheaper, and more reliable. But the moment you hit a multi-step workflow — log in, navigate nested menus, filter a data table, export a CSV, repeat for 40 accounts — those selectors start rotting within weeks. A login flow with a dynamic token, a React table that re-renders on scroll, a form that conditionally shows fields: these are where Claude computer use automation earns its keep.
Claude doesn’t parse HTML. It looks at a screenshot of what the browser is actually rendering and decides what to click next, exactly like a human would. That makes it robust to DOM changes that would break a traditional scraper overnight.
The tradeoff is cost and speed. Each screenshot-to-action cycle burns tokens. A typical step costs roughly $0.003–0.008 with claude-3-5-sonnet-20241022 depending on image resolution — so a 20-step workflow runs you around $0.06–0.16 per execution. That’s fine for anything that used to require a human; it’s not fine if you’re scraping 50,000 pages a day.
Step 1: Install Dependencies
```bash
# Python 3.10+
pip install anthropic playwright pillow

# Install browser binaries
playwright install chromium
```
You’ll need an Anthropic API key with computer use enabled — as of mid-2025 this is available on all paid tiers but you need to explicitly pass the betas=["computer-use-2024-10-22"] flag. Don’t skip that; the API returns a 400 without it.
Step 2: Capture and Send Screenshots
The core loop is simple: take a screenshot, send it to Claude with a task description, get back an action, execute the action, repeat.
```python
import anthropic
import base64
import io

from PIL import Image
from playwright.async_api import async_playwright

DISPLAY_WIDTH = 1280
DISPLAY_HEIGHT = 800

async def take_screenshot(page) -> str:
    """Capture screenshot and return as base64 string."""
    screenshot_bytes = await page.screenshot(
        full_page=False,  # viewport only — full_page inflates token cost dramatically
        type="jpeg",
        quality=75  # JPEG at 75% is readable but ~40% smaller than PNG
    )
    # Resize if needed — 1280x800 is the sweet spot for Claude's vision
    img = Image.open(io.BytesIO(screenshot_bytes))
    img = img.resize((DISPLAY_WIDTH, DISPLAY_HEIGHT), Image.LANCZOS)
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=75)
    return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
```
The JPEG at 75% quality tip alone can cut your image token cost by 30–40% with no visible loss in Claude’s ability to identify UI elements. PNG screenshots are beautiful and expensive. Don’t use them here.
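Anthropic's docs give a rough rule of thumb of (width × height) / 750 tokens per image, which makes resolution the biggest single dial on per-step cost. A quick sanity check (the helper name is ours, the formula is the documented approximation):

```python
def estimate_image_tokens(width: int, height: int) -> int:
    """Rough image token estimate per Anthropic's (w * h) / 750 rule of thumb."""
    return int(width * height / 750)

viewport_tokens = estimate_image_tokens(1280, 800)   # the resolution used above
full_hd_tokens = estimate_image_tokens(1920, 1080)   # a 1080p screenshot

print(viewport_tokens)  # 1365
print(full_hd_tokens)   # 2764 — about twice the viewport cost, per screenshot
```

Multiply that difference across 20 steps and the case for downsizing to 1280×800 makes itself.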
Step 3: Parse Claude’s Action Output
Claude’s computer use responses come back as structured tool calls. You need to map those to Playwright commands.
```python
async def execute_action(page, action: dict) -> str:
    """Execute a single computer use action from Claude's response."""
    action_type = action["type"]

    if action_type == "screenshot":
        return await take_screenshot(page)  # Claude requesting a fresh look
    elif action_type == "mouse_move":
        await page.mouse.move(action["coordinate"][0], action["coordinate"][1])
    elif action_type == "left_click":
        await page.mouse.click(action["coordinate"][0], action["coordinate"][1])
        try:
            await page.wait_for_load_state("networkidle", timeout=5000)  # Wait for any navigation
        except Exception:
            pass  # No navigation, or the page never went idle — either way, carry on
    elif action_type == "type":
        await page.keyboard.type(action["text"], delay=50)  # Small delay mimics human typing
    elif action_type == "key":
        await page.keyboard.press(action["key"])
    elif action_type == "scroll":
        await page.mouse.wheel(0, action.get("direction", 1) * 300)
    else:
        raise ValueError(f"Unknown action type: {action_type}")

    return await take_screenshot(page)  # Always return updated state
```
Step 4: Build the Multi-Step Automation Loop
This is the actual orchestrator. It maintains conversation history (so Claude remembers what it’s already done), sends each screenshot, and processes the returned actions.
```python
async def run_automation(task: str, start_url: str, max_steps: int = 30):
    """
    Main automation loop.

    task: natural language description of the goal
    start_url: where to begin
    max_steps: hard ceiling to prevent infinite loops (and runaway cost)
    """
    client = anthropic.Anthropic()
    messages = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)  # headless=True in production
        page = await browser.new_page(viewport={"width": DISPLAY_WIDTH, "height": DISPLAY_HEIGHT})
        await page.goto(start_url)
        await page.wait_for_load_state("networkidle")
        screenshot_b64 = await take_screenshot(page)

        # Initial message with the task and first screenshot
        messages.append({
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/jpeg", "data": screenshot_b64}
                },
                {"type": "text", "text": f"Task: {task}\n\nYou can see the current browser state. Complete the task step by step."}
            ]
        })

        for step in range(max_steps):
            response = client.beta.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                tools=[{
                    "type": "computer_20241022",
                    "name": "computer",
                    "display_width_px": DISPLAY_WIDTH,
                    "display_height_px": DISPLAY_HEIGHT,
                }],
                messages=messages,
                betas=["computer-use-2024-10-22"],
                system="You are a browser automation agent. Complete the given task efficiently. When the task is done, output TASK_COMPLETE followed by a JSON summary of what you accomplished."
            )

            # Check for completion signal
            for block in response.content:
                if hasattr(block, "text") and "TASK_COMPLETE" in block.text:
                    print(f"Completed in {step + 1} steps")
                    return extract_result(block.text)

            # Process tool use actions
            tool_results = []
            for block in response.content:
                if block.type == "tool_use" and block.name == "computer":
                    new_screenshot = await execute_action(page, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": [{
                            "type": "image",
                            "source": {"type": "base64", "media_type": "image/jpeg", "data": new_screenshot}
                        }]
                    })

            # Add assistant response and tool results to history
            messages.append({"role": "assistant", "content": response.content})
            if tool_results:
                messages.append({"role": "user", "content": tool_results})

    raise RuntimeError(f"Task did not complete within {max_steps} steps")
```
The max_steps hard ceiling is non-negotiable in production. Without it, a confused model will happily click in circles for 200 iterations and bill you $15 for a task that should cost $0.10. This is part of broader error handling and fallback logic for production Claude agents that every serious deployment needs.
Step 5: Add Error Recovery and Fallback Logic
The happy path works. What actually breaks in production:
Cookie Banners and Modal Popups
Claude will often try to dismiss these correctly, but you can also pre-empt them with Playwright before starting the loop:
```python
async def dismiss_common_overlays(page):
    """Try common cookie/popup dismissal before handing to Claude."""
    selectors = [
        "button[id*='accept']", "button[id*='cookie']",
        "button[class*='consent']", "[aria-label*='Accept']",
        "button:has-text('Accept')", "button:has-text('I agree')"
    ]
    for selector in selectors:
        try:
            element = await page.query_selector(selector)
            if element and await element.is_visible():
                await element.click()
                await page.wait_for_timeout(500)
                return True
        except Exception:
            continue
    return False
```
Stale Screenshots After Navigation
If Claude clicks a link and the page navigates, the next screenshot needs to wait for the new page to settle. Wrap execute_action in a navigation guard:
```python
async def safe_execute_action(page, action: dict, timeout: int = 8000) -> str:
    try:
        await execute_action(page, action)
        # Wait for any pending navigation to settle
        await page.wait_for_load_state("domcontentloaded", timeout=timeout)
        await page.wait_for_timeout(800)  # Extra buffer for JS-heavy SPAs
        return await take_screenshot(page)
    except Exception as e:
        # Return screenshot of current state so Claude can assess the error
        print(f"Action failed: {e}. Sending current state to Claude.")
        return await take_screenshot(page)
```
The principle here mirrors what’s described in our guide on LLM fallback and retry logic for production — when an action fails, you don’t crash, you show the model the current state and let it self-correct.
Claude Getting Confused Mid-Workflow
If the model starts clicking the same element repeatedly or produces no tool calls for two consecutive turns, reset with an explicit clarification message:
```python
def detect_loop(messages: list, lookback: int = 4) -> bool:
    """Detect if Claude is stuck in a repetitive action pattern."""
    if len(messages) < lookback * 2:
        return False
    recent_actions = []
    for msg in messages[-lookback * 2:]:
        if isinstance(msg.get("content"), list):
            for block in msg["content"]:
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    recent_actions.append(str(block.get("input", {})))
    # If the last 3 recorded actions are identical, we're looping
    if len(recent_actions) >= 3:
        return len(set(recent_actions[-3:])) == 1
    return False
```
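`detect_loop` only flags the problem; you still need the intervention. A minimal sketch of the clarification nudge (the wording is illustrative, not from the Anthropic docs). It extends the last user turn rather than appending a new message, since the Messages API expects user and assistant roles to alternate:

```python
CLARIFICATION = (
    "You appear to be repeating the same action without making progress. "
    "Stop, take a fresh screenshot, and reconsider your approach. "
    "Reminder of the goal: {task}"
)

def inject_clarification(messages: list, task: str) -> list:
    """Nudge Claude out of a repetitive pattern with an explicit text block."""
    block = {"type": "text", "text": CLARIFICATION.format(task=task)}
    if messages and messages[-1].get("role") == "user":
        content = messages[-1]["content"]
        if isinstance(content, list):
            content.append(block)  # fold into the existing user turn
        else:
            messages[-1]["content"] = [{"type": "text", "text": content}, block]
    else:
        messages.append({"role": "user", "content": [block]})
    return messages
```

Call it inside the step loop, right before the next `messages.create`, whenever `detect_loop(messages)` returns True.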
Step 6: Optimize Cost Per Run
Three levers matter most for Claude computer use automation cost:
- Image resolution: 1280×800 JPEG at 75% quality is the sweet spot. Going to 1920×1080 PNG roughly triples image token cost with minimal benefit for most UIs.
- Conversation pruning: After 10+ turns, the message history gets expensive. Prune tool result images from older turns — keep only the last 3 screenshots in context. Claude doesn’t need to see where it was 8 clicks ago.
- Task decomposition: Break long workflows into logical segments with separate API sessions. Logging in and landing on the dashboard → one session. Extracting data from the dashboard → fresh session starting from a screenshot of the authenticated state.
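The pruning lever can be sketched as follows: drop image blocks from all but the most recent user turns, replacing them with a cheap text stub so the turn structure stays intact. The `keep_last` default and stub wording are arbitrary choices, not an Anthropic convention:

```python
def prune_old_screenshots(messages: list, keep_last: int = 3) -> list:
    """Replace image blocks in older user turns with a text stub, in place."""
    stub = {"type": "text", "text": "[screenshot pruned to save tokens]"}
    user_turns = [
        m for m in messages
        if m.get("role") == "user" and isinstance(m.get("content"), list)
    ]
    for msg in user_turns[:-keep_last]:
        for item in msg["content"]:
            # Images live either at the top level or inside tool_result blocks
            if item.get("type") == "image":
                item.clear()
                item.update(stub)
            elif item.get("type") == "tool_result":
                item["content"] = [
                    dict(stub) if b.get("type") == "image" else b
                    for b in item.get("content", [])
                ]
    return messages
```

Run it on `messages` before each API call once the history grows past a few turns; the assistant turns (which carry the model's reasoning) stay untouched.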
For high-volume use cases where you’re running similar workflows repeatedly, consider whether you actually need vision for every step. A hybrid approach — Playwright selectors for known-stable elements, Claude vision only for the ambiguous parts — can cut cost by 60–70%.
Running a Real Workflow: Competitor Pricing Extraction
Here’s how you’d invoke the full automation for a realistic task:
```python
import asyncio

async def main():
    result = await run_automation(
        task="""
        Navigate to the pricing page. Extract all plan names, monthly prices,
        and feature lists. If there's an annual pricing toggle, click it and
        also capture annual prices. Output the data as JSON when done.
        """,
        start_url="https://example-saas.com/pricing",
        max_steps=20
    )
    print(result)

asyncio.run(main())
```
This kind of automated competitive intelligence workflow pairs well with structured data pipelines — once you’ve extracted the pricing, you can feed it into the same pattern we cover in our AI-powered competitor monitoring guide.
Common Errors
Error: “Could not find element at coordinates”
Cause: Claude is targeting coordinates from a previous screenshot that no longer reflect the current page state — often caused by dynamic content shifting the layout.
Fix: Send Claude a fresh screenshot after every action so the coordinates it returns reflect the current layout. Never reuse coordinates across multiple actions.
Error: API returns 400 “computer use not enabled”
Cause: Missing the beta header.
Fix: Add betas=["computer-use-2024-10-22"] to every messages.create call. This isn’t inherited from a client-level config.
Error: Workflow exceeds max_steps without completing
Cause: Usually one of three things — a modal blocking the target element, a page that requires login state Claude doesn’t have, or a task description vague enough that Claude can’t determine when it’s done.
Fix: Run with headless=False and watch what the browser is doing. Add explicit completion criteria to the task description: “When you have extracted prices for all plans, output TASK_COMPLETE and the JSON.” Vague tasks produce vague stopping conditions.
What to Build Next
The natural extension here is a supervised multi-agent pipeline where one Claude instance handles the browser automation and a second instance validates the extracted data against a schema — catching cases where the first agent returned a partial result or hallucinated a price. This maps directly to the structured output verification patterns in our guide on reducing LLM hallucinations in production. Add a Postgres or Supabase table to persist results between runs and you have a continuously-updating competitive intelligence feed that costs a few cents per competitor per week.
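The validator's cheapest first line of defense doesn't need a second model at all: a plain schema check catches most partial results before you spend tokens on a verification pass. The field names here are assumptions about the extraction format, not a fixed contract:

```python
def validate_pricing_result(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the result passes."""
    problems = []
    plans = result.get("plans")
    if not isinstance(plans, list) or not plans:
        return ["missing or empty 'plans' list"]
    for i, plan in enumerate(plans):
        for field in ("name", "monthly_price", "features"):
            if field not in plan:
                problems.append(f"plan {i}: missing '{field}'")
        price = plan.get("monthly_price")
        if isinstance(price, (int, float)) and not 0 <= price <= 10_000:
            problems.append(f"plan {i}: implausible price {price}")
    return problems
```

Route anything with a non-empty problem list to the second Claude instance (or back to the extraction agent) instead of persisting it.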
Bottom Line: Who Should Use This Approach
Solo founders and small teams: Use this for workflows that currently require manual browser work — form submissions, portal-based data pulls, login-gated pages. Even at $0.10–0.20 per run it’s worth it if you’re replacing 15 minutes of human time.
Budget-conscious builders: Start with the hybrid approach. Use Playwright selectors for stable elements, drop into Claude vision only when the UI is dynamic or you hit an obstacle. Profile which steps actually need vision and which don’t.
Enterprise teams: Claude computer use automation is production-viable now, but you need the observability layer — log every screenshot, every action, every cost. Without that, debugging failures in unattended runs is painful. Tools like Langfuse or Helicone can attach to the API calls for tracing.
The Claude tool use with Python guide is the right next read if you want to extend this with custom tools that Claude can call alongside the browser — fetching from internal APIs, writing to databases, or triggering webhooks as part of the same workflow.
Frequently Asked Questions
How much does Claude computer use cost per automation run?
At current claude-3-5-sonnet-20241022 pricing, each screenshot-action cycle costs roughly $0.003–0.008 depending on image resolution. A 20-step workflow typically runs $0.06–0.16. Using JPEG compression at 75% and pruning conversation history after 10+ turns can reduce that by 30–40%.
Can Claude computer use handle login-protected websites?
Yes — Claude can fill in username and password fields and complete standard login flows, including those with OTP if you inject the code programmatically. For sites with CAPTCHA on login, you’ll need a CAPTCHA solving service (2captcha, Anti-Captcha) as a pre-step before handing control to Claude.
What’s the difference between Claude computer use and a traditional Playwright scraper?
Playwright scrapers use CSS/XPath selectors tied to the DOM structure — they’re faster and cheaper but break whenever the site’s markup changes. Claude computer use operates on rendered screenshots, so it’s resilient to DOM changes but slower (2–8 seconds per action) and costs API tokens per step. Use Playwright for stable, high-volume tasks and Claude vision for complex, unpredictable workflows.
How do I prevent the automation from running indefinitely?
Set a hard max_steps ceiling in your loop (20–30 is reasonable for most workflows) and include an explicit completion signal in your system prompt — something like “when done, output TASK_COMPLETE.” Also implement loop detection that checks whether the last 3–4 actions are identical, which usually indicates the model is stuck.
Does this work with headless browsers?
Yes, set headless=True in p.chromium.launch() for production. During development, run with headless=False so you can watch what Claude is doing — it makes debugging significantly faster. Some sites detect headless browsers; if you hit that, add args=["--disable-blink-features=AutomationControlled"] to the launch options.
Can I run multiple automation sessions in parallel?
Yes, but watch your Anthropic API rate limits — concurrent requests count toward your tokens-per-minute ceiling. Playwright supports multiple browser contexts in a single process, so you can run 3–5 parallel sessions efficiently. Beyond that, spawn separate processes to avoid event loop contention.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

