Webwright
A tier
New this week
Microsoft Research's open-source web agent framework that turns your coding model into a browser-automating power user by giving it a terminal and Playwright instead of a clicky GUI.
Kai's verdict
Webwright's code-first philosophy is genuinely clever — treating browser sessions as disposable artifacts and scripts as the real output is a smarter abstraction than most click-replay agents, and the benchmark gains back that up. That said, this is a research repo, not a product, so expect rough edges and bring your own API budget. (Verdict pending Phi's full review.)
Strengths
- Code-over-clicks architecture: writes reusable Playwright scripts instead of fragile pixel-level click predictions, dramatically cutting error accumulation on long-horizon tasks
- SOTA benchmark numbers: 86.7% on Online-Mind2Web and 60.1% on Odysseys with GPT-5.4, the highest score among open-source harnesses in the Online-Mind2Web AutoEval category
- Tiny, auditable codebase (~1,000 lines across 3 modules) with no hidden orchestration layers — easy to fork, debug, and extend
- Reusable CLI tool output: completed task scripts can be parameterized, exported, and shared across Claude Code, Codex, and other agents
- Works with smaller models: Qwen3.5-9B hits 66.2% on the hard Online-Mind2Web split when paired with pre-built tool scripts
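To make the "scripts as reusable CLI tools" idea concrete, here is a minimal sketch of what a parameterized Playwright script could look like. It is not taken from the Webwright repo: the task (fetching a page title), the flag names, and the function names `build_parser`, `fetch_title`, and `main` are all assumptions chosen for illustration. Only the Playwright calls themselves (`sync_playwright`, `chromium.launch`, `new_page`, `goto`, `title`) are the library's real sync API.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI surface for the script: the task's inputs become flags (hypothetical names)."""
    parser = argparse.ArgumentParser(description="Fetch the title of a web page.")
    parser.add_argument("--url", required=True, help="Page to open")
    parser.add_argument("--timeout-ms", type=int, default=30_000,
                        help="Navigation timeout in milliseconds")
    return parser


def fetch_title(url: str, timeout_ms: int) -> str:
    """Open the page in headless Chromium and return its <title> text."""
    # Imported lazily so argument parsing still works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            return page.title()
        finally:
            # The browser session is disposable; the script is the artifact worth keeping.
            browser.close()


def main(argv=None) -> int:
    """Parse flags, run the browser task, print the result."""
    args = build_parser().parse_args(argv)
    print(fetch_title(args.url, args.timeout_ms))
    return 0
```

To run it from a terminal, you would add an `if __name__ == "__main__": raise SystemExit(main())` guard and call the file with `--url`. Because the inputs are flags rather than hard-coded values, another agent or a shell script could call the same file without replaying any clicks.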
Weaknesses
- Developer-only: purely terminal/CLI — no GUI, no hosted service, requires local setup with Python, Playwright, and your own LLM API keys
- Token costs can be substantial: long coding trajectories require context compaction every 20 steps, and complex tasks can run 50–100 steps deep
- Early-stage research release: minimal documentation, no managed hosting, and real-world reliability outside benchmark conditions is unproven
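For readers wondering what "context compaction every 20 steps" means in practice, the sketch below shows one simple way such a loop could work. Only the 20-step interval comes from the review; the history format, the `keep_recent` window, and the placeholder `naive_summary` function are assumptions, and a real agent would presumably ask the LLM to write the summary instead.

```python
from typing import Callable, List

COMPACT_EVERY = 20  # interval quoted in the review; everything else here is assumed


def naive_summary(messages: List[str]) -> str:
    """Placeholder summarizer; a real agent would likely call the LLM here."""
    return f"[summary of {len(messages)} earlier steps]"


def run_steps(n_steps: int, keep_recent: int = 5,
              summarize: Callable[[List[str]], str] = naive_summary) -> List[str]:
    """Simulate an agent loop whose history is compacted every COMPACT_EVERY steps."""
    history: List[str] = []
    for step in range(1, n_steps + 1):
        history.append(f"step {step}: action and observation")
        if step % COMPACT_EVERY == 0:
            # Collapse everything except the most recent entries into one summary line.
            cut = len(history) - keep_recent
            older, recent = history[:cut], history[cut:]
            history = [summarize(older)] + recent
    return history
```

With these defaults, a 45-step run ends with 11 history entries instead of 45. That is the trade-off the weakness above points at: compaction caps context growth, but each summary is itself an extra model call in a real system.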
Best for
Developers and researchers who want a minimal, hackable web agent harness they can actually read and modify — especially if they already use Claude Code or Codex and want to bolt on serious browser automation.
Pricing
Free (open-source)
MIT-licensed; you pay only for the LLM API tokens you consume (OpenAI, Anthropic, or OpenRouter backends supported).