Browser Automation

Agents can browse the web, fill forms, download files, and ask humans for help.

Overview

YokeBot agents have full browser automation capabilities powered by Playwright. Agents can navigate websites, fill forms, click buttons, download files, take screenshots, and complete multi-step online tasks autonomously. When a step is ambiguous, they ask the human team for guidance.

The browser is integrated directly into the Workspace as a tab alongside files, data tables, and the video editor. You can observe agent browsing in real time or take control of the browser yourself.

Two Modes

Mode	Who Drives	Use Case
Agent Browser	Agent drives, human observes.	Watch agents complete online tasks autonomously.
Take Control	Human drives, agent observes.	Record a login, intervene mid-task, or browse manually.

Agent Browser Tools

Agents have access to 10+ browser tools during their heartbeat cycle, including:

Tool	Description
browser_navigate	Go to a URL. Includes SSRF protection against private IPs and DNS rebinding.
browser_snapshot	Get an accessibility snapshot of the current page for understanding page structure.
browser_click	Click an element by CSS selector or pixel coordinates.
browser_type	Type text into the currently focused input field.
browser_press_key	Press a keyboard key (Enter, Tab, Escape, etc.).
browser_select_option	Select an option from a dropdown menu.
browser_screenshot	Capture a screenshot of the current page state.
browser_fill_form	Fill multiple form fields at once from a structured list.
browser_download_file	Download a file and save it to the team workspace.
browser_ask_human	Ask the human a question with optional multiple-choice answers.

Ask the Human

When an agent encounters ambiguity while browsing (a form field it cannot fill, a choice it cannot make, a CAPTCHA, or a decision that requires business context), it calls browser_ask_human. The tool then:

Captures a screenshot of the current browser state.
Creates an approval request with the question, optional answer choices, and context.
Posts a message in team chat with the screenshot and a link to respond.
Keeps the browser session open (extended idle timeout) while waiting.
Returns the human's answer to the agent, which then continues browsing.

Agent: "Which shipping option should I select?"
Options: Standard ($5.99), Express ($12.99), Overnight ($24.99)
Context: "I'm on the checkout page at example.com ordering the widgets you requested."
[screenshot attached]
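
The request above can be pictured as a small data structure. The sketch below is illustrative only: `AskHumanRequest`, its field names, and `format_chat_message` are hypothetical and are not taken from YokeBot's actual schema. It shows how a question, optional choices, context, and a screenshot reference might be rendered into a chat message like the one above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AskHumanRequest:
    """Hypothetical shape of a browser_ask_human approval request."""
    question: str
    context: str = ""
    options: List[str] = field(default_factory=list)  # optional multiple-choice answers
    screenshot_path: Optional[str] = None  # screenshot taken when the question is asked


def format_chat_message(req: AskHumanRequest) -> str:
    """Render the request as a plain-text team chat message."""
    lines = [f'Agent: "{req.question}"']
    if req.options:
        lines.append("Options: " + ", ".join(req.options))
    if req.context:
        lines.append(f'Context: "{req.context}"')
    if req.screenshot_path:
        lines.append("[screenshot attached]")
    return "\n".join(lines)


request = AskHumanRequest(
    question="Which shipping option should I select?",
    options=["Standard ($5.99)", "Express ($12.99)", "Overnight ($24.99)"],
    context="I'm on the checkout page at example.com ordering the widgets you requested.",
    screenshot_path="checkout.png",  # placeholder file name
)
print(format_chat_message(request))
```

Leaving `options` empty would turn the request into a free-text question, which matches the "optional multiple-choice answers" wording in the tool description.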

Form Filling

The browser_fill_form tool lets agents populate multiple form fields in a single action:

{
  "fields": [
    { "selector": "#name", "value": "Jane Smith" },
    { "selector": "#email", "value": "jane@example.com" },
    { "selector": "#company", "value": "Acme Corp" }
  ],
  "submit": false
}

Set submit to true to automatically click the submit button after filling all fields.
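
As a rough sketch of how such a payload could be applied, the function below walks the `fields` list and calls Playwright-style `fill(selector, value)` and `click(selector)` methods, which exist on Playwright's Python `Page` object. The function name `apply_fill_form`, the `RecordingPage` stand-in, and the default `submit_selector` are assumptions made for illustration; the documentation does not say how the real tool locates the submit button.

```python
import json
from typing import Any, Dict, List


def apply_fill_form(page: Any, payload: Dict[str, Any],
                    submit_selector: str = "button[type=submit]") -> List[str]:
    """Fill each field in order, then optionally click submit.

    `page` is any object exposing Playwright-style fill() and click().
    `submit_selector` is an assumed default, not documented behavior.
    Returns the selectors that were filled.
    """
    filled = []
    for entry in payload.get("fields", []):
        page.fill(entry["selector"], entry["value"])
        filled.append(entry["selector"])
    if payload.get("submit", False):
        page.click(submit_selector)
    return filled


class RecordingPage:
    """Stand-in for a real page so the sketch runs without a browser."""

    def __init__(self) -> None:
        self.actions: List[tuple] = []

    def fill(self, selector: str, value: str) -> None:
        self.actions.append(("fill", selector, value))

    def click(self, selector: str) -> None:
        self.actions.append(("click", selector))


payload = json.loads("""
{
  "fields": [
    { "selector": "#name", "value": "Jane Smith" },
    { "selector": "#email", "value": "jane@example.com" }
  ],
  "submit": true
}
""")
page = RecordingPage()
apply_fill_form(page, payload)
print(page.actions)
```

Keeping `submit` false by default means the agent can take a snapshot or screenshot to verify the filled values before anything is sent.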

Live Viewing

When an agent is actively browsing, you can watch in real time from the Workspace browser tab. Screenshots are streamed at about 2 fps via Server-Sent Events (SSE). From the live view, you can:

See exactly what the agent sees in the browser.
Switch to Take Control mode to intervene or assist.
Navigate to a different URL using the address bar.
Save the current login state to the Session Vault.
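
To make the streaming mechanism concrete, here is a minimal sketch of how screenshots might be framed as Server-Sent Events at roughly 2 fps. An SSE message is a set of `field: value` lines terminated by a blank line. The event name `screenshot`, the base64 encoding of the image bytes, and the function names are assumptions; the actual wire format used by the live view is not documented here.

```python
import base64
import time
from typing import Callable, Iterator


def sse_event(data: str, event: str = "screenshot") -> str:
    """Format one SSE message: field lines followed by a blank line."""
    return f"event: {event}\ndata: {data}\n\n"


def stream_screenshots(capture: Callable[[], bytes], frames: int,
                       fps: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield `frames` SSE messages, pausing 1/fps seconds between captures."""
    for i in range(frames):
        if i:  # no pause before the first frame
            sleep(1.0 / fps)
        encoded = base64.b64encode(capture()).decode("ascii")
        yield sse_event(encoded)


# Usage with a fake capture function and a no-op sleep.
events = list(stream_screenshots(lambda: b"\x89PNG", frames=2, sleep=lambda _s: None))
print(events[0], end="")
```

Base64 keeps each frame on a single `data:` line, which matters because a raw newline inside the payload would split the SSE message.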

Security

Browser sessions are secured with multiple layers:

SSRF protection — dual-stack DNS resolution blocks navigation to private IPs, metadata endpoints, and DNS rebinding domains.
Session isolation — each session runs in its own Chromium instance, scoped to a single team.
Resource limits — max 2 concurrent sessions per team (~150MB per Chromium instance).
Auto-cleanup — 10-minute idle timeout and 30-minute maximum duration.
Role-based access — only team members and admins can create browser sessions.
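
The SSRF check in the first item can be sketched with Python's standard library. This is a simplified illustration under stated assumptions, not YokeBot's implementation: the function names are hypothetical, and the blocklist is approximated with the `ipaddress` module's built-in classifications. It resolves both IPv4 and IPv6 records and rejects the URL if any resolved address is non-public.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlparse


def is_blocked_ip(ip: str) -> bool:
    """True for private, loopback, link-local (e.g. 169.254.169.254),
    reserved, multicast, or unspecified addresses."""
    addr = ipaddress.ip_address(ip.split("%")[0])  # drop any IPv6 scope id
    # Judge IPv4-mapped IPv6 (::ffff:10.0.0.1) by its embedded IPv4 address.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return (addr.is_private or addr.is_loopback or addr.is_link_local
            or addr.is_reserved or addr.is_multicast or addr.is_unspecified)


def resolve_all(host: str) -> List[str]:
    """Resolve both A and AAAA records (dual-stack)."""
    infos = socket.getaddrinfo(host, None, family=socket.AF_UNSPEC,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def is_url_allowed(url: str,
                   resolver: Callable[[str], List[str]] = resolve_all) -> bool:
    """Allow only http(s) URLs whose every resolved address is public."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addresses = resolver(parsed.hostname)
    except socket.gaierror:
        return False
    # A single private record is enough to block the navigation.
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)


print(is_url_allowed("http://127.0.0.1:8080/admin"))
```

A check at resolution time alone does not stop DNS rebinding, because the hostname can resolve to a different address when the browser connects. A fuller defense would also pin the connection to the address that was validated, which this sketch does not attempt.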