DolosAgent: Vision Agent Beta
Today I released DolosAgent into public beta. Dolos is a lightweight, vision-based browser agent that can follow and interpret user instructions to navigate and interact with a browser instance.
Link to Dolos: https://github.com/randelsr/dolosagent
Why Dolos?
Most browser automation tools rely on brittle CSS selectors that break with every UI update, and AI agents that depend on accessibility trees struggle with dynamic, JS-heavy apps. I needed something that could interact with enterprise chat interfaces (and modern web apps in general) the same way a human would: by seeing and understanding the page.
Dolos is a vision-enabled agent that uses ReAct reasoning to navigate and interact with a Chromium browser session. The design is based on Hugging Face's smolagents Reason + Act architecture for iterative planning and execution cycles.
I built Dolos because I needed a lightweight intelligent tool to test corporate/enterprise chat agent guardrails.
Core features
- Vision-First Navigation - Screenshot analysis for coordinate-based clicking (no CSS selectors)
- Multi-Provider LLMs - OpenAI, Anthropic, or Google via the Vercel AI SDK
- ReAct Framework - Reason + Act loop with planning and loop detection
- Conversational Mode - Persistent memory across multiple tasks
- Human-Like Typing - Configurable delay between keystrokes (see the sketch after this list)
- State Change Detection - Automatically waits for page updates
- Full Verbosity - Complete transparency into LLM reasoning and decisions
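As an aside on the human-like typing feature: a minimal sketch of per-keystroke delays using Playwright's keyboard API might look like the following. The `typeLikeHuman` helper and its jitter range are hypothetical, not Dolos's actual implementation.

```typescript
import type { Page } from "playwright";

// Hypothetical helper: emit one character at a time with a randomized
// delay so keystrokes don't arrive at machine speed.
async function typeLikeHuman(page: Page, text: string, baseDelayMs = 80): Promise<void> {
  for (const char of text) {
    await page.keyboard.type(char);
    // Jitter each keystroke between 50% and 150% of the base delay.
    await page.waitForTimeout(baseDelayMs * (0.5 + Math.random()));
  }
}
```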
Quick Start
```bash
git clone https://github.com/randelsr/dolosagent
cd dolosagent
npm install && npm run build && npm link

# Configure API keys
cp .env.example .env
# Add your OPENAI_API_KEY or ANTHROPIC_API_KEY

# Start conversational mode
dolos chat -u "https://salesforce.com" -t 'click on the "Ask Agentforce anything" button in the header, then type "hello world" and press enter'
```
Design/architecture notes
Dolos follows the proven ReAct framework. Here's how the ReAct loop breaks down:
- OBSERVE → Capture screenshot + extract DOM elements
- DETECT → Check if page state changed
- PLAN → Reflect on progress (every N steps)
- THINK → LLM analyzes and decides next action
- ACT → Execute tool (click, type, navigate, etc.)
- REMEMBER → Store action with actual results
- REPEAT → Until task complete or max steps
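To make the cycle concrete, here is a minimal sketch of how such a loop could be wired up. All of the helper names (`observe`, `decideNextAction`, and so on) are hypothetical stand-ins for Dolos's internals, not its real API.

```typescript
// Hypothetical helpers standing in for Dolos's internals, not its real API.
declare function observe(): Promise<string>; // OBSERVE: screenshot + DOM summary
declare function stateChanged(observation: string): Promise<boolean>; // DETECT
declare function plan(task: string, memory: AgentStep[]): Promise<void>; // PLAN
declare function decideNextAction(
  task: string,
  observation: string,
  memory: AgentStep[],
): Promise<{ action: string; args: Record<string, unknown> }>; // THINK
declare function executeTool(
  action: string,
  args: Record<string, unknown>,
): Promise<string>; // ACT

interface AgentStep {
  action: string;
  args: Record<string, unknown>;
  result: string; // actual tool output, kept so planning sees real outcomes
}

async function runReActLoop(task: string, maxSteps = 20): Promise<void> {
  const memory: AgentStep[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const observation = await observe();
    if (step > 0 && !(await stateChanged(observation))) continue; // page not updated yet
    if (step % 5 === 0) await plan(task, memory); // reflect every N steps
    const next = await decideNextAction(task, observation, memory);
    if (next.action === "done") return; // REPEAT until task complete or max steps
    const result = await executeTool(next.action, next.args); // ACT
    memory.push({ ...next, result }); // REMEMBER
  }
}
```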
Why vision-first?
Traditional approaches use Playwright's accessibility snapshots, but these fail with delayed JavaScript interactions and dynamic content. Modern vision models (like GPT-5, Claude Sonnet, Gemini) excel at visual understanding. Dolos combines vision analysis for spatial understanding with text LLMs for decision-making. This sacrifices some cost efficiency for reliability and zero configuration.
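As a concrete illustration, here is roughly what a vision query could look like through the Vercel AI SDK. The model ID, prompt, and file path are placeholders; this is a sketch of the pattern, not Dolos's actual prompt or code.

```typescript
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { readFileSync } from "node:fs";

// Send the current screenshot plus the task to a vision-capable model.
// Model ID and prompt are placeholders, not Dolos's actual values.
const { text } = await generateText({
  model: anthropic("claude-sonnet-4-20250514"),
  messages: [
    {
      role: "user",
      content: [
        { type: "image", image: readFileSync("screenshot.png") },
        {
          type: "text",
          text: "Where is the search button? Reply with x,y pixel coordinates.",
        },
      ],
    },
  ],
});
console.log(text); // e.g. "412,88"
```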
Why no CodeAgent?
While I've always been impressed with smolagents' CodeAgent approach, I opted to experiment with structured JSON tool calling, as this has become the industry standard with MCP and various other agent frameworks. As LLMs become more capable at instruction following, the need for the CodeAgent approach diminishes.
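For a sense of what structured tool calling looks like in practice, here is a sketch of a click tool declared with the Vercel AI SDK's `tool` helper and a Zod schema. The tool name and fields are illustrative, not Dolos's actual schemas, and newer SDK versions may name the schema field differently.

```typescript
import { tool } from "ai";
import { z } from "zod";

// Illustrative tool definition; Dolos's actual tool schemas may differ.
const clickTool = tool({
  description: "Click at pixel coordinates on the current page",
  parameters: z.object({
    x: z.number().describe("Horizontal pixel coordinate"),
    y: z.number().describe("Vertical pixel coordinate"),
    reason: z.string().describe("Why this element is being clicked"),
  }),
  execute: async ({ x, y, reason }) => {
    // In a real agent this would call page.mouse.click(x, y) via Playwright.
    return `clicked (${x}, ${y}): ${reason}`;
  },
});
```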
Key Technical Decisions
- Coordinate-based only - No CSS selectors, no XPath, pure visual clicking
- Dual-model support - Use cheap vision models (Gemini 2.5 Flash) with capable reasoning models (GPT-5) for cost optimization
- Tool result capture - Planning phase sees actual outcomes, not just "executed"
- State hashing - Detects when pages haven't updated (waiting for chat responses, etc.); see the sketch below
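One straightforward way to implement that detection is to fingerprint the page and compare hashes across iterations. The sketch below hashes the screenshot bytes plus the visible text; treat it as an illustrative approach rather than the exact signal Dolos uses.

```typescript
import { createHash } from "node:crypto";
import type { Page } from "playwright";

// Illustrative state fingerprint: hash the screenshot bytes plus the
// page's visible text. If the hash repeats, the page hasn't updated.
async function pageStateHash(page: Page): Promise<string> {
  const screenshot = await page.screenshot();
  const text = await page.evaluate(() => document.body.innerText);
  return createHash("sha256").update(screenshot).update(text).digest("hex");
}

// Poll until the state changes or a timeout elapses (e.g. while a
// chat agent is still composing its reply).
async function waitForStateChange(page: Page, lastHash: string, timeoutMs = 15_000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const hash = await pageStateHash(page);
    if (hash !== lastHash) return hash;
    await page.waitForTimeout(500);
  }
  return lastHash; // unchanged: caller decides whether to keep waiting or act
}
```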
Use Cases
- Testing chat agent guardrails - original motivation
- E2E testing without brittle selectors - visual regression testing
- Web scraping dynamic content - no need to reverse-engineer API calls
- Accessibility auditing - see what vision models understand
- Research & experimentation - full verbosity shows LLM decision-making
Competitors
It's worth identifying existing tools that meet, exceed, or provide tangential functionality to Dolos:
- skyvern
- browser-use
- OpenAI Operator
- Claude for Chrome
- MultiOn
What's next?
I'm actively developing Dolos and would love your feedback:
- What tasks are you automating?
- Which features would make this more useful?
- What providers/models should I prioritize?
Current Limitations
- Token costs - Vision models aren't free; dual-model setup helps
- Speed - Vision analysis + LLM reasoning takes time
- Accuracy - Coordinates can be off; visual markers help verify
- No mobile support - Chromium desktop only for now
Comms
Try it out: https://github.com/randelsr/dolosagent
Report issues: https://github.com/randelsr/dolosagent/issues