DolosAgent: Vision Agent Beta
Today I released DolosAgent into public beta. Dolos is a lightweight, vision-based browser agent that can follow and interpret user instructions to navigate and interact with a browser instance.
Link to Dolos: https://github.com/randelsr/dolosagent
Why Dolos?
Most browser automation tools rely on brittle CSS selectors that break with every UI update, and AI agents that depend on accessibility trees struggle with dynamic, JS-heavy apps. I needed something that could interact with enterprise chat interfaces (and modern web apps in general) the same way a human would: by seeing and understanding the page.
Dolos is a vision-enabled agent that uses ReAct reasoning to navigate and interact with a Chromium browser session. The design is based on Hugging Face's smolagents Reason + Act architecture for iterative planning and execution cycles.
I built Dolos because I needed a lightweight intelligent tool to test corporate/enterprise chat agent guardrails.
Core features
- Vision-First Navigation - Screenshot analysis for coordinate-based clicking (no CSS selectors)
- Multi-Provider LLMs - OpenAI, Anthropic, or Google via the Vercel AI SDK
- ReAct Framework - Reason + Act loop with planning and loop detection
- Conversational Mode - Persistent memory across multiple tasks
- Human-Like Typing - Configurable delay between keystrokes (see the sketch after this list)
- State Change Detection - Automatically waits for page updates
- Full Verbosity - Complete transparency into LLM reasoning and decisions
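As an aside on the human-like typing feature: a minimal sketch of per-keystroke delays using Playwright's keyboard API might look like the following. The `typeLikeHuman` helper and its jitter range are hypothetical, not Dolos's actual implementation.

```typescript
import type { Page } from "playwright";

// Hypothetical helper: emit one character at a time with a randomized
// delay so keystrokes don't arrive at machine speed.
async function typeLikeHuman(page: Page, text: string, baseDelayMs = 80): Promise<void> {
  for (const char of text) {
    await page.keyboard.type(char);
    // Jitter each keystroke between 50% and 150% of the base delay.
    await page.waitForTimeout(baseDelayMs * (0.5 + Math.random()));
  }
}
```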
Quick Start
```bash
git clone https://github.com/randelsr/dolosagent
cd dolosagent
npm install && npm run build && npm link

# Configure API keys
cp .env.example .env
# Add your OPENAI_API_KEY or ANTHROPIC_API_KEY

# Start conversational mode
dolos chat -u "https://salesforce.com" -t 'click on the "Ask Agentforce anything" button in the header, then type "hello world" and press enter'
```
Design/architecture notes
Dolos follows the proven ReAct framework. Here's how the ReAct loop breaks down:
- OBSERVE → Capture screenshot + extract DOM elements
- DETECT → Check if page state changed
- PLAN → Reflect on progress (every N steps)
- THINK → LLM analyzes and decides next action
- ACT → Execute tool (click, type, navigate, etc.)
- REMEMBER → Store action with actual results
- REPEAT → Until task complete or max steps
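To make the cycle concrete, here is a minimal sketch of how such a loop could be wired up. All of the helper names (`observe`, `decideNextAction`, and so on) are hypothetical stand-ins for Dolos's internals, not its real API.

```typescript
// Hypothetical helpers standing in for Dolos's internals, not its real API.
declare function observe(): Promise<string>; // OBSERVE: screenshot + DOM summary
declare function stateChanged(observation: string): Promise<boolean>; // DETECT
declare function plan(task: string, memory: AgentStep[]): Promise<void>; // PLAN
declare function decideNextAction(
  task: string,
  observation: string,
  memory: AgentStep[],
): Promise<{ action: string; args: Record<string, unknown> }>; // THINK
declare function executeTool(
  action: string,
  args: Record<string, unknown>,
): Promise<string>; // ACT

interface AgentStep {
  action: string;
  args: Record<string, unknown>;
  result: string; // actual tool output, kept so planning sees real outcomes
}

async function runReActLoop(task: string, maxSteps = 20): Promise<void> {
  const memory: AgentStep[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const observation = await observe();
    if (step > 0 && !(await stateChanged(observation))) continue; // page not updated yet
    if (step % 5 === 0) await plan(task, memory); // reflect every N steps
    const next = await decideNextAction(task, observation, memory);
    if (next.action === "done") return; // REPEAT until task complete or max steps
    const result = await executeTool(next.action, next.args); // ACT
    memory.push({ ...next, result }); // REMEMBER
  }
}
```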
Why vision-first?
Traditional approaches use Playwright's accessibility snapshots, but these fail with delayed JavaScript interactions and dynamic content. Modern vision models (like GPT-5, Claude Sonnet, Gemini) excel at visual understanding. Dolos combines vision analysis for spatial understanding with text LLMs for decision-making. This sacrifices some cost efficiency for reliability and zero configuration.
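As a concrete illustration, here is roughly what a vision query could look like through the Vercel AI SDK. The model ID, prompt, and file path are placeholders; this is a sketch of the pattern, not Dolos's actual prompt or code.

```typescript
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { readFileSync } from "node:fs";

// Send the current screenshot plus the task to a vision-capable model.
// Model ID and prompt are placeholders, not Dolos's actual values.
const { text } = await generateText({
  model: anthropic("claude-sonnet-4-20250514"),
  messages: [
    {
      role: "user",
      content: [
        { type: "image", image: readFileSync("screenshot.png") },
        {
          type: "text",
          text: "Where is the search button? Reply with x,y pixel coordinates.",
        },
      ],
    },
  ],
});
console.log(text); // e.g. "412,88"
```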
Why no CodeAgent?
While I've always been impressed with smolagents' CodeAgent approach, I opted to experiment with structured JSON tool calling, as this has become the industry standard with MCP and various other agent frameworks. As LLMs become more capable at instruction following, the need for the CodeAgent approach diminishes.
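For a sense of what structured tool calling looks like in practice, here is a sketch of a click tool declared with the Vercel AI SDK's `tool` helper and a Zod schema. The tool name and fields are illustrative, not Dolos's actual schemas, and newer SDK versions may name the schema field differently.

```typescript
import { tool } from "ai";
import { z } from "zod";

// Illustrative tool definition; Dolos's actual tool schemas may differ.
const clickTool = tool({
  description: "Click at pixel coordinates on the current page",
  parameters: z.object({
    x: z.number().describe("Horizontal pixel coordinate"),
    y: z.number().describe("Vertical pixel coordinate"),
    reason: z.string().describe("Why this element is being clicked"),
  }),
  execute: async ({ x, y, reason }) => {
    // In a real agent this would call page.mouse.click(x, y) via Playwright.
    return `clicked (${x}, ${y}): ${reason}`;
  },
});
```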
Key Technical Decisions
- Coordinate-based only - No CSS selectors, no XPath, pure visual clicking
- Dual-model support - Use cheap vision models (Gemini 2.5 Flash) with capable reasoning models (GPT-5) for cost optimization
- Tool result capture - Planning phase sees actual outcomes, not just "executed"
- State hashing - Detects when pages haven't updated (waiting for chat responses, etc.); see the sketch below
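One straightforward way to implement that detection is to fingerprint the page and compare hashes across iterations. The sketch below hashes the screenshot bytes plus the visible text; treat it as an illustrative approach rather than the exact signal Dolos uses.

```typescript
import { createHash } from "node:crypto";
import type { Page } from "playwright";

// Illustrative state fingerprint: hash the screenshot bytes plus the
// page's visible text. If the hash repeats, the page hasn't updated.
async function pageStateHash(page: Page): Promise<string> {
  const screenshot = await page.screenshot();
  const text = await page.evaluate(() => document.body.innerText);
  return createHash("sha256").update(screenshot).update(text).digest("hex");
}

// Poll until the state changes or a timeout elapses (e.g. while a
// chat agent is still composing its reply).
async function waitForStateChange(page: Page, lastHash: string, timeoutMs = 15_000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const hash = await pageStateHash(page);
    if (hash !== lastHash) return hash;
    await page.waitForTimeout(500);
  }
  return lastHash; // unchanged: caller decides whether to keep waiting or act
}
```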
Use Cases
- Testing chat agent guardrails - original motivation
- E2E testing without brittle selectors - visual regression testing
- Web scraping dynamic content - no need to reverse-engineer API calls
- Accessibility auditing - see what vision models understand
- Research & experimentation - full verbosity shows LLM decision-making
Competitors
It's worth identifying existing tools that meet, exceed, or provide tangential functionality to Dolos:
- skyvern
- browser-use
- OpenAI Operator
- Claude for Chrome
- MultiOn
What's next?
I'm actively developing Dolos and would love your feedback:
- What tasks are you automating?
- Which features would make this more useful?
- What providers/models should I prioritize?
Current Limitations
- Token costs - Vision models aren't free; dual-model setup helps
- Speed - Vision analysis + LLM reasoning takes time
- Accuracy - Coordinates can be off; visual markers help verify
- No mobile support - Chromium desktop only for now
Comms
Try it out: https://github.com/randelsr/dolosagent
Report issues: https://github.com/randelsr/dolosagent/issues