Autonomous…ish: Why Two Newcomers Lapped Jules and Devin on Real Work

By David Proctor • August 27, 2025
The meaningful split in AI coding assistants is simpler than you think: synchronous vs. asynchronous. Synchronous tools offer pair programming in your IDE with fast back-and-forth, while asynchronous agents promise to handle scoped goals independently: planning, running, testing, and opening PRs while you focus elsewhere.
That second model is the dream: real autonomy, not just clever autocomplete. But which tools actually deliver on this promise?
Understanding Synchronous vs Asynchronous AI
One of the easiest ways to understand the difference between these approaches is to think about parenting:
Synchronous agents require your active participation. You ask, it responds, you refine, it adjusts. Like helping a young child with chores - you must guide them through each step: "pick up your shoes," "put them in the closet," "now close the door." Progress only happens while you're standing there.
Asynchronous agents offer a different experience. You set the task, agree on the plan, then step away. The agent works independently, checking in only when needed. It's like when a child gets older - you can say, "Clean your room before I get back" and trust the work will be done without constant supervision.
This shift from step-by-step guidance to delegated autonomy is what makes asynchronous AI so exciting. It frees you from being the bottleneck.
The Key Players in Asynchronous AI
GitHub Copilot Agent
Agent Mode offers expanded capabilities beyond standard Copilot
Devin (Cognition AI)
Marketed as an autonomous AI software engineer
Cline
Focused on autonomous coding capabilities
Cursor Agent
Expanded functionality beyond standard code assistance
Jules (Google)
Google's flagship asynchronous AI coding agent powered by Gemini 2.5 Pro
Genspark & AbacusAI
Later additions to the test group that delivered surprising results
This article focuses primarily on Jules, Devin, Genspark, and AbacusAI, testing their ability to deliver on the promise of true asynchronous autonomy.
The Surprising Results
The central question: could these agents truly deliver on the promise of asynchronous autonomy? The short answer is: no.
In testing, Genspark and Abacus.AI consistently outperformed Jules and Devin, not just by finishing more tasks but by delivering more polished results that were closer to production-ready.
While all four agents show potential, none are yet at a level where you could hand them a task and walk away. Instead, they require supervision similar to a synchronous chatbot, defeating the core promise of asynchronous work.
The autonomy that makes these tools exciting turned out to be the more frustrating part in practice: the asynchronous agents tended to ask more complex questions that required more context to answer. If you are going to be interrupted with questions anyway, they might as well be quick and easy to answer.
Benchmark Phase 1: Messy Repo Challenge
Task 1: Modernize & CI Green 🔄️
Upgrade dependencies, fix two failing tests, and resolve security vulnerabilities.
Task 2: Deflake & Harden 🛡️
Identify and fix a flaky test, and add basic input validation to an API endpoint.
Task 3: Issue-driven Feature 🚀
Implement a new feature from a GitHub issue, including API, UI, tests, and documentation.
The test began with an intentionally messy monorepo containing outdated dependencies, failing tests, and missing validation. The results were stark:
Jules & Devin: Unable to complete Tasks 1 and 2, running into environment and setup issues. They required so much handholding that the initial test was abandoned before they could attempt Task 3.
Genspark & Abacus.AI: Successfully completed Tasks 1 and 2 with only minor issues. Both agents were able to complete Task 3, building a new feature with API, UI, tests, and documentation.
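To make Task 2 concrete, here is a minimal Python sketch of the two kinds of change it asks for. It is not taken from the benchmark repo, whose endpoint and test are not described in this article; `validate_item`, `pick_winner`, and the `name` and `quantity` fields are hypothetical stand-ins. The first function rejects malformed input before it reaches any business logic, and the second shows one common deflaking move: injecting a seeded random source so a test no longer depends on chance.

```python
import random


def validate_item(payload: dict) -> list:
    """Return a list of validation errors for a hypothetical request body."""
    errors = []
    name = payload.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append("name must be a non-empty string")
    quantity = payload.get("quantity")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 1:
        errors.append("quantity must be a positive integer")
    return errors


def pick_winner(entries, rng=None):
    """Pick one entry at random; tests can pass a seeded RNG for repeatability."""
    if rng is None:
        rng = random.Random()
    return rng.choice(entries)


def test_pick_winner_is_repeatable_with_a_seed():
    entries = ["a", "b", "c"]
    # Two generators with the same seed produce the same choice,
    # so the assertion cannot fail intermittently.
    first = pick_winner(entries, random.Random(42))
    second = pick_winner(entries, random.Random(42))
    assert first == second
    assert first in entries


test_pick_winner_is_repeatable_with_a_seed()
print(validate_item({"name": "", "quantity": 0}))
```

In a real endpoint the error list would typically be turned into a 4xx response; the point of the sketch is only the shape of the checks and the seeded test.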
Benchmark Phase 2: Feature Build (Prediction App)
After the struggles of the messy repo challenge, all four agents were given a clean-slate challenge: build a full-stack "Prediction App" from scratch. The prompt included detailed specs for a SQLite database, a FastAPI backend with Brier scoring, a Next.js UI, tests, and documentation.
While all four agents produced a working app, the difference in quality was massive:
Genspark and AbacusAI delivered substantially more polished and user-friendly experiences. Their designs were modern and intuitive, with clear flows and better UX. AbacusAI's ability to build and deploy the entire application within its own ecosystem was particularly impressive.
Jules and Devin both produced working apps, but the final products were rough and sparse. Jules' UI looked like "a basic form designed in InfoPath 2003," while Devin's was a minimalist wireframe.
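For readers unfamiliar with the Brier scoring named in the Prediction App spec: the Brier score is the mean squared difference between forecast probabilities and the 0/1 outcomes, so 0 is a perfect score and lower is better. The following is a minimal Python sketch of that calculation, not code from any of the four agents' repositories; the function name `brier_score` and the sample values are illustrative assumptions.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes.

    Hypothetical helper for illustration; not taken from the benchmark repos.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("at least one forecast is required")
    for p in forecasts:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"forecast {p!r} is not a probability in [0, 1]")
    for o in outcomes:
        if o not in (0, 1):
            raise ValueError(f"outcome {o!r} must be 0 or 1")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Three made-up predictions: 90%, 20%, and 50% confidence that an event happens,
# scored against what actually happened (1 = it happened, 0 = it did not).
score = brier_score([0.9, 0.2, 0.5], [1, 0, 1])
print(round(score, 3))
```

With these sample values the score comes out at about 0.1. A forecaster who always answers 0.5 scores 0.25, which is a useful baseline when reading the number.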
The Missing Pieces: Manual Debugging
Even with the feature build, manual debugging was required for all agents. Common issues like incorrect Docker flags and import errors still needed to be addressed, preventing a true "set-it-and-forget-it" experience.
This highlights the current limitation of asynchronous AI agents - while they can handle significant portions of development work, they still require human oversight and intervention to ensure a final, production-ready result.
The promise of truly autonomous development remains unfulfilled. For now, these agents are best seen as powerful, but not yet fully independent, collaborators.
Comparative Summary
The results from this benchmark are clear. While the field of autonomous AI agents is still in its infancy, there is already a significant performance gap between the two newcomers and the two better-known agents.
Genspark and AbacusAI are materially ahead of Jules and Devin in both their ability to handle complex repo-scale tasks and in their final product quality - a surprising result for the author.
They are capable of executing more robust, end-to-end changes with a higher degree of autonomy, though still not at the level of true "set it and forget it" development.
Conclusion: The Future of AI Development
Current State
The core promise of truly asynchronous development remains unfulfilled. All agents still require human oversight and intervention to ensure a final, production-ready result.
Clear Leaders
Genspark and AbacusAI demonstrated significantly better performance than Jules and Devin, both in handling complex tasks and delivering quality results.
Future Potential
This field will likely evolve rapidly. Today's agents are powerful collaborators, but not yet the fully autonomous developers promised in marketing materials.
For those interested in exploring further, all repositories from the benchmark are publicly available:
Jules: Messy repo | Prediction app
Devin: Messy repo | Prediction app
Genspark: Messy repo | Prediction app
AbacusAI: Messy repo | Prediction app