Much has changed in the software development world since I was a product manager in the early aughts in Empirix’s web test and monitoring business (acquired by Oracle in 2006). But much has stayed the same: the pressure to innovate faster while reducing costs remains relentless.
For developers, the latest AI agent workflows are advancing beyond the capabilities of AI assistants to transform requirements definition, code generation, unit testing, and every other phase of the development cycle. Adopting these tools can reduce toil and enable development teams to achieve new levels of speed and quality. At the same time, no single product is a perfect fit for every task in the SDLC.
Large language model–based code assistants like Copilot have captured enormous mindshare in the last two years, not to mention a significant chunk of IT spending among early adopters. As technology leaders start to assess the return on their LLM investments, the feedback is mixed. For some users, there is a non-trivial productivity bump. But real-world reports of genuine transformation seem scarce, and the mileage varies across seniority levels, with improvements for more experienced developers particularly underwhelming. This growing realization is leading many companies to take a more nuanced view of their AI tooling needs, and they’re increasingly considering a best-of-breed approach incorporating specialized tools where warranted. Bring on the agents!
With Gartner declaring that Agentic AI is number one on their list of Top 10 Strategic Technology Trends for 2025, you can be sure that the marketing departments in every software company are feverishly trying to catch the wave, even if all they have to offer is a bunch of Excel macros. Those are agents, right? Wrong. Let’s clarify.
AI Assistants vs. AI Agents
When it comes to AI applied to writing code, the core distinction between assistants and agents mirrors the difference between collaboration and delegation. While both modes of operation can create value when applied to the appropriate challenge, they are entirely different experiences with different resource requirements. Most importantly, the value creation opportunities from fully delegating large swaths of work to a trustworthy, competent, autonomous agent will dwarf what collaboration with AI assistants can deliver.
AI assistants are helpful for various day-to-day coding tasks, offering real-time support and information. These tools are designed to collaborate with developers, providing suggestions as they code, integrating seamlessly into popular Integrated Development Environments (IDEs), and offering features that enhance the coding experience. Some key use cases for AI assistants in development include code completion, error detection, documentation lookup, and refactoring suggestions.
AI agents, meanwhile, can excel at automating complex processes and making high-level decisions almost entirely unsupervised, revolutionizing how projects are managed and executed. Designed to perform complex tasks with little to no human intervention, AI agents take automation a step further and enable full delegation.
However, just as in the analog world, delegation only works when the agent being delegated to is competent and trustworthy.
Assistants vs. Agents: A Case Study
Let’s explore a real-world use case to illustrate the differences between AI assistants and agents and, in doing so, shine a bright light on what it means to collaborate versus delegate.
Unit tests. Generally loathed by developers, unit tests are still widely regarded as the foundation of an effective software quality assurance regime. Writing them can consume a quarter to a third of a developer’s coding cycles, and the effort is widely viewed as scutwork. The allure of reducing the time spent developing unit tests is strong, and many developers with access to LLM-based coding assistants understandably want to outsource this task to them. Does it work in practice?
The answer is sometimes. But it can also fail quite spectacularly in some scenarios.
Researchers at Diffblue (full disclosure: I’m CEO there) took three open source projects representative of real-world applications and used an LLM-based coding assistant (GitHub Copilot) to generate unit tests for them. They came away with the following three observations:
- It beats the status quo: The rule of thumb is that it takes roughly 15 minutes of manual development to create a solid unit test. Of course, your mileage will vary depending on the skill of the developer and the complexity of the code being tested, but 15 minutes per test is a reasonable estimate. By collaborating with Copilot, the researchers developed a test every 26 seconds, roughly 35 times faster per test than the manual baseline. On the surface, that is a significant productivity improvement. Put a pin in that.
- It’s still real work: Collaborating with Copilot required our developer/researcher to stay continuously engaged in an iterative, interactive, fully attended process. After prompting the tool to generate tests at the method or class level, the output had to be evaluated, executed, tweaked, retested, and eventually accepted by the user.
- There’s still plenty of scutwork: The study found that, depending on the project, between 30% and 45% of the LLM-generated unit tests failed to compile or pass. As a result, the user had to intervene and fix them manually, getting dragged back into the toil they were striving to avoid in the first place. (A hypothetical sketch of that fix-up cycle follows this list.)
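To make that fix-up cycle concrete, here is a hypothetical sketch, not drawn from the study: an assistant may suggest a JUnit test that guesses at a constructor the class under test doesn’t actually have, leaving the developer to rewrite it against the real API. The `Invoice` class and method names are invented purely for illustration.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical class under test, invented for this illustration.
class Invoice {
    private double subtotal;
    private double taxRate;

    void setSubtotal(double subtotal) { this.subtotal = subtotal; }
    void setTaxRate(double taxRate) { this.taxRate = taxRate; }
    double total() { return subtotal * (1 + taxRate); }
}

class InvoiceTest {
    // An assistant-suggested test might assume a convenience constructor,
    // e.g. new Invoice(100.0, 0.2), that does not exist and so will not compile.
    // The hand-corrected version below uses the API the class actually exposes.
    @Test
    void totalIncludesTax() {
        Invoice invoice = new Invoice();
        invoice.setSubtotal(100.0);
        invoice.setTaxRate(0.2);
        assertEquals(120.0, invoice.total(), 0.001);
    }
}
```

Each correction of this kind is small, but multiplied across hundreds of generated tests, it is exactly the attended, iterative work the developer was hoping to hand off.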
When applied to the same challenge, an effective agent can deliver categorically different results. Specifically, using Diffblue’s unit test writing agent on these three open source projects delivered an entirely different experience and outcome.
- First, the developer could truly delegate the task of generating a complete set of unit tests for each project. He pointed the agent at the relevant repositories, said “Go,” and let the agent do its thing for a few hours. There was no need to interact with the agent or monitor its output continuously. Instead, the developer was freed up to work on other priorities.
- Second, the resulting output from the agent was both quantitatively and qualitatively superior. Because of the underlying generative AI technology (reinforcement learning coupled with static analysis and dynamic code execution), every test created was guaranteed to compile and pass, eliminating the tedious fiddling to make tests work that was a defining part of the Copilot experience. The tests were also of higher quality than the Copilot ones, achieving average mutation testing scores of 68% vs. 63%. (A brief sketch of what mutation testing measures follows this list.)
- Third, the total lines of code covered by tests generated by the agent were four times that of the developer using Copilot when both approaches were given the same amount of time. But the agent’s productivity edge balloons to a 26x advantage when you consider that it will operate around the clock every day of the year, whereas a developer can only reasonably be expected to write tests for six hours per working day, go on vacations, and take sick days. That works out to roughly 8,760 agent-hours a year against something like 1,350 developer test-writing hours, a factor of about 6.5 on top of the 4x coverage advantage. This massive productivity advantage doesn’t even account for the fact that many developers would probably quit if asked to do nothing but write unit tests every day.
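For readers unfamiliar with the metric: mutation testing tools such as PIT make small changes (“mutants”) to the code under test, for example flipping a comparison operator, and rerun the test suite; the mutation score is the share of mutants the tests catch. Below is a minimal, invented illustration (the `Discount` class is hypothetical, not taken from the study) of a test that kills such a mutant.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

// Hypothetical code under test, invented for this illustration.
class Discount {
    // Orders of 100.0 or more qualify for a discount.
    static boolean qualifies(double orderTotal) {
        return orderTotal >= 100.0;
    }
}

class DiscountTest {
    // A mutation testing tool might mutate ">=" into ">" and rerun the suite.
    // A weak test that only checks qualifies(150.0) would still pass against
    // that mutant (the mutant survives); the boundary assertions below fail
    // against it (the mutant is killed), which is what raises the mutation score.
    @Test
    void qualifiesExactlyAtTheBoundary() {
        assertTrue(Discount.qualifies(100.0));   // fails if ">=" is mutated to ">"
        assertFalse(Discount.qualifies(99.99));  // guards the lower side of the boundary
    }
}
```

A higher mutation score therefore indicates tests that genuinely constrain the code’s behavior, not just tests that execute a lot of lines.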
Conclusion
LLM-based code assistants are a tremendous innovation with a wide range of value-creating applications. However, they also have significant limitations in certain use cases. The imprecision of LLMs, and sometimes their outright inaccuracies, can be mitigated by sustained developer engagement in analyzing and tuning the outputs. But when full automation of a broad array of tasks is the goal, the assistant model cannot scale. This is where delegating to agents can deliver transformative value, and that value is only fully realized if the agent’s output is consistently trustworthy. Determinism, transparency, and accuracy will become the new currency of agents that earn the right to become our delegates.