Agentic development infrastructure for a software company
15 April 2026 • Software / SaaS
Most teams using AI for development treat it as a faster keyboard: autocomplete, inline suggestions, chat assistants. The productivity gains are real but modest - research shows 20-40% improvements on specific tasks. This engagement took a different approach: AI agents as autonomous team members with defined roles, bounded responsibilities, and human checkpoints at critical decision points.
The challenge
Our client is a European SaaS company with a four-person development team. They had a common problem: more product ideas than engineering capacity to deliver them.
The symptoms were familiar:
- Shipping 1-2 features per two-week sprint, regardless of backlog priority
- Senior developers spending 40%+ of their time on code review rather than architecture
- A growing backlog of well-scoped features that simply could not be reached
- Hiring was slow and expensive - and would not solve the underlying throughput constraint
- Quality was high, but velocity had plateaued
The team had tried AI coding assistants. They helped with autocomplete and boilerplate, but the fundamental constraint remained: every feature still required a developer to own it from start to finish. The bottleneck was not typing speed. It was the number of parallel workstreams the team could sustain.
Our approach
We designed and deployed an agentic development infrastructure: a system of specialised AI agents that handle scoping, implementation, and review - with humans providing direction at key checkpoints rather than doing the work directly.
[Pipeline overview: Scoping Agent (task breakdown, specs) → Coding Agents (implementation, tests) → Review Agents (quality, security) → Pipeline (PR, CI/CD, deploy)]
Scoping Agent
- Breaks epics into implementable tasks
- Generates structured specifications
- Identifies dependencies and risks
- Human approves scope before coding

Coding Agents
- Multiple agents work in parallel
- Each agent owns one task/file
- Writes implementation + tests
- Self-validates before handoff

Review Agents
- Basic: style, patterns, tests
- Security: OWASP, secrets, injection
- Architecture: coupling, boundaries
- Flags issues for human decision

Pipeline Integration
- Auto-creates PRs with context
- Runs CI/CD on every change
- Human final approval gate
- Automated deployment on merge
The architecture
The system follows the Plan-Execute-Verify pattern that has emerged as a best practice for production agentic workflows. Deterministic orchestration controls the flow. Agents handle the creative work within bounded contexts. Automated verification catches issues before they reach human reviewers.
Four agent types work in sequence:
Scoping agents take high-level feature descriptions and produce structured specifications. They break work into implementable tasks, identify dependencies, flag risks, and estimate complexity. A human reviews and approves the scope before any code is written. This is the first checkpoint: agents propose, humans decide.
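To make "structured specification" concrete, here is a minimal sketch of what such a schema could look like, enforced with Pydantic. The type and field names (TaskSpec, acceptance_criteria, and so on) are our assumptions for this write-up, not the client's actual schema.

```python
# Illustrative task-spec schema; field names are assumptions, not the
# client's production schema.
from enum import Enum
from pydantic import BaseModel, Field


class Risk(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class TaskSpec(BaseModel):
    task_id: str
    title: str
    description: str
    files_touched: list[str] = Field(description="Files the task may modify")
    depends_on: list[str] = Field(default_factory=list, description="IDs of blocking tasks")
    acceptance_criteria: list[str]
    risk: Risk = Risk.LOW


class FeatureScope(BaseModel):
    """What the scoping agent hands to a human for approval."""
    epic: str
    tasks: list[TaskSpec]
    open_questions: list[str] = Field(default_factory=list)
```

Because the human approves a FeatureScope rather than raw prose, the first checkpoint reviews exactly the artifact the coding agents will consume.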
Coding agents take approved specifications and produce implementations. Multiple agents work in parallel, each owning a single task or file. They write code, generate tests, and self-validate against the specification before handing off. The key insight: agents excel at bounded problems with clear acceptance criteria. The scoping phase produces exactly that.
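Continuing the sketch above, the parallel fan-out might look like the following, with run_coding_agent standing in for the real agent invocation (its body is elided):

```python
# Hypothetical fan-out: one coding agent per approved task, run concurrently.
# Reuses the TaskSpec / FeatureScope sketch from the scoping section.
import asyncio


async def run_coding_agent(spec) -> dict:
    """Stand-in: call the model with the spec, run the generated tests,
    and return a patch plus self-validation results."""
    return {}  # real implementation elided


async def implement_feature(scope) -> list[dict]:
    # Only tasks with no unmet dependencies run in this wave; the scoping
    # agent's dependency analysis is what makes the parallelism safe.
    ready = [task for task in scope.tasks if not task.depends_on]
    return await asyncio.gather(*(run_coding_agent(task) for task in ready))
```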
Review agents operate in tiers. A fast basic reviewer checks style, patterns, and test coverage. A security-focused reviewer scans for OWASP vulnerabilities, hardcoded secrets, injection risks, and dependency vulnerabilities. An architecture reviewer checks for coupling violations and boundary breaches. Issues are flagged with context and severity, not silently fixed - humans make the call on anything non-trivial.
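A minimal sketch of what a tiered review pass could look like. The tier instructions and severity levels are illustrative, and run_review_agent is a stub for the model call:

```python
# Sketch of the tiered review pass: cheap checks first, then deeper ones.
from dataclasses import dataclass


@dataclass
class Finding:
    tier: str        # "basic" | "security" | "architecture"
    severity: str    # "info" | "warning" | "critical"
    file: str
    message: str


REVIEW_TIERS = [
    ("basic", "Check style, naming, and that every change has a test."),
    ("security", "Scan for OWASP top-10 patterns, secrets, and injection."),
    ("architecture", "Check module boundaries and unwanted coupling."),
]


def run_review_agent(tier: str, instructions: str, diff: str) -> list[Finding]:
    """Stand-in: the real reviewer calls the model and returns findings."""
    return []


def review(diff: str) -> list[Finding]:
    findings: list[Finding] = []
    for tier, instructions in REVIEW_TIERS:
        findings.extend(run_review_agent(tier, instructions, diff))
    # Nothing is silently fixed: findings go to a human with full context.
    return findings
```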
The security reviewer proved its value immediately. In the first month of operation, it identified three critical vulnerabilities that had been present in the codebase for over a year: an SQL injection vector in a legacy search endpoint, a hardcoded API key in a configuration file that had been committed years earlier, and an insecure deserialization pattern that could have allowed remote code execution. None of these had been caught in manual code review. The security agent now scans every change before it reaches human reviewers, and flags anything matching known vulnerability patterns with full context and remediation guidance.
Pipeline integration ties everything together. Approved code is automatically packaged into pull requests with full context: the original specification, implementation notes, review findings, and test results. CI/CD runs on every change. A human provides final approval before merge. Deployment is automated.
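For flavour, the PR-packaging step might be as simple as the sketch below, which shells out to the GitHub CLI. The branch handling and body layout are assumptions, and it reuses the FeatureScope and Finding types from the earlier sketches:

```python
# Hedged sketch: package an approved change into a context-rich PR
# via the GitHub CLI (gh).
import subprocess


def open_pr(branch: str, scope, findings) -> None:
    body = "\n\n".join([
        f"## Specification\n{scope.epic}",
        "## Review findings\n" + "\n".join(
            f"- [{f.tier}/{f.severity}] {f.file}: {f.message}" for f in findings
        ),
    ])
    subprocess.run(
        ["gh", "pr", "create", "--head", branch,
         "--title", scope.epic, "--body", body],
        check=True,
    )
```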
Technology choices
The agent orchestration runs on LangGraph, chosen for its explicit state management and checkpoint support. In a system where humans need to intervene at specific points, being able to pause, inspect, and resume agent workflows is non-negotiable.
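A minimal sketch of how such a graph can be wired up in LangGraph, assuming a simplified three-node flow; the node bodies are stand-ins for the agents described above:

```python
# Minimal LangGraph sketch of the Plan-Execute-Verify flow, with a pause
# before coding so a human can approve the scope.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver


class PipelineState(TypedDict):
    epic: str
    scope: dict      # structured spec from the scoping agent
    patches: list    # outputs of the coding agents
    findings: list   # review findings


def scope_node(state: PipelineState) -> dict:
    ...  # scoping agent: epic -> structured spec

def code_node(state: PipelineState) -> dict:
    ...  # coding agents: spec -> patches + tests

def review_node(state: PipelineState) -> dict:
    ...  # review agents: patches -> findings


builder = StateGraph(PipelineState)
builder.add_node("scope", scope_node)
builder.add_node("code", code_node)
builder.add_node("review", review_node)
builder.add_edge(START, "scope")
builder.add_edge("scope", "code")
builder.add_edge("code", "review")
builder.add_edge("review", END)

# Checkpointing lets a run pause before "code" until a human approves
# the scope, then resume from exactly where it stopped.
graph = builder.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["code"],
)
```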
The agents themselves are built on Claude (Anthropic), selected for strong performance on code generation and the ability to follow complex specifications without drift. Structured outputs ensure agent responses conform to expected schemas at every handoff.
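One common way to get schema-conforming output from Claude is to expose the schema as a tool and force the model to call it. A sketch, reusing the FeatureScope schema from above; the model version and prompt are illustrative:

```python
# Sketch: schema-enforced scoping output via Anthropic tool use.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model version
    max_tokens=2048,
    tools=[{
        "name": "emit_scope",
        "description": "Return the feature scope as structured data.",
        "input_schema": FeatureScope.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "emit_scope"},  # force the tool call
    messages=[{"role": "user", "content": "Scope this epic: ..."}],
)

# The forced tool call carries the structured payload; validate it.
tool_use = next(block for block in response.content if block.type == "tool_use")
scope = FeatureScope.model_validate(tool_use.input)
```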
MCP (Model Context Protocol) provides the integration layer between agents and development tools: repository access, CI/CD triggers, PR creation, and notification routing. This avoids building custom integrations for each tool and keeps the agent logic portable.
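For a flavour of that integration layer, here is a minimal MCP server exposing one hypothetical tool (trigger_ci) via the Python SDK's FastMCP helper; the CI call and URL are placeholders:

```python
# Sketch: exposing a development-tool action to agents over MCP.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("dev-tools")


@mcp.tool()
def trigger_ci(branch: str) -> str:
    """Kick off the CI pipeline for a branch and return the run URL."""
    ...  # call the CI system's API here (elided)
    return f"https://ci.example.com/runs/{branch}"  # placeholder URL


if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```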
The entire system runs as containerised services alongside the client's existing infrastructure. No external dependencies on agent hosting platforms. Full control over data residency and access patterns.
Human-in-the-loop design
The research is clear: fully autonomous AI development does not reach production quality. AI-generated code contains more defects than human-written code, and without review gates those defects compound.
We designed explicit human checkpoints at two points:
1. Scope approval - Before any code is written, a human reviews and approves the task breakdown. This catches misunderstandings early, when they are cheap to fix.
2. Final review - Before any code is merged, a human reviews the PR. The difference: instead of reviewing raw code, they review code that has already passed automated quality, security, and architecture checks. The human review focuses on business logic and product decisions, not style nits.
This is not a compromise. It is the target design. Agents handle volume. Humans handle judgment.
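To make the checkpoint mechanics concrete: with the LangGraph sketch from earlier, a run pauses at the scope gate and resumes on the same thread after approval. The thread_id and epic name below are illustrative:

```python
# Continuing the LangGraph sketch: pause at the scope gate, resume after
# human approval. The thread_id is arbitrary but must stay consistent.
config = {"configurable": {"thread_id": "feature-142"}}

# Runs the scope node, then stops at the interrupt before "code".
graph.invoke({"epic": "Self-serve billing portal"}, config)

# ... human inspects graph.get_state(config).values["scope"] and approves ...

# Passing None as input resumes from the checkpoint instead of restarting.
graph.invoke(None, config)
```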
Results
The system was deployed incrementally over eight weeks and has been running in production for four months.
- 8-13 features per sprint, up from 1-2
- 5x PR throughput with the same team size
- 3 critical vulnerabilities found in legacy code
- 70% less review time for senior developers
The team now ships 8-13 features per sprint, up from 1-2. Over the last three sprints, they delivered 9, 13, and 11 features respectively. This is not a one-time spike: the throughput has been consistent since deployment.
Senior developers have shifted from reviewing every line of code to reviewing pre-validated PRs and making architectural decisions. Their time on code review dropped from 40%+ to under 15%. The time recovered goes into system design, technical direction, and the complex problems that still require human expertise.
The backlog that was projected to take 18 months is now on track for completion in under four months. Features that had been deprioritised for years are shipping.
Quality has improved. The automated security scanning catches issues that manual review previously missed. Beyond the three critical legacy vulnerabilities, the security agent has flagged and prevented 14 additional security issues from reaching production in the four months since deployment - issues that would likely have slipped through human review. Zero security-related incidents since deployment.
Client feedback
“We went from mass-decline of product requests to actually completing our roadmap. The backlog used to feel infinite. Now we can see the end of it.”
“I thought I would spend all my time fixing agent mistakes. Instead I spend my time on architecture and hard problems. The agents handle the rest.”
Working on a similar challenge?
We build AI systems for defence and critical infrastructure clients across Northern Europe. Let's talk about what's possible for your environment.