TL;DR
  • Claude Code and OpenAI Codex are agentic coding tools, not autocomplete - they read your repo, plan changes, run tests, and propose diffs end to end
  • For small engineering teams, the real gain is in repetitive work: scaffolding, refactors, test backfill, documentation, and bug investigation
  • The trap is treating them as faster typists - the gain only lands when you redesign the review and merge workflow around agent-authored changes
  • Adoption needs guardrails: branch policies, secret scanning, allowed-tool lists, cost ceilings, and a clear human approval gate before merge
  • Most small teams can be running a useful agentic workflow within a week, with measurable uplift inside a month
2-4x: PR throughput uplift that teams report on small, well-tested repos after adopting agentic tools deliberately
Hours, not days: for green-field prototypes and one-off internal tools once an agent is wired into the repo
£0: additional per-seat licence cost when the tools run on API billing - usage is charged by token, not by seat, so spend scales with work, not headcount

Most engineering teams of two to ten people are still using AI the same way they were in 2023: a chat window open in a separate tab, a copy-paste workflow between IDE and model, occasional autocomplete suggestions in the editor. It is useful. It is also a small fraction of what the current generation of tools can do, and the gap is widening every quarter.

Claude Code and OpenAI Codex are not faster autocomplete. They are agentic coding tools: command-line agents that read your repository, plan a change, edit files, run tests, and iterate on failures - all inside a session that knows your codebase. For a small team, the practical effect is that one engineer can ship the kind of structured change that used to require a small pull request train: a refactor across thirty files, a test backfill for an under-tested module, a documentation pass that actually matches the code.

The gap between "I tried it for an afternoon and it was fine" and "we ship measurably more with the same headcount" is not the model. It is the workflow around the model. This piece is an honest account of what these tools are, where they earn their keep for small engineering teams, where they fall over, and how to roll them out without trashing your code review culture.

What "agentic" actually means

An agentic coding tool is one that can take a goal - "add pagination to the customer list endpoint and update the tests" - and execute the loop on its own: read the relevant files, draft the change, run the test suite, read the failures, fix them, and present a diff for review. It is not generating a single completion and handing it back to you. It is taking turns with a set of tools (file read, file edit, shell, test runner, git) until the goal is met or the budget runs out.
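The loop described above can be sketched in a few lines. This is a minimal illustration, not how Claude Code or Codex are implemented: `run_agent_loop`, `LoopResult`, and the two stand-in callables are hypothetical names, and the "repo" is a toy dictionary so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    succeeded: bool
    turns_used: int
    log: list[str] = field(default_factory=list)


def run_agent_loop(
    draft_change: Callable[[str, str], None],
    run_tests: Callable[[], tuple[bool, str]],
    goal: str,
    max_turns: int = 5,
) -> LoopResult:
    """Draft, test, and retry until the tests pass or the turn budget runs out."""
    log: list[str] = []
    feedback = ""  # test output from the previous turn; empty on the first turn
    for turn in range(1, max_turns + 1):
        draft_change(goal, feedback)   # edit files using the goal and the last failure
        passed, output = run_tests()   # run the suite and capture its output
        log.append(f"turn {turn}: {'pass' if passed else 'fail'}")
        if passed:
            return LoopResult(True, turn, log)  # goal met: hand the diff to a human
        feedback = output              # feed the failure into the next draft
    return LoopResult(False, max_turns, log)    # budget exhausted: stop and report


# Toy stand-ins (hypothetical): the tests pass once a second draft is applied.
repo = {"drafts": 0}


def fake_draft(goal: str, feedback: str) -> None:
    repo["drafts"] += 1


def fake_tests() -> tuple[bool, str]:
    ok = repo["drafts"] >= 2
    return ok, "" if ok else "AssertionError: expected 20 items per page"


result = run_agent_loop(
    fake_draft, fake_tests, "add pagination to the customer list endpoint"
)
print(result.succeeded, result.turns_used, result.log)
```

The important design point is the explicit turn budget: the loop ends either with passing tests or with a report that it ran out of turns, never with an open-ended session.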

This matters for two reasons. First, a lot of engineering time is spent in the loop itself - waiting on tests, reading errors, making small follow-up edits. The agent absorbs that loop. Second, an agent that can read the whole repo before it writes anything makes far fewer of the contextless mistakes that early AI coding tools were famous for: invented function names, wrong import paths, confident misreadings of an interface. The agent looks first.

The two tools worth most teams' attention right now are Claude Code (Anthropic's CLI, designed to be embedded in a developer's terminal and tied into hooks, custom agents, and MCP servers) and OpenAI Codex (the current cloud and CLI agent from OpenAI, optimised for long-running tasks and parallel work). The exact feature surface shifts quarterly; what is stable is the shape: a terminal-first agent that operates on your repo with your authorisation.

Where Claude Code earns its keep

Claude Code is at its strongest for work that benefits from sitting close to the developer and being shaped by the team's conventions. The CLI runs in your terminal, picks up a CLAUDE.md file in the repository root for project-specific instructions, and supports hooks that let you wire in pre- and post-tool checks. For a small team this is the right shape: the tool inherits your standards instead of fighting them.

Strong fits for Claude Code in our experience:

  • Repository-wide refactors: rename a concept across the codebase, migrate a deprecated API, restructure module boundaries. The agent reads the full graph before it changes anything.
  • Test backfill: point it at an under-tested module with the existing test conventions and ask for coverage. Output quality tracks the quality of your existing tests, which is itself a forcing function.
  • Bug investigation: give it a stack trace or a reproduction and let it crawl the repo. It is faster than a human at the boring parts of bisecting and usually narrows the search to the right two or three files.
  • Documentation that matches the code: README sections, runbooks, ADRs. Because it reads the code first, the output starts out in sync with the code, and regenerating it after a change is cheap enough that it stays that way.
  • Internal tooling and one-off scripts: the kind of work small teams under-invest in because the activation energy is too high relative to the payoff.

Where Codex earns its keep

OpenAI Codex shines on a different axis. The cloud agent is built for longer-running, more parallelisable work - you can hand it a task, leave it running, and come back to a proposed change. For teams that have well-described tickets and a healthy test suite, this is a real shift: the agent does the first pass while the engineer is in a meeting or on a different task. The CLI variant covers the same in-terminal use case as Claude Code.

Strong fits for Codex:

  • Long-running tasks with clear acceptance criteria: a feature ticket with a written spec and existing tests. The agent can grind on it for an hour without supervision.
  • Parallel investigation: running three variants of a refactor in parallel and comparing diffs, instead of doing them one by one.
  • Code review assistance on incoming PRs: a structured pass over a diff against the repo conventions before a human looks.
  • Cross-language work: repos that span TypeScript, Python, and infrastructure-as-code where the cost of human context-switching is high.

Most teams that use one end up using the other for different work. They are not direct substitutes; they are different shapes of the same primitive.

The trap: treating it like autocomplete

From autocomplete to agent

Autocomplete mindset
  • Engineer writes the change themselves, accepting inline suggestions as they type
  • Tests are run by hand; failures are read by the human and fixed manually
  • Context is pasted into a chat window; the model never sees the rest of the repo
  • Review workflow unchanged - each small PR is reviewed line by line
  • Gain is real but small: 10-20% faster typing, no shift in what is shippable
Agentic mindset
  • Engineer writes a brief: goal, constraints, files to touch, acceptance criteria
  • Agent reads the repo, drafts the change, runs the tests, fixes failures, presents a diff
  • Engineer reviews the diff as they would a junior colleague's PR - reading for intent, not for typos
  • Review workflow shifts: more time on design and acceptance, less time on syntax
  • Gain is structural: a single engineer can take on changes that would have needed a small team
Realised gain: 2-4x PR throughput on small repos with strong tests; near zero gain on repos without them

Honest version: if your team's instinct is to count seats, you are buying the wrong thing. On API billing, Claude Code and Codex charge by token usage rather than by seat (both vendors also sell subscription plans that bundle usage). A two-engineer team running them hard can spend less per month than a single GitHub Copilot enterprise seat - or more, if the work is heavy. Spend tracks usefulness, not headcount. That is the point.
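To make the "spend tracks work, not headcount" point concrete, here is a small arithmetic sketch. The per-million-token rates and the token volumes are hypothetical placeholders chosen for easy arithmetic, not published prices or measured usage; substitute the figures from your own provider's price list and usage dashboard.

```python
def monthly_token_spend(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Cost of one engineer's monthly usage under per-token billing."""
    return (
        (input_tokens / 1_000_000) * input_rate_per_million
        + (output_tokens / 1_000_000) * output_rate_per_million
    )


# HYPOTHETICAL rates in an arbitrary currency unit - not real vendor prices.
INPUT_RATE = 3.0
OUTPUT_RATE = 15.0
ENGINEERS = 2

# HYPOTHETICAL usage profiles for the same two-person team.
light_month = ENGINEERS * monthly_token_spend(2_000_000, 500_000, INPUT_RATE, OUTPUT_RATE)
heavy_month = ENGINEERS * monthly_token_spend(40_000_000, 4_000_000, INPUT_RATE, OUTPUT_RATE)

print(f"light month: {light_month:.2f}")
print(f"heavy month: {heavy_month:.2f}")
```

The headcount is identical in both cases; only the volume of work differs, and the bill moves with it.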

The most common failure mode we see is teams adopting an agentic tool and then continuing to use it as a glorified autocomplete. The engineer sits at the keyboard, asks for a function, accepts the suggestion, runs the test by hand. All the real gain is left on the table. The shift that matters is letting the agent take the whole loop - including the parts that feel uncomfortable to delegate, like running the test suite or rewriting a file you would have written yourself.
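The brief in the agentic-mindset column (goal, constraints, files to touch, acceptance criteria) is easy to standardise. The sketch below is one possible shape, assuming a hypothetical `TaskBrief` class and made-up file paths; neither tool requires this format, and the rendered text is simply something you could paste into a session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskBrief:
    goal: str
    constraints: list[str]
    files_to_touch: list[str]
    acceptance_criteria: list[str]

    def render(self) -> str:
        """Render the brief as plain text, one titled section per field."""
        sections = [
            ("Goal", [self.goal]),
            ("Constraints", self.constraints),
            ("Files to touch", self.files_to_touch),
            ("Acceptance criteria", self.acceptance_criteria),
        ]
        lines: list[str] = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Hypothetical example: the paths and limits below are illustrative only.
brief = TaskBrief(
    goal="Add pagination to the customer list endpoint and update the tests",
    constraints=["Do not change the response schema for existing clients"],
    files_to_touch=["api/customers.py", "tests/test_customers.py"],
    acceptance_criteria=["Existing tests pass", "New tests cover page boundaries"],
)
print(brief.render())
```

Writing the acceptance criteria down before the agent starts is what lets the reviewer read the diff for intent rather than for typos.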

Adoption without the chaos

The reasonable fear from any engineering leader is that an agentic tool, given the keys to the repo, will produce changes that look fine on the surface and are subtly wrong underneath - committed to main at three in the morning by an over-eager teammate, discovered a week later. The fear is legitimate. The mitigation is not "ban the tool"; it is to put the same kind of guardrails around it that you already have around any engineer with commit access.

The minimum viable adoption checklist for a small team:

  • Branch protection on main: no direct pushes, required reviews on every PR. Agent-authored PRs go through the same gate as human-authored ones. This is non-negotiable and is usually already in place.
  • An explicit human approval gate: the agent proposes a diff, and a human reads and approves it before it merges. The agent is a colleague, not a deploy pipeline.
  • Allowed-tool lists: configure the agent so it can read files, write to a branch, and run tests - but cannot, for example, push to main, delete branches, or execute arbitrary shell commands without confirmation. Both Claude Code and Codex support some form of this through their permission and approval settings; check the current documentation for the exact options.
  • Secret scanning and pre-commit hooks: the same hooks that protect against a human accidentally committing a secret protect against an agent doing it. If you do not have these, fix that first.
  • Cost ceilings and observability: a monthly token budget per repo, alerts on unusual spend, and a dashboard the team actually looks at. Agentic spend is unpredictable in the first month - that is normal, but it needs to be visible.
  • A written agent policy: a short document in the repo (CLAUDE.md for Claude Code, AGENTS.md for Codex, or equivalent) describing what work is in scope for the agent, what is not, and what conventions to follow. Treat it as part of onboarding.
  • A monthly review: for the first quarter, look back at the merged PRs and ask which were agent-authored, where they helped, and where they did not. Adjust scope based on evidence, not vibes.
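The allowed-tool item in the checklist comes down to a three-way decision per command: allow, ask a human, or deny. The sketch below illustrates that logic only; it is not the configuration format of Claude Code or Codex, and the command lists and the `decide` function are hypothetical. Anything containing shell metacharacters is sent to a human instead of being parsed.

```python
import shlex
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


# Hypothetical policy: tune these lists to your own repo.
ALLOWED = [("pytest",), ("git", "status"), ("git", "diff")]
DENIED = [("git", "push", "origin", "main"), ("git", "branch", "-D")]
SHELL_METACHARACTERS = set(";|&<>`$()")


def decide(command: str) -> Decision:
    """Classify a shell command the agent wants to run."""
    # Chained or substituted commands are never auto-approved.
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return Decision.ASK
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:  # e.g. unbalanced quotes
        return Decision.ASK
    if any(tokens[: len(rule)] == rule for rule in DENIED):
        return Decision.DENY
    if any(tokens[: len(rule)] == rule for rule in ALLOWED):
        return Decision.ALLOW
    return Decision.ASK  # default: a human confirms anything unrecognised


for cmd in ["pytest -q tests/", "git push origin main", "rm -rf build"]:
    print(cmd, "->", decide(cmd).value)
```

Defaulting to "ask" rather than "allow" keeps the list short: you only enumerate what is clearly safe and what is clearly forbidden.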

None of this is unusual. It is the same hygiene that mature teams already practise for human contributors. The work of adoption is mostly the work of being deliberate about a tool you would otherwise let in through the back door, one engineer at a time, with no shared standards.
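The cost-ceiling item from the checklist needs very little machinery to start with. A minimal sketch, assuming a hypothetical `RepoBudget` class, an 80% alert threshold, and illustrative amounts; in practice the spend figures would come from your provider's usage reporting.

```python
from dataclasses import dataclass


@dataclass
class RepoBudget:
    repo: str
    monthly_ceiling: float
    alert_fraction: float = 0.8  # assumption: warn at 80% of the ceiling
    spent: float = 0.0

    def record(self, cost: float) -> str:
        """Add a session's cost and return the resulting budget status."""
        self.spent += cost
        if self.spent >= self.monthly_ceiling:
            return "block"  # ceiling reached: pause agent sessions for this repo
        if self.spent >= self.alert_fraction * self.monthly_ceiling:
            return "alert"  # unusual spend: notify the team
        return "ok"


# Hypothetical repo name and amounts.
budget = RepoBudget(repo="billing-service", monthly_ceiling=200.0)
statuses = [budget.record(100.0), budget.record(70.0), budget.record(40.0)]
print(statuses)
```

A per-repo ceiling, rather than a per-engineer one, keeps the conversation about which work is worth the spend instead of who is spending it.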

What this looks like for a small team

For a five-person engineering team in Hamilton or East Kilbride, a realistic rollout looks like this. Week one: one engineer pilots Claude Code on a single repository with the guardrails above; the team agrees on what work is in scope. Week two: the pilot engineer demos two or three merged changes to the rest of the team and writes the first version of the repo's agent policy. Weeks three and four: the rest of the team starts using the agent for their own work, with a weekly retrospective on what is working. By the end of the month, the workflow is normalised and the team has measurable evidence of the uplift - or, sometimes, evidence that a particular repo is a bad fit, which is also valuable.
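"Measurable evidence" can be as simple as comparing merged PRs per week before and during the pilot. The sketch below assumes a hypothetical `throughput_uplift` helper, and the weekly counts are invented purely to show the calculation; they are not results from any team.

```python
from statistics import mean


def throughput_uplift(baseline_weekly: list[int], pilot_weekly: list[int]) -> float:
    """Ratio of average merged PRs per week during the pilot to the baseline."""
    if not baseline_weekly or not pilot_weekly:
        raise ValueError("need at least one week of data on each side")
    baseline_avg = mean(baseline_weekly)
    if baseline_avg == 0:
        raise ValueError("baseline average is zero; a ratio is undefined")
    return mean(pilot_weekly) / baseline_avg


# ILLUSTRATIVE numbers only - replace with counts from your own repo history.
baseline = [6, 5, 7, 6]     # merged PRs per week in the month before the pilot
pilot = [11, 13, 12, 12]    # merged PRs per week during the pilot month
uplift = throughput_uplift(baseline, pilot)
print(f"uplift: {uplift:.1f}x")
```

PR count is a crude measure on its own, so it is worth pairing with the retrospective notes on which changes actually helped.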

The teams that get the most out of this are not the ones with the largest budgets. They are the ones with the cleanest tests, the clearest review culture, and the willingness to redesign their workflow rather than bolt new tooling on top of old habits. If that sounds like your team, the upside is large and the activation cost is low.

Common questions

Will the agent see proprietary code? Yes - that is how it works. Both Anthropic and OpenAI offer commercial and enterprise terms under which API content is not used for training; retention windows and data residency options vary by plan, so check the current terms. For sensitive repositories, make sure the agent is running under those terms before you connect it. For genuinely high-sensitivity work (regulated data, customer PII), scope the agent to repositories that do not contain it.

What about hallucinated code? Less of a problem than it was, but not zero. The mitigation is structural: tests run before the agent claims a task is done, code review before the diff lands. If your test suite is weak, the agent will surface that fact quickly. Some teams find the forcing function valuable.

Should we still use GitHub Copilot? They solve different problems. Copilot lives in the editor and accelerates the work an engineer is already doing. Claude Code and Codex take over whole tasks. Many teams run both. The question is not "which one" - it is whether the team has redesigned its workflow to get the agentic gain, or is still using both as autocomplete.

Do we need to be on a particular stack? No. Both tools are language-agnostic and can work on any repo that a competent engineer could work on. The repos that gain the most are the ones with good tests and a clear structure, regardless of language.

Getting started

The fastest useful step is to pick one engineer, one repository, and one well-scoped change - a refactor or a test backfill is ideal - and run Claude Code or Codex end to end on it for an afternoon. The point is not to ship the change; the point is to see the workflow honestly. After one afternoon you will know whether the bottleneck on adoption is the tool, the repo, or the team's review culture.

For teams that want help doing this properly - the training, the guardrails, the agent policy, the cost ceilings, and the follow-up review of measurable uplift - our Engineering productivity service is designed for exactly that engagement. We sit alongside your team for the first month, codify the workflow in your repo, and leave you with a working agentic setup that compounds instead of evaporating. Get in touch and we will start with one repository.