
Testing AI-Written Code

FoxFit has 356 tests, all passing. The AI writes most of them, but we review each one.

Testing is where AI-assisted development either earns trust or loses it. If you can’t verify the code works, it doesn’t matter how fast you produced it. This post explains our testing strategy and the session cleanup workflow that keeps quality high.

Coverage

The numbers:

  • Repository.swift: 100% coverage
  • Business logic overall: 80%+
  • App overall: lower, because SwiftUI views are hard to unit test

We focused coverage on code that handles user data. The repository layer where workouts get saved, loaded, and synced is fully tested. If a bug corrupts someone’s workout history, that’s unacceptable.
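
To give a flavour of what that looks like, here is a minimal sketch of the round-trip tests we aim for at the repository layer. The Workout model and InMemoryWorkoutRepository below are hypothetical stand-ins for illustration, not FoxFit’s actual Repository.swift API.

    import Foundation
    import XCTest

    // Hypothetical stand-ins for illustration; the real repository differs.
    struct Workout: Codable, Equatable {
        let id: UUID
        let name: String
        let durationMinutes: Int
    }

    final class InMemoryWorkoutRepository {
        private var storage: [UUID: Workout] = [:]

        func save(_ workout: Workout) {
            storage[workout.id] = workout
        }

        func load(id: UUID) -> Workout? {
            storage[id]
        }
    }

    final class WorkoutRepositoryTests: XCTestCase {
        func testSaveThenLoadReturnsTheSameWorkout() {
            let repository = InMemoryWorkoutRepository()
            let workout = Workout(id: UUID(), name: "Morning run", durationMinutes: 30)

            repository.save(workout)

            // The behaviour that matters: what goes in comes back unchanged.
            XCTAssertEqual(repository.load(id: workout.id), workout)
        }

        func testLoadingAnUnknownIDReturnsNil() {
            XCTAssertNil(InMemoryWorkoutRepository().load(id: UUID()))
        }
    }

Round-trip tests like these are cheap to write and catch the class of bug we care most about: data that silently changes between save and load.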

AI-Generated Tests

We give Claude Code the implementation and ask it to generate tests. The output is usually good, but “usually” is doing a lot of work in that sentence, which is why we review every test it writes.

We watch for a few patterns:

  • Tests that pass without actually verifying the behaviour we care about. Green doesn’t always mean correct. At one point we had hundreds more tests than we needed, and it was killing our cycle time, not to mention costing precious tokens.
  • Edge cases the AI skipped: empty inputs, nil values, boundary conditions. The happy path gets covered well; the edges less so.
  • Tests that check implementation rather than behaviour. We want behaviour, so the tests survive a refactor. This one took some dialling in (see the sketch after this list).

We reject tests that don’t make sense. The AI proposes; we dispose.
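
To make the second and third points concrete, here is a hedged sketch using a hypothetical duration formatter rather than real FoxFit code. The first test asserts on observable output, so it survives a refactor of the internals; the second covers the zero and exact-hour values the happy path misses.

    import XCTest

    // Hypothetical example for illustration, not FoxFit code.
    struct DurationFormatter {
        func string(fromMinutes minutes: Int) -> String {
            let hours = minutes / 60
            let remainder = minutes % 60
            return hours > 0 ? "\(hours)h \(remainder)m" : "\(remainder)m"
        }
    }

    final class DurationFormatterTests: XCTestCase {
        // Behaviour: assert on the output the user sees, not on how the
        // string happens to be assembled internally.
        func testFormatsNinetyMinutesAsHoursAndMinutes() {
            XCTAssertEqual(DurationFormatter().string(fromMinutes: 90), "1h 30m")
        }

        // Edge cases the happy path misses: zero and the exact-hour boundary.
        func testFormatsZeroAndBoundaryValues() {
            let formatter = DurationFormatter()
            XCTAssertEqual(formatter.string(fromMinutes: 0), "0m")
            XCTAssertEqual(formatter.string(fromMinutes: 60), "1h 0m")
        }
    }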

Session Cleanup

At the end of each coding session, we run a cleanup prompt. It uses three roles:

Role             Job
code-simplifier  Remove duplication, reduce complexity
debugger         Find edge cases and potential bugs
test-automator   Run tests, fix failures, add missing coverage

This catches issues while the context is fresh. The AI just wrote the code; it still has the full context of what it was trying to achieve. Asking it to review immediately produces better results than reviewing days later in a new session.
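
The prompt itself is plain language. Something along these lines captures the shape of it, though not the exact wording, with the role names matching the table above:

    Review the code written this session. Use the code-simplifier role to
    remove duplication and reduce complexity without changing behaviour. Then
    use the debugger role to look for edge cases and potential bugs in the
    files we touched. Finally, use the test-automator role to run the full
    test suite, fix any failures, and add tests for anything not covered.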

Test Maintenance

Tests break when code changes. The AI is actually better at fixing broken tests than writing them from scratch: it can see the old test and the new code side by side, and work out from the two what needs updating.
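
A trivial illustration of the pattern, continuing the hypothetical repository sketch from the coverage section: suppose Workout gains a notes field during a refactor, so the old test stops compiling. The fix is a one-line change to the construction, while the assertion that encodes the behaviour stays exactly as it was.

    // Before: the original construction no longer compiles once Workout
    // gains a `notes` field.
    //     let workout = Workout(id: UUID(), name: "Morning run", durationMinutes: 30)
    //
    // After: update the construction, keep the assertion unchanged.
    func testSaveThenLoadReturnsTheSameWorkout() {
        let repository = InMemoryWorkoutRepository()
        let workout = Workout(id: UUID(), name: "Morning run",
                              durationMinutes: 30, notes: nil)

        repository.save(workout)

        // The behaviour under test has not changed, so neither does this line.
        XCTAssertEqual(repository.load(id: workout.id), workout)
    }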

We run tests after every change. If something breaks, we fix it before moving on. There is no broken-test backlog, and that discipline is what makes the pipeline sustainable. If you let broken tests accumulate, you’ve lost the safety net entirely.

Most of what’s written about AI-assisted development is about how fast you can ship. Almost none of it is about how to know whether what you shipped actually works. That’s the harder problem, and it’s the one we spend most of our time on.