Agentic Ops
Self-Healing CI/CD with AI Agents
Part of a bigger experiment: how much of running a company can you loop through AI agents? This piece is about the CI/CD pipeline. When a build breaks, a coding agent reads the logs, writes a fix, and opens a PR. No human required.
The experiment
The bigger picture
I've been running Far Horizons as a one-person company for five years. The question I keep coming back to: what if I could hand off the boring operational loops to AI agents? Not the interesting decisions, but the stuff that breaks at 2am and needs someone to read logs and push a fix.
This is one piece of that: the self-healing pipeline. Most CI/CD pipelines end at "send a Slack notification." From there it's manual: you read the alert, open the logs, figure out the issue, write a fix, push it, and wait for CI again. That loop takes anywhere from minutes to hours, depending on how deep you are in something else when the notification comes in.
So I wired up a webhook that fires when post-deploy checks fail. That webhook triggers Claude Code, which has access to the GitHub Actions logs and the repo. It diagnoses the issue, writes a fix, and opens a PR with auto-merge enabled. When CI passes, the PR merges. The loop closes itself.
It's not foolproof. It only works because I own the full stack and there's nobody else merging code at the same time. But when it works, it's pretty wild. Push, break, fix, merge, all while I'm making coffee.
How it works
1. Push triggers CI
GitHub Actions builds, tests, and deploys to Cloudflare Workers.
2. Post-deploy checks run
Health checks and Playwright E2E tests validate the deployment.
3. Failures fire a webhook
Any check failure sends the run URL to a coding agent.
4. Agent diagnoses and fixes
Claude Code reads the logs, identifies the issue, writes a fix, and opens a PR with auto-merge enabled.
5. CI re-triggers
The merged PR starts the loop again. Build, deploy, validate.
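Steps 2 and 3 above can be sketched roughly like this. It's a minimal illustration, not the actual pipeline code: the `CheckResult` type, the check names, and the payload shape are all assumptions. The only thing taken from the description is that a failure sends the run URL onward.

```python
import json
from dataclasses import dataclass


@dataclass
class CheckResult:
    """One post-deploy check (hypothetical shape)."""
    name: str        # e.g. "health:/" or "e2e:login"
    passed: bool
    detail: str = ""


def failed_checks(results):
    """Return only the checks that did not pass."""
    return [r for r in results if not r.passed]


def build_failure_payload(run_url, results):
    """Build the webhook body for a failed run, or None if every check passed.

    The JSON layout is an assumption made for this sketch.
    """
    failures = failed_checks(results)
    if not failures:
        return None
    return json.dumps({
        "run_url": run_url,
        "failed": [{"name": f.name, "detail": f.detail} for f in failures],
    })
```

Returning `None` on success keeps the "no failure, no webhook" rule in one place, so the caller only has to check whether there is a body to send.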
Architecture
The pipeline
[Diagram: Self-Healing Pipeline. ci.yml · on: push · detect → diagnose → fix → deploy. Dashed lines show failure paths and planned integrations.]
Status
What's working today
Push to main triggers GitHub Actions: build, test, deploy to Cloudflare Workers. Migrations, media, and frontends all go out in one pipeline.
Health checks and Playwright E2E tests run after every deploy. If any health check returns a non-200 status or any test fails, a webhook fires.
The failure webhook triggers a coding agent (Claude Code) that reads the CI logs, diagnoses the issue, writes a fix, and opens a PR with auto-merge enabled.
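On the receiving side, the handoff to the agent could look something like the sketch below. Everything here is an assumption for illustration: the `run_url` payload key, the prompt wording, and `run_agent`, a hypothetical callable that stands in for however the coding agent actually gets started. The URL pattern follows the usual `github.com/<owner>/<repo>/actions/runs/<id>` form.

```python
import json
import re

# Matches https://github.com/<owner>/<repo>/actions/runs/<id>
RUN_URL_RE = re.compile(
    r"^https://github\.com/([\w.-]+)/([\w.-]+)/actions/runs/(\d+)"
)


def parse_run_url(run_url):
    """Split a GitHub Actions run URL into (owner, repo, run_id)."""
    match = RUN_URL_RE.match(run_url)
    if match is None:
        raise ValueError(f"not a GitHub Actions run URL: {run_url!r}")
    owner, repo, run_id = match.groups()
    return owner, repo, int(run_id)


def build_agent_prompt(run_url):
    """Compose the instruction handed to the agent (wording is illustrative)."""
    owner, repo, run_id = parse_run_url(run_url)
    return (
        f"CI run {run_id} in {owner}/{repo} failed: {run_url}\n"
        "Read the failing job logs, identify the root cause, make the "
        "smallest fix, and open a PR with auto-merge enabled."
    )


def handle_failure_webhook(body, run_agent):
    """Handle one webhook delivery.

    `run_agent` is an injected, hypothetical callable that starts the
    coding agent with the given prompt and returns its result.
    """
    payload = json.loads(body)
    prompt = build_agent_prompt(payload["run_url"])
    return run_agent(prompt)
```

Validating the URL before building the prompt means a malformed or spoofed delivery fails fast instead of reaching the agent.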
Where it falls over
Current limitations
- Auto-merge only works if nothing else merges first. The agent doesn't rebase yet.
- Only viable when you own the full stack. No shared repos, no external dependencies.
- Agent fixes are limited to what it can diagnose from logs. No runtime debugging yet.
- Human review is still recommended for non-trivial changes.
- Cost per agent invocation is non-zero. Needs monitoring at scale.
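One way to act on the "human review for non-trivial changes" point is a size gate in front of auto-merge. The sketch below is hypothetical: the `ProposedFix` fields and the thresholds are made-up defaults for illustration, not values from this pipeline.

```python
from dataclasses import dataclass


@dataclass
class ProposedFix:
    """Summary of an agent-authored change (hypothetical shape)."""
    files_changed: int
    lines_changed: int
    touches_migrations: bool = False


def merge_policy(fix, max_files=3, max_lines=40):
    """Return "auto-merge" for small fixes and "needs-review" otherwise.

    Thresholds are illustrative; tune them to your own risk tolerance.
    """
    # Schema changes are hard to undo, so they always go to a human.
    if fix.touches_migrations:
        return "needs-review"
    if fix.files_changed > max_files or fix.lines_changed > max_lines:
        return "needs-review"
    return "auto-merge"
```

A one-line typo fix would sail through, while anything touching migrations or sprawling across many files waits for a person.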
What's next
Auto-rebase
Agent rebases its branch when a merge conflict blocks auto-merge
Runtime error integration
Sentry-style alerts trigger the same healing loop, not just CI failures
Cost dashboard
Track agent invocations, token usage, and fix success rate
Multi-repo orchestration
Coordinate fixes across frontend and backend repos
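For the cost dashboard item, the three numbers named above (invocations, token usage, fix success rate) reduce to a small aggregation. This is a sketch under assumptions: the `Invocation` record and the per-million-token prices are hypothetical inputs, and the prices default to zero because no real figures are given here.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    """One agent run (hypothetical record shape)."""
    run_url: str
    input_tokens: int
    output_tokens: int
    fixed: bool  # True if the agent's PR merged and the next run passed


def summarize(invocations, usd_per_million_input=0.0, usd_per_million_output=0.0):
    """Aggregate invocation count, token usage, success rate, and cost.

    Prices are caller-supplied; the zero defaults are placeholders.
    """
    count = len(invocations)
    tokens_in = sum(i.input_tokens for i in invocations)
    tokens_out = sum(i.output_tokens for i in invocations)
    fixes = sum(1 for i in invocations if i.fixed)
    cost = (
        tokens_in * usd_per_million_input
        + tokens_out * usd_per_million_output
    ) / 1_000_000
    return {
        "invocations": count,
        "total_tokens": tokens_in + tokens_out,
        # Avoid dividing by zero when nothing has run yet.
        "fix_success_rate": fixes / count if count else 0.0,
        "estimated_cost_usd": cost,
    }
```

Tracking success rate next to cost matters because a cheap agent that rarely fixes anything is still a bad deal.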
Interested?
This feeds into client work
The same approach works for error monitoring, content pipelines, and automated QA. If you're curious about what agents can do for your ops, let's talk.