Agentic Ops
Self-Healing CI/CD with AI Agents
Part of a bigger experiment in looping as much of running a company as possible through AI agents. This piece covers the CI/CD pipeline.
The experiment
Context
Far Horizons has been a one-person company since 2019 (in spirit since late 2018). I'm increasingly trying to loop as much of the operational work as possible through AI agents. CI/CD is the first piece.
Most pipelines end at "send a Slack notification." Then a human reads the alert, opens the logs, figures out the issue, writes a fix, pushes it, waits for CI.
I wired up a webhook that fires when post-deploy checks fail. That webhook triggers Claude Code, which has access to the GitHub Actions logs and the repo. It diagnoses the issue, writes a fix, and opens a PR with auto-merge. When CI passes, it merges.
It only works because I own the full stack and there's nobody else merging code at the same time. The next pieces are runtime alerts and auto-rebase.
How it works
1. Push triggers CI
GitHub Actions builds, tests, and deploys to Cloudflare Workers.
2. Post-deploy checks run
Health checks and Playwright E2E tests validate the deployment.
3. Failures fire a webhook
Any check failure sends the run URL to a coding agent.
4. Agent diagnoses and fixes
Claude Code reads the logs, identifies the issue, writes a fix, and opens a PR with auto-merge.
5. CI re-triggers
The merged PR starts the loop again. Build, deploy, validate.
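Step 3 hands the agent nothing but a run URL, so the receiving side has to turn that URL into something it can fetch logs with. A minimal sketch of that step, assuming the standard GitHub Actions run URL shape and the GitHub CLI's `gh run view --log-failed`; the names `RunRef`, `parse_run_url`, and `failed_log_command` are illustrative and not taken from the actual pipeline:

```python
import re
from typing import List, NamedTuple

# GitHub Actions run URLs have the shape:
#   https://github.com/<owner>/<repo>/actions/runs/<run_id>
RUN_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/actions/runs/(?P<run_id>\d+)"
)


class RunRef(NamedTuple):
    """Hypothetical container for the pieces of a run URL."""
    owner: str
    repo: str
    run_id: int


def parse_run_url(url: str) -> RunRef:
    """Split a run URL into owner, repo, and numeric run id."""
    m = RUN_URL.match(url)
    if m is None:
        raise ValueError(f"not a GitHub Actions run URL: {url!r}")
    return RunRef(m["owner"], m["repo"], int(m["run_id"]))


def failed_log_command(ref: RunRef) -> List[str]:
    """Build the gh command that prints logs for the failed steps only."""
    return [
        "gh", "run", "view", str(ref.run_id),
        "--repo", f"{ref.owner}/{ref.repo}",
        "--log-failed",
    ]
```

Returning the command as a list rather than running it keeps the sketch testable without network access; a wrapper could pass it to `subprocess.run`.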
Architecture
The pipeline
[Diagram: Self-Healing Pipeline. ci.yml · on: push · detect → diagnose → fix → deploy. Dashed lines show failure paths and planned integrations.]
Status
What's working today
Push to main triggers GitHub Actions: build, test, deploy to Cloudflare Workers. Migrations, media, and frontends all go out in one pipeline.
Health checks and Playwright E2E tests run after every deploy. If anything returns a non-200 or a test fails, a webhook fires.
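The "non-200 or failing test fires a webhook" rule can be sketched as a pure function that collects check results and returns a payload only when something failed. This is an illustration under assumptions: the payload fields (`run_url`, `failures`) and the names `CheckResult`, `http_check`, and `failure_payload` are hypothetical, and sending the payload is left to the caller.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CheckResult:
    name: str         # a health-check URL or an E2E test title
    ok: bool          # True for HTTP 200 or a passing test
    detail: str = ""  # status code or failure message


def http_check(name: str, status: int) -> CheckResult:
    """Treat anything other than HTTP 200 as a failure."""
    return CheckResult(name, status == 200, f"HTTP {status}")


def failure_payload(run_url: str, results: List[CheckResult]) -> Optional[str]:
    """Return a JSON webhook body if any check failed, else None."""
    failures = [r for r in results if not r.ok]
    if not failures:
        return None  # healthy deploy: no webhook
    return json.dumps({
        "run_url": run_url,
        "failures": [{"name": f.name, "detail": f.detail} for f in failures],
    })
```

A caller would POST the returned string to the agent's webhook endpoint and do nothing when the result is `None`.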
The failure webhook triggers Claude Code, which has access to the GitHub Actions logs and the repo. It reads the logs, writes a fix, and opens a PR with auto-merge enabled.
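What the webhook handler passes to the agent might look roughly like the sketch below. It assumes Claude Code is started non-interactively with `claude -p` and that the PR is opened with the GitHub CLI (`gh pr create`, then `gh pr merge --auto`). The prompt wording, branch name, and function names are hypothetical, and permission settings for the agent are left out. Whether the agent runs the PR commands itself or a wrapper does is not stated here; the sketch only builds the commands.

```python
from typing import List


def build_prompt(run_url: str, failed_logs: str, max_log_chars: int = 20_000) -> str:
    """Assemble a hypothetical prompt from the run URL and the log tail."""
    # Keep only the tail of the logs: the failing step is usually near the end.
    tail = failed_logs[-max_log_chars:]
    return (
        f"Post-deploy checks failed in {run_url}.\n"
        "Diagnose the failure from the logs below, write a minimal fix on a new "
        "branch, and open a pull request with auto-merge enabled.\n\n"
        f"--- failed step logs ---\n{tail}"
    )


def agent_command(prompt: str) -> List[str]:
    """Command that runs Claude Code once with the given prompt (print mode)."""
    return ["claude", "-p", prompt]


def pr_commands(branch: str, title: str) -> List[List[str]]:
    """Commands that push the fix branch, open a PR, and enable auto-merge."""
    return [
        ["git", "push", "--set-upstream", "origin", branch],
        ["gh", "pr", "create", "--head", branch, "--title", title,
         "--body", "Automated fix for a failed post-deploy check."],
        ["gh", "pr", "merge", branch, "--auto", "--squash"],
    ]
```

With auto-merge enabled, GitHub merges the PR only once the required checks pass, which is what closes the loop back to CI.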
Limitations
What doesn't work yet
- Auto-merge only works if nothing else merges first. The agent doesn't rebase yet.
- Only viable when you own the full stack. No shared repos, no external dependencies.
- Agent fixes are limited to what it can diagnose from logs. No runtime debugging yet.
- Human review is still recommended for non-trivial changes.
- Cost per agent invocation is non-zero. Needs monitoring at scale.
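Monitoring the cost of agent invocations mostly means recording a few numbers per run. A minimal sketch of such a ledger, assuming token counts are available after each invocation; the class names are hypothetical, and the per-million-token prices are supplied by the caller, not real rates:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Invocation:
    input_tokens: int
    output_tokens: int
    fixed: bool  # True if the agent's PR merged and the next run passed


@dataclass
class AgentLedger:
    invocations: List[Invocation] = field(default_factory=list)

    def record(self, input_tokens: int, output_tokens: int, fixed: bool) -> None:
        self.invocations.append(Invocation(input_tokens, output_tokens, fixed))

    def total_tokens(self) -> int:
        return sum(i.input_tokens + i.output_tokens for i in self.invocations)

    def success_rate(self) -> float:
        """Fraction of invocations that ended in a working fix (0.0 if none)."""
        if not self.invocations:
            return 0.0
        return sum(i.fixed for i in self.invocations) / len(self.invocations)

    def cost(self, usd_per_million_input: float, usd_per_million_output: float) -> float:
        """Total spend given caller-supplied prices per million tokens."""
        tokens_in = sum(i.input_tokens for i in self.invocations)
        tokens_out = sum(i.output_tokens for i in self.invocations)
        return (tokens_in / 1_000_000 * usd_per_million_input
                + tokens_out / 1_000_000 * usd_per_million_output)
```

Keeping prices as arguments means the same ledger can be re-costed when rates change.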
What's next
Auto-rebase
Agent rebases its branch if CI fails due to merge conflicts
Runtime error integration
Sentry-style alerts trigger the same healing loop, not just CI failures
Cost dashboard
Track agent invocations, token usage, and fix success rate
Multi-repo orchestration
Coordinate fixes across frontend and backend repos
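The auto-rebase item is the most mechanical of these, so it is easy to sketch ahead of time. The version below is an assumption about how it could work, not the planned implementation: it fetches the base branch, rebases the agent's branch onto it, aborts on conflict, and pushes with `--force-with-lease` so a concurrent push is not overwritten. The `run` parameter is a hypothetical hook that executes a command and returns its exit code.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner executes one command and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def subprocess_runner(cmd: Sequence[str]) -> int:
    """Default runner: execute the command in the current repository."""
    return subprocess.run(list(cmd)).returncode


def rebase_branch(branch: str, run: Runner = subprocess_runner,
                  base: str = "main", remote: str = "origin") -> bool:
    """Rebase `branch` onto the latest remote base; return True on success."""
    steps: List[List[str]] = [
        ["git", "fetch", remote, base],
        ["git", "checkout", branch],
        ["git", "rebase", f"{remote}/{base}"],
        ["git", "push", "--force-with-lease", remote, branch],
    ]
    for cmd in steps:
        if run(cmd) != 0:
            if cmd[1] == "rebase":
                # Conflicts need a real fix; leave the branch as it was.
                run(["git", "rebase", "--abort"])
            return False
    return True
```

A `False` result is the point where the agent would be re-invoked to resolve the conflict instead of retrying blindly.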
Client work
Same patterns, different stack
The same approach applies to error monitoring, content pipelines, and automated QA. I test the ideas here before bringing them to client work.