Ratchet: How I Let AI Optimize My Website 1% at a Time
Ratchet is an AI skill that runs autonomous experiments on your website — measuring metrics, making small changes, and keeping only what works. Here's how it works.
You've probably seen the hype around Andrej Karpathy's auto-research work. There's been a lot of breathless coverage about it, and I think most people have missed the actual point. The headline is about model training, sure, but the underlying principle — set up an automated loop that measures, experiments, evaluates, and iterates — maps perfectly to things that have nothing to do with training neural networks. Things like, say, making your website slightly less terrible every week.
I've been doing exactly this with my own website for a while now, and I built a Claude Code skill called Ratchet to formalize the whole process. The name comes from the idea that you're clicking forward one notch at a time — aiming for maybe 1% improvement per week. Not revolutionary. Not a redesign. Just a disciplined, data-driven ratchet that only ever moves in one direction: better.
The Problem With How We Usually Optimize Websites
Here's the thing about website optimization: most of us are terrible at it because we do it based on vibes. You read a blog post about how question-based titles get more clicks, so you rewrite all your titles as questions. You see someone on Twitter say that shorter meta descriptions perform better, so you hack yours down. Maybe it works, maybe it doesn't — you'll never actually know because you didn't measure anything before or after.
Even when people do A/B testing properly, it's usually a manual, one-off affair. You set up a test, wait two weeks, look at the results, make a decision, and then... don't do another test for three months because the whole process was exhausting.
What I wanted was something that would just keep going. An automated loop that pulls metrics, identifies the biggest opportunities, proposes an experiment, runs it, measures the results, and then either keeps the change or reverts it. No manual intervention needed except for the occasional sanity check on what it's proposing.
How the Ratchet Loop Works
The core loop is dead simple:
- Pull metrics — grab current data from Google Analytics, Search Console, newsletter signups, whatever sources you've configured
- Score pages by opportunity — rank every page by how much room it has to improve relative to your best performers
- Propose an experiment — target the highest-opportunity page with a specific, isolated change
- Execute the change — make it, deploy it, start the clock
- Wait and measure — collect data for the evaluation window (usually 7-21 days)
- Keep or discard — if the primary metric improved past the threshold, keep it. If not, git revert and move on
That's it. The ratchet only clicks forward. Changes that don't prove themselves get reverted. Over time, your site accumulates only proven improvements.
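The keep-or-discard step is the heart of the ratchet, and it's worth seeing how little logic it takes. Here's a minimal sketch — the `Experiment` fields and `decide` function are my illustration of the idea, not the skill's actual API:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    page: str
    metric: str          # primary metric name, e.g. "ctr"
    baseline: float      # pre-change snapshot of that metric
    threshold: float     # relative lift required to keep, e.g. 0.02 = +2%

def decide(exp: Experiment, observed: float) -> str:
    """Keep the change only if the primary metric beat the baseline by the
    pre-committed threshold; anything else gets reverted."""
    lift = (observed - exp.baseline) / exp.baseline
    return "keep" if lift >= exp.threshold else "revert"

# A hypothetical homepage-title experiment: CTR must improve by >= 2%
exp = Experiment(page="/", metric="ctr", baseline=0.040, threshold=0.02)
print(decide(exp, observed=0.043))  # +7.5% lift -> "keep"
print(decide(exp, observed=0.040))  # no lift    -> "revert"
```

The asymmetry is deliberate: a change that merely fails to hurt still gets reverted, because only proven improvements are allowed to accumulate.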
The clever bit is the opportunity scoring. Not all pages are equal — a high-traffic page with mediocre engagement is a much better experiment target than a low-traffic page that's already performing well. The formula is roughly:
opportunity = traffic_weight * (engagement_gap + conversion_gap)
Where the "gaps" measure how far below your top-quartile performance each page sits. This means experiments automatically target the pages where even small improvements compound across the most sessions.
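A minimal sketch of that scoring in Python — the normalization and gap math here are my guess at one reasonable reading of the formula, not the skill's actual implementation:

```python
def opportunity(traffic: float, engagement: float, conversion: float,
                top_engagement: float, top_conversion: float,
                max_traffic: float) -> float:
    """Score a page by how far below top-quartile performance it sits,
    weighted by how much traffic that gap applies to."""
    traffic_weight = traffic / max_traffic  # normalize to 0..1
    engagement_gap = max(0.0, (top_engagement - engagement) / top_engagement)
    conversion_gap = max(0.0, (top_conversion - conversion) / top_conversion)
    return traffic_weight * (engagement_gap + conversion_gap)

# A high-traffic page with mediocre engagement outscores a low-traffic
# page that's already near the top quartile:
busy = opportunity(traffic=9000, engagement=0.30, conversion=0.01,
                   top_engagement=0.60, top_conversion=0.03,
                   max_traffic=10000)
quiet = opportunity(traffic=400, engagement=0.55, conversion=0.028,
                    top_engagement=0.60, top_conversion=0.03,
                    max_traffic=10000)
```

With these (made-up) numbers, the busy page scores well over a hundred times higher than the quiet one, which is exactly the behavior you want from the ranking.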
The Experiment Format (Where It Gets Interesting)
The part of Ratchet I'm most pleased with is the experiment format. Every experiment lives in its own directory with a specific set of files, and this structure is what makes the whole thing actually work over long time horizons.
docs/ratchet/experiments/001-homepage-title-test/
├── program.md # Hypothesis, exact changes, rationale
├── metrics.md # What to measure, success thresholds
├── guardrails.md # What can and cannot change
├── baseline.json # Pre-change metric snapshot
├── result.json # Post-change snapshot (after eval)
└── decision.md # Keep/discard rationale (after decision)
Each of these files serves a specific purpose, and the key insight is that they force you (well, the AI) to commit to decisions before seeing results.
program.md is where the hypothesis lives. It says "I think changing the SEO title of this page from X to Y will improve click-through rate because Z." It records the exact changes being made — old value to new value — so anyone can understand and reproduce the experiment later. No hypothesis, no experiment. This prevents the AI from just randomly tweaking things and hoping for the best.
metrics.md defines success criteria upfront. What's the primary metric? What threshold counts as a win? What secondary metrics need to not degrade? By locking this down before the experiment runs, you prevent the post-hoc rationalization that kills most optimization efforts. You know the one: "Well, CTR went down, but time-on-page went up, so actually this was a success!" No. You defined success before you started. Stick to it.
guardrails.md is where the blast radius gets scoped. This is crucial when you're running multiple experiments on the same page — it declares which specific fields this experiment owns and which ones it can't touch. Two experiments can run on the same page as long as one is changing the title and the other is changing the CTA text. But if they both want to modify the title? That's a conflict, and the guardrails catch it before it happens.
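That conflict check is simple to express. Here's a hypothetical sketch — the field-ownership format is an assumption on my part, not Ratchet's actual guardrails.md structure:

```python
def find_conflicts(experiments: dict[str, dict[str, set[str]]]) -> list[tuple]:
    """Each experiment declares which fields it owns on which pages.
    Two experiments conflict if they claim the same field on the same page."""
    claimed: dict[tuple[str, str], str] = {}  # (page, field) -> experiment id
    conflicts = []
    for exp, pages in experiments.items():
        for page, fields in pages.items():
            for field in sorted(fields):
                key = (page, field)
                if key in claimed:
                    conflicts.append((claimed[key], exp, page, field))
                else:
                    claimed[key] = exp
    return conflicts

# Title vs CTA on the same page: fine. Two title experiments: caught.
ok = find_conflicts({
    "001-title": {"/pricing": {"seo_title"}},
    "002-cta":   {"/pricing": {"cta_text"}},
})
clash = find_conflicts({
    "001-title": {"/pricing": {"seo_title"}},
    "003-title": {"/pricing": {"seo_title"}},
})
```

Running this check before an experiment starts means the conflict surfaces at proposal time, in the PR, rather than as two changes silently stomping on each other mid-flight.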
Why Everything Lives in the Filesystem
This might seem overly structured for what is essentially "change a title and see what happens," but there's a reason everything lives in the repo as committed files rather than in a database or some external tool.
Website optimization experiments run on the scale of weeks. My site gets about 20,000 visitors a month, which isn't huge, so I need meaningful evaluation windows — often 7 to 14 days — before I have enough data to make a decision. That's a long time in AI-agent terms. The agent that started the experiment won't be the same session that evaluates it. The context window that proposed the change is long gone by the time the results come in.
By putting everything in the filesystem and committing it to git, you get a few critical things for free:
The experiment state survives across sessions. Any future invocation of the Ratchet skill can pick up where the last one left off — it just reads the experiment directory, checks the status, and knows exactly what's running, what needs evaluation, and what the success criteria are.
You get a complete audit trail via git history. Every change is a commit. Every revert is a commit. You can trace the entire history of what was tried, what worked, and what didn't.
And maybe most importantly, it's reviewable. When the AI proposes an experiment, the program.md, metrics.md, and guardrails.md are all sitting there in a PR for you to read. You can check that the hypothesis makes sense, that the guardrails are sane, and that the success criteria are reasonable — all before the experiment starts.
What the AI Actually Changes
I should be clear about what kinds of changes we're talking about here. This isn't the AI rewriting your entire homepage or redesigning your navigation. The changes are small and isolated — that's the whole point.
The most common experiments I've been running are things like:
- Adjusting SEO titles and meta descriptions to test different keyword placements or emotional hooks
- Changing the text on call-to-action buttons
- Reordering sections within a post to put the most engaging content higher
- Adding internal links to related content
- Tweaking heading text to better match search intent
For these text-only metadata changes, the AI can act autonomously — it proposes the change, commits it to a branch, and deploys. But for anything that touches layout, component positioning, or actual code, it stops and asks for approval first. There's a clear autonomy boundary baked into the system: low-risk text changes are auto-approved, structural changes require a human in the loop.
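That boundary can be expressed as a tiny classification rule. The tier sets below are hypothetical stand-ins that mirror the guardrails described here; the skill's actual classification may differ:

```python
# Assumed field tiers, loosely mirroring the guardrails/ tiers:
# low-risk text fields auto-approve, anything structural needs a human.
AUTO_APPROVED = {"seo_title", "meta_description", "cta_text", "heading_text"}

def requires_approval(changed_fields: set[str]) -> bool:
    """Auto-approve only if every field the experiment touches is in the
    low-risk text tier; one structural field flips the whole experiment
    to human review."""
    return not changed_fields <= AUTO_APPROVED

print(requires_approval({"seo_title"}))            # False: auto-deploy
print(requires_approval({"cta_text", "layout"}))   # True: wait for a human
```

The important property is that the check is conservative: a mixed experiment that touches both a title and a layout element goes to review, not halfway through on its own.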
The Connection to Auto-Research
This is why I think the hype around Karpathy's auto-research is both completely justified and hilariously misunderstood. The technique isn't special to model training. The pattern is universal:
- Define what you're trying to improve
- Measure where you currently stand
- Make a targeted change based on a hypothesis
- Measure the result
- Keep improvements, discard regressions
- Repeat
That's the scientific method with a git revert safety net. The AI just happens to be really good at proposing hypotheses, executing changes, and doing the tedious work of pulling metrics and comparing them. The part that actually matters — the structured experiment format, the pre-committed success criteria, the guardrails — is all just disciplined engineering practice.
What makes auto-research exciting isn't that AI can do experiments. It's that AI can do experiments continuously and patiently in a way that humans never will. I was never going to manually run a new SEO experiment every week for six months. But an AI skill that does it on a schedule, with proper baselines and evaluation windows and revert capabilities? That'll just keep going indefinitely, accumulating tiny wins.
Getting Started With Ratchet
The skill is open source and available in my skills repo. It's designed to be project-agnostic — all the project-specific configuration (which analytics platform you use, what your page types are, what conversion means for your site) lives in your project repo, not in the skill itself.
Running /ratchet init walks you through setup:
- Configure your analytics sources — GA4, Search Console, Plausible, whatever you're using. The skill needs to know how to pull metrics.
- Define your page types — not all pages are the same. A product review page has different success metrics than a documentation page or a landing page.
- Set up opportunity scoring — define what engagement and conversion mean for each page type, so the scoring formula knows how to rank opportunities.
After init, the directory structure gets scaffolded:
docs/ratchet/
├── config/
│ ├── api-access.md # Your analytics endpoints
│ └── opportunity-scoring.md # Page types and scoring weights
├── guardrails/
│ ├── content-only.md # Text-only change tier
│ ├── content-and-layout.md # Text + layout changes
│ └── structural.md # Full structural changes
├── experiments/ # Where experiments live
└── results.json # Historical results ledger
Then you just run /ratchet and let it do its thing. It pulls metrics, scores pages, proposes an experiment, and either auto-executes or waits for your approval depending on the change type.
Will This Actually Work?
Honestly, I don't know yet — and that's kind of the point. The whole system is designed around the premise that you don't have to know in advance what will work. You just have to be disciplined about measuring, hypothesizing, testing, and accepting the data's verdict.
What I can tell you is that the structure itself has been valuable even beyond the optimization results. Having a results.json ledger that records every experiment, its hypothesis, and whether it worked builds up a genuinely useful body of knowledge about what moves the needle for your specific site. After a dozen experiments, you start to see patterns — maybe keyword-focused titles consistently outperform clever ones, or maybe internal links in the first paragraph drive more engagement than links in the conclusion. That institutional knowledge compounds.
The real question is whether this could work at scale — on a site with significantly more traffic and more complex page types. I think the answer is yes, with some caveats. More traffic actually makes it easier because your evaluation windows can be shorter. But you'd need more sophisticated guardrails, tighter autonomy boundaries, and probably a more nuanced scoring model. The fundamental loop doesn't change though.
If you try it out, I genuinely want to hear about it. Poke holes in it. Tell me why it won't work for your use case. That's how the ratchet clicks forward — not just for the website, but for the tool itself.
