Twenty Years of Stacking Commits

In twenty years, I’ve used four different code review tools. Each was supposed to fix the previous one. None of them changed the fundamental question they were trying to answer: what is the right unit to review?

The answer hasn’t moved. The world around it has.

I started on git format-patch and git send-email in 2006, threading patches into mailing lists the way the Linux kernel did. I moved to Gerrit at OpenStack when it arrived. I tried to bring its workflow to GitHub myself, and failed. I watched GitHub turn every branch into a wall. And now I’m watching agents push code faster than humans can read it. Each era taught me the same thing: the unit of review is the commit. We just spent a decade pretending otherwise.

[Image: a 2010 patch email titled '[PATCH 2/5] Update test suite for xcb >= 1.6' on the XCB mailing list, signed by Julien Danjou, with a unified diff inline.]
[PATCH 2/5] from a patch series I sent to the XCB mailing list in 2010. Each patch was its own review unit. Stacking has been here the whole time.

Gerrit got it right

OpenStack ran on Gerrit. Google had built it in 2008 for Android. It was Shawn Pearce’s brainchild, started as a fork of Guido van Rossum’s Rietveld. Projects looking for serious code review followed.

Every commit you pushed got a Change-Id, a SHA-like footer Gerrit used to track that commit across rebases. One Change-Id was one review. Push six commits, get six review threads. Amend one and the others stayed untouched. Reviewers commented on the commit they cared about and ignored the rest.
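To make that tracking concrete, here is a minimal sketch of the idea, not Gerrit's actual code. It assumes the footer has the usual form of `Change-Id: I` followed by 40 hex characters. The `Commit` tuple and its `patch` field are hypothetical stand-ins for whatever a real server compares between pushes.

```python
import re
from typing import List, NamedTuple, Optional, Set

# Assumed footer shape: "Change-Id: I" followed by 40 hex characters.
CHANGE_ID_RE = re.compile(r"^Change-Id: (I[0-9a-f]{40})\s*$", re.MULTILINE)


class Commit(NamedTuple):
    sha: str      # changes on every amend or rebase
    message: str  # full commit message, footers included
    patch: str    # hypothetical stand-in for the commit's diff


def change_id(message: str) -> Optional[str]:
    """Return the last Change-Id footer in a commit message, if any."""
    found = CHANGE_ID_RE.findall(message)
    return found[-1] if found else None


def updated_reviews(before: List[Commit], after: List[Commit]) -> Set[str]:
    """Change-Ids that are new, or whose patch differs, between two pushes."""
    old = {change_id(c.message): c.patch for c in before}
    changed = set()
    for c in after:
        cid = change_id(c.message)
        # The SHA is ignored on purpose: only the Change-Id identifies a review.
        if cid is not None and old.get(cid) != c.patch:
            changed.add(cid)
    return changed
```

Because reviews are keyed on the footer rather than the SHA, a rebase that rewrites every SHA but leaves the patches alone yields an empty set here, and amending one commit yields exactly one Change-Id.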

The UX was hostile. You pushed code with git push origin HEAD:refs/for/master, a syntax that felt designed to keep newcomers out. Discussions happened in a web UI that loaded slowly even on fiber, or over a JSON-over-SSH API for the brave. You clicked through three screens to leave a comment. Senior contributors had Vim macros to make it bearable. Newcomers gave up.

But the model underneath was right. Each commit was a thought, and each thought got reviewed on its own terms. When you submitted a five-commit feature, reviewers could approve the refactoring commits while still arguing about the new abstraction in the last one. The work moved forward in pieces. Nobody waited for everything to be perfect before anything could land.

Compare this to what most teams do now. A reviewer opens a 1,500-line PR, scrolls for a while, picks a few things to comment on, and either approves or sends it back as a single block. There’s no granularity because the unit doesn’t allow it.

Gerrit knew this in 2008. It just didn’t know how to make people enjoy it.

git-review proved the UX was the wall

OpenStack built git-review to wrap Gerrit’s six-step push ritual in one command. git review did the right things in the right order. Fetch the latest target branch, rebase your commits on top, validate Change-Ids were present, push to the magic ref, print the URLs of the resulting reviews. That was it.
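Spelled out as plain git commands, that sequence looks roughly like the sketch below. This is not git-review's source, only the order described above. The remote name `origin`, the branch `master`, and both function names are assumptions for illustration. Gerrit reports the review URLs in the push output, so the last step needs no command of its own.

```python
from typing import List


def review_plan(remote: str = "origin", branch: str = "master") -> List[List[str]]:
    """Git commands for one push-for-review, in the order described above."""
    target = f"{remote}/{branch}"
    return [
        ["git", "fetch", remote, branch],             # latest target branch
        ["git", "rebase", target],                    # replay local commits on top
        ["git", "log", "--format=%B", f"{target}..HEAD"],  # messages to validate
        ["git", "push", remote, f"HEAD:refs/for/{branch}"],  # the magic ref
    ]


def lacks_change_id(message: str) -> bool:
    """True if no line of the commit message is a Change-Id footer."""
    return not any(
        line.startswith("Change-Id: I") for line in message.splitlines()
    )
```

Each entry can be handed to `subprocess.run(cmd, check=True)` in turn, stopping at the first failure.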

It got adopted across OpenStack within months. Within a year it was the default step in every contributor tutorial. Not because the tool was clever (it wasn’t), but because the friction it removed was the difference between people contributing and people giving up after their first push.

That was the lesson I carried out of OpenStack. A tool can have the right semantics and still lose if the UX is hostile. UX isn’t decoration. It’s part of correctness. The model that wins is the one a tired developer at 6pm can use without thinking. Gerrit’s model was right, but its surface area pushed everyone toward GitHub the moment GitHub got good enough.

GitHub turned every branch into a wall

GitHub got good enough around 2014. Faster UI, easier setup, and a model anyone could understand: one branch, one pull request. That model worked for solo contributions and small features. It collapsed the moment you tried to ship anything in pieces.

The problem is structural. GitHub’s PR is anchored to a branch. If you want five reviewable units, you need five branches. Each one rebased on the previous. Each one’s PR retargeted when the one below it lands. If the bottom of the stack changes, you rebase the world. Nobody does this. People give up and ship one giant PR instead, or they ship one PR per day for a week and pretend it’s fine.
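The bookkeeping in that paragraph fits in a few lines, which is part of why it is so tedious to do by hand. The sketch below is only a model of the chain. The branch names, the `main` trunk, and the function names are hypothetical, and nothing here talks to GitHub.

```python
from typing import Dict, List, Tuple


def pr_bases(stack: List[str], trunk: str = "main") -> Dict[str, str]:
    """Map each branch in a bottom-to-top stack to the base its PR targets."""
    bases = {}
    below = trunk
    for branch in stack:
        bases[branch] = below
        below = branch
    return bases


def land_bottom(stack: List[str], trunk: str = "main") -> Tuple[List[str], Dict[str, str]]:
    """Merge the bottom branch; the PR above it must be retargeted at trunk."""
    remaining = stack[1:]
    return remaining, pr_bases(remaining, trunk)


def needs_rebase(stack: List[str], changed: str) -> List[str]:
    """Every branch above a changed one has to be rebased."""
    return stack[stack.index(changed) + 1:]
```

With five branches, a change to the bottom one returns the other four from `needs_rebase`. That is the "rebase the world" step.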

I ranted about exactly this in 2013, back when most of the industry was busy convincing itself GitHub had solved code review. The argument hasn’t aged. GitHub’s PR model fights you the moment you try to ship more than one thought at a time.

I tried to fix it myself in 2016 with git-pull-request, a small Python CLI that turned commits on a single branch into chained GitHub PRs. It worked for me. It didn’t go anywhere. A one-person CLI can’t paper over a structural mismatch. The branching, the chaining, the dependency tracking, the smart updates when you amend a commit mid-stack, all of that has to live somewhere both halves of the workflow agree on. Client-side scripts work on your machine and break on someone else’s.

Years later we built Mergify Stack because the problem hadn’t gone away and the team was bigger than just me. One local branch, every commit becomes its own PR, the tool handles the chaining so amending mid-stack only updates the affected reviews.

[Image: a Mergify Stack comment in a GitHub pull request, showing a 9-row table where each row is a chained PR in the stack, with PR numbers and titles for a Rust port project.]
A nine-PR stack from our ongoing Rust port, posted as a comment on every PR in the chain. Each row is one commit, one PR, one review unit. The arrow points at the PR you’re currently looking at.

We weren’t the only ones who noticed the gap. Graphite bet a whole company on it, building an IDE around stacked diffs that effectively replaces GitHub’s review surface. Meta open-sourced Sapling’s ghstack. Google open-sourced jj, a Git-compatible VCS where stacks are first-class. There’s spr, stack-pr, half a dozen others. Different shapes, same admission: GitHub’s PR model doesn’t fit how serious teams ship.

My day-to-day hasn’t really changed since I was patching Xorg, or awesome, or filing bugs against Debian packages I maintained. One branch. Commit small. Interactive rebase to fix something three commits down. mergify stack push to send the whole stack up. The only difference is GitHub showing seven separate review threads instead of a mailing list showing six.

This was the missing layer. Stack semantics, GitHub surface. For a while, that was enough.

And then the agents arrived

Until last year, the bottleneck on a software team was the human writing the code. A senior engineer could ship maybe two or three thoughtful PRs in a day. Junior engineers shipped less. Reviewers mostly kept up. The whole pipeline was sized for human output.

That bottleneck is gone. With Claude Code in front of me, I can produce in an afternoon what used to take a week. Not because I’m typing faster. Because I’m not typing at all. I’m planning, briefing, reviewing, redirecting. The agents do the typing. They don’t get tired.

I’ve written about how this shifts the bottleneck downstream. The reviewer is now the choke point, and the reviewer’s tools haven’t kept up. They’re still being asked to read 1,500-line PRs the way they did when those PRs took someone a week to write. They skim. They miss things. They approve and hope.

Stacks don’t shrink the work. They make partial progress possible. A reviewer who can’t approve all of it can still approve half of it. Two reviewers can split the stack and work in parallel. A bug found in PR five doesn’t roll back PR one. The total volume is unchanged. The blast radius and the cognitive load per review are not.
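Partial progress can be stated as a rule. As a simplifying assumption (not a description of any particular tool), say a PR can land only once it and everything below it in the stack is approved. The function name is hypothetical.

```python
from typing import List


def landable_prefix(approved: List[bool]) -> int:
    """Count the PRs, from the bottom of the stack up, that can merge now.

    `approved` is ordered bottom to top. Counting stops at the first
    unapproved PR, since everything above it depends on it.
    """
    count = 0
    for ok in approved:
        if not ok:
            break
        count += 1
    return count
```

For a seven-PR stack with the bottom three approved, this returns 3. The equivalent single PR would be sitting at zero.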

A couple of weeks ago I shipped an API surface refactor across our backend and frontend. I had a plan with seven steps: add a new /merge-queue/branches endpoint, remove a deprecated old one, migrate the web app’s branch selector, drop a dead related surface, add a sister /stats/branches endpoint, migrate the stats filter, deprecate the old union endpoint. I briefed the agents. I reviewed the work. Six seconds after I ran mergify stack push, GitHub had seven open pull requests linked in dependency order.

Combined, those PRs touched roughly seven thousand lines across seventy-five file changes. As a single PR, no human would have read it carefully. As a stack, each one was a single logical move. The first landed two days later, when the lowest-risk piece (deleting the deprecated endpoint) was approved. One contested API addition cycled through a couple of rounds of dismissed reviews before landing. The last merged nine days after the push. Five different reviewers touched the stack across that window, picking up whichever links they had context for. Nobody had to load the whole refactor into their head.

Try that as one PR and tell me what happens. Seven thousand lines mixing five backend moves with two frontend migrations. A reviewer who needs to understand all of it before approving any of it. Anything wrong anywhere triggers a full revert. The work either ships as a brick or it doesn’t ship at all.

Stacks aren’t free and they aren’t universal. Exploratory work that changes shape mid-stream wants a single branch, not seven. Tightly coupled changes where commit five only makes sense once you’ve read commit four don’t decompose neatly. Bug fixes are usually one PR. The skill is knowing which kind of work you’re doing before you start. If your team can’t write atomic commits or doesn’t trust interactive rebase, stacks make things worse, not better. The model demands discipline. The bet is that AI removes enough of the writing tax that the discipline becomes affordable.

Gerrit knew this in 2008. The unit of review is the commit. The PR was always a workaround for tooling that couldn’t track commits properly. It took agents writing code faster than humans can read it for the rest of the industry to admit it.

In January, GitHub stopped denying it. They announced native Stacked PRs in private preview, with the gh stack CLI and an AI agent integration shipped as a skill.

[Image: GitHub's stack navigator UI, a vertical sidebar listing the PRs in a stack with their statuses, attached to a pull request page.]
GitHub’s own stack-navigator preview from the gh-stack landing page. The PR map on the right is what reviewers have been asking for since 2014.

The PR model GitHub spent a decade enforcing as the right unit of review is being quietly retrofitted to admit it wasn’t. They couldn’t say so until everyone else had already built the layer they should have shipped in 2014.

The tools change every five years. Gerrit. git-review. Mergify Stack. Graphite. GitHub Stacked PRs. Whatever ships next. Each generation looks like a step forward and most of them are.

The unit of review hasn’t moved in twenty years. It’s the commit. It always was. We just had to wait for the rest of the pipeline to break before we noticed.