Merge queues are a tough concept to grasp, and over the last five years at Mergify, we've spent countless hours educating developers about their importance and utility. We've published numerous blog posts, written extensive documentation, and even gone to conferences to teach software engineers what a merge queue is. This process of spreading awareness has been a rewarding yet challenging endeavor.
One of our developers, Charly Laurent, gave an insightful talk on the subject, highlighting how merge queues can revolutionize CI/CD processes. You can check out his talk here:
Understanding Merge Queues
Merge queues are not an obvious choice for most teams, and they often require a shift in the balance between safety and speed of delivery. Deploying a merge queue means prioritizing safety over raw throughput, which is not an easy decision for development teams under pressure to ship fast.
For example, without a merge queue, teams routinely merge code whose test results are stale: CI passed against an older version of main, so the merged result may not actually work. There is no way to prevent a pull request with outdated test results from merging and breaking CI for everyone. One of our customers faced this exact issue, and it cost them the equivalent of a full-time engineer dedicated to tracking down changes that broke CI on the main branch.
Most platform engineers find the shift from post-merge testing to pre-merge testing challenging.
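The core idea can be sketched in a few lines. This is a hypothetical simulation, not Mergify's implementation: each queued pull request is re-tested against the latest tip of main before it is allowed to merge, so stale test results can never land.

```python
from typing import Callable, List

def merge_queue(
    queue: List[str],
    main: List[str],
    ci_passes: Callable[[List[str]], bool],
) -> List[str]:
    """Merge queued PRs one at a time, re-testing each against
    the *current* tip of main rather than its stale base."""
    for pr in queue:
        candidate = main + [pr]      # PR rebased onto latest main
        if ci_passes(candidate):     # pre-merge CI run
            main = candidate         # CI is green: the PR lands
        # otherwise the PR is rejected and main stays green
    return main

# "b" only breaks when combined with "a"; the queue catches it
# even though "b" passed CI on its own against an older main.
def ci(state: List[str]) -> bool:
    return not {"a", "b"} <= set(state)

print(merge_queue(["a", "b", "c"], [], ci))  # ['a', 'c']
```

The point of the sketch is the `candidate = main + [pr]` line: testing happens on the merged result, not on the pull request's outdated base.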
The Trade-offs
This blog post from Vercel captures this common misunderstanding and the trade-off around merge queues and their CI costs and latency:
"Despite the majority of commits being safe to merge after the local CI checks complete on their pull request, the merge queue will incur the cost of running the CI again every time."
While this is true, the problem lies in the word "majority." What counts as a majority varies significantly across teams. If even a minority of pull requests break the main branch after merging, it can cause considerable downtime and require substantial effort from CI engineers to restore stability. We've seen teams come to Mergify with a 30% failure rate on their main branch. While a merge queue won't magically improve that failure rate, it ensures it doesn't worsen, even at the cost of a small decrease in merge speed. It also ensures that the effort invested in improving the CI is not wasted the day after.
Another perspective from Vercel states:
"With merge queues, changes from developers depend on changes from other developers even if they are unrelated to each other, and this makes it hard to scale monorepo merge times with more developers."
This concern is valid for merge queues that support neither monorepos nor queue parallelization. However, most modern merge queues (GitHub's own being an exception) do optimize for these scenarios.
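To illustrate why parallelization changes the scaling picture, here is a hypothetical sketch (not any vendor's actual algorithm): instead of testing pull requests strictly one by one, the queue builds one speculative candidate per queue prefix and runs their CI jobs concurrently, then merges the longest prefix that passed.

```python
from typing import Callable, List

def speculative_merge(
    queue: List[str],
    main: List[str],
    ci_passes: Callable[[List[str]], bool],
) -> List[str]:
    """Build one candidate per queue prefix (PR1, PR1+PR2, ...).
    A real queue runs these CI jobs in parallel; here we simply
    evaluate them and merge the longest passing prefix at once."""
    best = 0
    for depth in range(1, len(queue) + 1):
        candidate = main + queue[:depth]
        if ci_passes(candidate):
            best = depth     # longest green prefix so far
        else:
            break            # deeper prefixes contain the failure
    return main + queue[:best]

# Three independent PRs land in one CI round instead of three.
print(speculative_merge(["a", "b", "c"], [], lambda s: True))
# ['a', 'b', 'c']
```

Under this scheme, a batch of unrelated changes costs roughly one CI round of wall-clock latency instead of one round per pull request, which is what makes queue times scale with more developers.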
Vercel’s blog post concludes with:
"With this workflow in place, the merge queue can be safely removed because checks will still always be run before users ever see the deployment."
This reflects the workflow of many teams that don't use a merge queue: merge, run tests on main, then deploy. However, this approach doesn't solve the problem of merging something that breaks the main branch. During the resulting downtime, teams have to identify the culprit, revert changes, and verify that everything works again, causing delays and frustration. Bad developer experience ensues.
Teams like Uber recognized this problem six years ago and started building their own merge queues. Similarly, in OpenStack, we had a system that supported multiple repositories with Zuul over ten years ago.
Build New Solutions
Considering these adoption issues, we've spent the last few months reworking our merge queue system to simplify deployment and enhance the user experience. We know developers appreciate the reliability it brings to CI, but we have also seen how hard the system can be to discover and integrate. With a merge queue in place, teams can eliminate the "check that main works before deployment" step, because that verification happens before the actual merge.
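As a concrete illustration, a minimal Mergify-style configuration along these lines might look as follows. This is a sketch based on Mergify's documented queue_rules format; exact keys and conditions vary by version and setup, so treat it as an outline rather than a copy-paste recipe.

```yaml
# Sketch only: check the current Mergify documentation for exact keys.
queue_rules:
  - name: default
    # A PR merges only once CI passes against the up-to-date main.
    merge_conditions:
      - check-success=ci

pull_request_rules:
  - name: queue approved pull requests targeting main
    conditions:
      - base=main
      - "#approved-reviews-by>=1"
    actions:
      queue:
        name: default
```

With a rule like this, the "is main still green?" question is answered before the merge happens, not after.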
One notable example is a team that previously needed a full-time engineer to manage CI issues due to frequent breaks in the main branch. After adopting Mergify's merge queue, they drastically reduced these disruptions, allowing their engineers to focus on more productive tasks.
The Road Ahead
Merge queues are not without their challenges, and the trade-offs between safety and speed are not always apparent. However, we believe in their potential to transform development workflows. We're on the verge of redefining the merge queue concept at Mergify, and we think it has far greater potential than what has been realized over the past decade. I’ll be happy to write about that soon and share what we’ve built.