Why PR review bottlenecks kill engineering velocity (and what to do about it)

The pull request has been open for nineteen hours. The author has pushed two more commits since opening it — one to address a comment from a reviewer who only skimmed the first file, one to fix a rebase conflict from a different PR that landed while this one sat idle. The original reviewer hasn't responded yet. A second reviewer was requested this morning and hasn't looked at it.

This is not a story about bad engineers. It's a story about a system that was never designed to scale — and most teams are deep inside it without quite realizing how much it's costing them.

The compounding math of review wait time

A PR that waits 20 hours for first review isn't a 20-hour delay in isolation. It's a 20-hour delay per PR, multiplied by however many PRs your team ships per week, compounded by the context-switching cost of returning to code you wrote yesterday.

The DORA metrics framework identifies "lead time for changes" (the time from commit to production) as one of four key indicators of engineering performance. For many teams, review wait time is the largest controllable component of that metric. The other three indicators (deployment frequency, change failure rate, and time to restore) all depend in part on whether changes can move through review without sitting in a queue.

Consider a concrete scenario: a 12-engineer team shipping around 40 PRs per week. If average first-review wait time is 18 hours, and the average PR requires 1.4 review cycles (a realistic number for substantive code changes), authors return to a PR that has gone cold about 56 times a week. Assuming each return costs around half an hour of context recovery, you're looking at roughly 25–30 cumulative engineering-hours per week spent context-switching back into PRs after review delays. That's the equivalent of one engineer spending about three-quarters of their week doing nothing but context recovery.
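The arithmetic behind that estimate can be sketched in a few lines. The PR volume and cycle count come from the scenario above. The 30 minutes of recovery per return is an assumption chosen because it lands inside the 25–30 hour range; it is not a measured value, and your own number may differ.

```python
def weekly_context_recovery_hours(prs_per_week, review_cycles_per_pr,
                                  recovery_minutes_per_return):
    """Estimate hours per week spent rebuilding context after review delays."""
    # Each review cycle forces the author back into the PR once.
    returns_per_week = prs_per_week * review_cycles_per_pr
    return returns_per_week * recovery_minutes_per_return / 60


# 40 PRs/week and 1.4 cycles/PR are from the scenario.
# 30 minutes per return is an assumed figure, not a measurement.
hours = weekly_context_recovery_hours(40, 1.4, 30)
share_of_one_engineer = hours / 40  # assumes a 40-hour work week

print(f"about {hours:.0f} hours per week")
print(f"about {share_of_one_engineer:.0%} of one engineer's week")
```

Changing the recovery cost to 25 or 35 minutes moves the total between roughly 23 and 33 hours, so the conclusion is not very sensitive to the exact assumption.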

The psychological cost is harder to measure but arguably larger. Context depth is expensive to rebuild. Engineers who know their work will sit in queue often compensate by making PRs larger — reasoning that one 600-line PR is less context-switching overhead than four 150-line ones. This makes each review harder, which makes review slower, which increases wait times. The loop is self-reinforcing.

What actually causes review bottlenecks

Most engineering managers I've talked to identify the same root causes when they dig into their data. The list usually looks something like this:

  • Reviewer assignment asymmetry. A small number of engineers — typically staff-level or longer-tenured — are implicitly trusted with anything complex. They become the critical path. Their review queue grows faster than they can clear it.
  • PR scope creep. PRs that are too large don't just take longer to review; they signal to reviewers that deep engagement is required, so reviewers defer them until they have a "proper block" of time that never arrives.
  • Review without signal. When a reviewer opens a 400-line diff, nothing tells them where to focus. They read everything or they skim everything. Neither is efficient. Both are worse than targeted review.
  • Feedback loops that restart review cycles. A comment on line 47 triggers a change that affects logic in file 3, which the reviewer didn't check again after the re-push. Now there's uncertainty about whether the fix introduced anything new.

These aren't primarily people problems. They're workflow design problems — and that distinction matters for what you do next.

The "just write smaller PRs" advice and why it's incomplete

The standard advice is to keep PRs under 400 lines, ship atomic changes, and split refactors from features. This is correct and worth enforcing. But it's incomplete advice for two reasons.

First, PR size is a symptom as much as a cause. Teams ship large PRs partly because small PRs take the same review overhead — same back-and-forth cycles, same wait times — so the marginal cost of adding 200 more lines is low. Fixing PR size without fixing review throughput means you've split the problem into more pieces without reducing its total mass.

Second, small PRs can still contain complex semantic changes. A 60-line diff that refactors an authentication flow or changes how a background job accumulates state can be as risky as a 600-line feature addition — possibly more so, because the diff is deceptively short and reviewers may underestimate the blast radius.

We're not saying PR size doesn't matter — it absolutely does, and keeping diffs focused is one of the highest-impact habits a team can build. We're saying that size reduction alone doesn't address the core problem: reviewers lack the right signal about where to spend their attention and what the actual risk surface is for a given change.

Review assignment: the hidden bottleneck multiplier

Most teams use one of two reviewer assignment strategies: round-robin (fair but uninformed) or implicit expertise routing (accurate but bottleneck-creating). Neither is optimal.

A better model distributes review load based on a combination of code ownership, recent familiarity with the changed files, and current review load. The "recent familiarity" component is critical: someone who touched the billing module two weeks ago will often write a better review of a billing PR than someone who owns the module but hasn't been in that area of the codebase for three months.
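One way to make that model explicit is a small scoring function. The sketch below is illustrative only: the `Candidate` fields, the weights, and the 30-day familiarity half-life are all hypothetical values picked for the example, not recommendations, and a real team would tune them against its own data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    owns_changed_code: bool      # e.g. listed as owner of the changed files
    days_since_last_touch: int   # days since they last edited those files
    open_review_requests: int    # reviews currently waiting on them


def routing_score(c, familiarity_half_life_days=30.0,
                  ownership_weight=0.5, load_penalty=0.25):
    """Higher is better. All weights are hypothetical tuning knobs."""
    # Familiarity decays exponentially: it halves every half-life.
    familiarity = 0.5 ** (c.days_since_last_touch / familiarity_half_life_days)
    ownership = ownership_weight if c.owns_changed_code else 0.0
    # Each open request makes the candidate less attractive.
    return familiarity + ownership - load_penalty * c.open_review_requests


def pick_reviewer(candidates):
    """Return the candidate with the highest routing score."""
    return max(candidates, key=routing_score)


candidates = [
    Candidate("module_owner", True, 90, 4),
    Candidate("recent_contributor", False, 14, 1),
    Candidate("newcomer", False, 200, 0),
]
chosen = pick_reviewer(candidates)
print(chosen.name)  # recent_contributor
```

With these example weights, the overloaded owner loses to the engineer who was in the code two weeks ago. The point is less the specific formula than that the weights are written down where they can be examined and changed.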

Teams that instrument their review assignment see consistent patterns: the top two or three reviewers by volume receive 40–60% of all review requests, while the rest of the team is underutilized. This isn't just unfair to those engineers; it's a systemic risk. When those reviewers are out, reviews stop. When they're in the middle of deep work, reviews slow to a crawl.
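Checking whether your own team shows this concentration takes very little code. The sketch below assumes you can export one reviewer name per review request from your code host; the names and counts are made-up sample data.

```python
from collections import Counter


def top_reviewer_share(review_requests, top_n=3):
    """Fraction of all review requests that went to the top_n busiest reviewers."""
    counts = Counter(review_requests)
    if not counts:
        return 0.0
    top_total = sum(n for _, n in counts.most_common(top_n))
    return top_total / sum(counts.values())


# Hypothetical export: one entry per review request, naming the reviewer.
requests = ["ana"] * 5 + ["ben"] * 3 + ["chi"] + ["dev"]
share = top_reviewer_share(requests, top_n=2)
print(f"top 2 reviewers handle {share:.0%} of requests")  # 80%
```

Tracking this share over time shows whether changes to assignment rules are actually spreading the load.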

Distributing review load doesn't mean ignoring expertise. It means making the routing algorithm explicit rather than implicit, so it can be examined and tuned rather than just accepted.

What high-performing teams do differently

Teams that consistently maintain low PR cycle times — under 24 hours end-to-end, including actual review time, not just wait time — tend to share several practices:

They separate structural from semantic review. Formatting, naming, and style questions go to automated tooling (formatters, linters, style checkers) before a human ever opens the diff. Human review time is spent exclusively on logic, architecture, and risk.

They surface review priority signals inline. Rather than asking reviewers to read an entire diff to understand risk, they provide context at the point of change: which functions are affected, which callers depend on changed behavior, which paths are untested. When reviewers know where to look, they look there first.

They close the feedback loop with metrics. Teams that measure review cycle time, first-review wait time, and review iteration count can spot when a change in workflow (a new engineer, a new type of work, a sprint crunch) starts degrading performance. Teams that don't measure can only notice the problem when it's already severe.

They treat review as a shared team obligation, not a favor. This is a culture point more than a tooling point, but it's enabled by tooling: when the review queue is visible to everyone, when requests are distributed fairly, and when the cost of delays is measured, review becomes a team responsibility rather than something you squeeze in between "real work."

The metric worth watching first

If you're starting from zero on review metrics, the single most useful number to track is time to first review — not total cycle time, not merge time, but specifically: how long does a PR wait after being opened before it receives any substantive review activity?

This metric is leading, not lagging. It doesn't tell you whether a PR was good or bad; it tells you whether your review system has enough throughput to handle your team's output rate. If your median time to first review is above four hours during business hours, you have a structural bottleneck, not a people problem. And structural bottlenecks have structural solutions.
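A minimal sketch of the calculation, assuming you can export an opened timestamp and a first-review timestamp for each PR (the sample data here is invented). For simplicity it measures wall-clock time; restricting the measurement to business hours would need a working-hours calendar, which is left out.

```python
from datetime import datetime
from statistics import median


def median_hours_to_first_review(prs):
    """prs: iterable of (opened_at, first_review_at or None) datetime pairs.

    PRs that have not been reviewed yet are skipped. Returns None if no
    PR in the input has received a review.
    """
    waits = [
        (first_review - opened).total_seconds() / 3600
        for opened, first_review in prs
        if first_review is not None
    ]
    return median(waits) if waits else None


# Invented sample data: three reviewed PRs and one still waiting.
prs = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 11, 0)),   # 2 hours
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 16, 0)),  # 6 hours
    (datetime(2024, 3, 4, 14, 0), datetime(2024, 3, 5, 9, 0)),   # 19 hours
    (datetime(2024, 3, 5, 9, 0), None),                          # not reviewed yet
]
result = median_hours_to_first_review(prs)
print(result)  # 6.0
```

The median is used rather than the mean so that a single PR left open over a weekend does not dominate the number. Skipping unreviewed PRs flatters the result slightly, so it is worth also counting how many PRs are still waiting.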

The second metric worth tracking is review iteration count: how many back-and-forth cycles does a typical PR require before merge? More than two cycles on average usually signals that either PRs aren't being described clearly, or reviewers aren't reviewing in depth on the first pass. Both are fixable — but you won't know which until you measure.

A note on tooling

Better tooling doesn't fix a broken review culture, but it does change the floor. When reviewers have inline signal about test gaps and semantic risk, they can spend their cognitive budget on things humans are genuinely better at — design questions, architectural tradeoffs, edge cases that automated analysis doesn't cover.

The goal isn't to replace the reviewer. It's to make each review session count more, so fewer of them are needed, so the queue moves faster, so PRs don't sit for nineteen hours because the one person who could give a confident review is context-switching between four other demands.

Review bottlenecks are not inevitable. They are a property of systems that were built incrementally, without measurement, and without deliberate design. They are also, for that reason, fixable.