Measuring code review is harder than it sounds. The obvious metrics — number of PRs reviewed, lines of code reviewed, comments left — are easy to track and mostly useless. They measure activity, not quality or impact. Worse, they can actively damage engineering culture if they're surfaced to engineers as performance signals: people learn to game them, and the underlying problems they were meant to surface get worse.
But the answer isn't to avoid measuring code review. Review is the highest-value quality gate in most engineering teams' process, and unmeasured processes don't improve. The challenge is identifying the metrics that actually correlate with what you care about — shipping velocity, code quality, team health — and ignoring the ones that correlate with nothing except review busywork.
This is a guide to building a review metrics framework that your team will actually act on.
Start with the DORA four — but understand what review metrics they connect to
The DORA (DevOps Research and Assessment) framework's four key metrics — deployment frequency, lead time for changes, change failure rate, and mean time to recover — are the most defensible high-level productivity measures for software engineering teams. They're outcome-oriented, not activity-oriented, which makes them resistant to gaming.
Code review metrics feed into these four, primarily through lead time for changes. The lead time chain looks like this:
commit → PR opened → first review → final approval → merge → deploy
Most teams have reasonable visibility into the first and last steps of this chain. The middle steps — the review portion — are where the largest time variation typically lives, and where the fewest teams have instrumented data.
If your lead time is 48 hours and your deploy-to-production time is 20 minutes, the remaining 47+ hours sit upstream of the deploy step, and for most teams the bulk of that is the review portion of the chain. That's the problem to instrument and solve. The DORA metrics tell you there's a problem; review metrics tell you where in the chain it is.
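As a rough illustration of instrumenting the chain above, the sketch below splits one change's lead time into per-stage durations. The stage keys and the example timestamps are hypothetical; in practice they would come from your version control and deployment tooling.

```python
from datetime import datetime

# Stage names follow the chain described above; the key names are illustrative.
STAGES = ["commit", "pr_opened", "first_review", "final_approval", "merge", "deploy"]

def stage_durations_hours(timestamps):
    """Return the hours elapsed between each consecutive pair of stages."""
    durations = {}
    for start, end in zip(STAGES, STAGES[1:]):
        delta = timestamps[end] - timestamps[start]
        durations[f"{start} -> {end}"] = delta.total_seconds() / 3600
    return durations

# Made-up timestamps for a single change with a 48-hour lead time.
example = {
    "commit": datetime(2024, 1, 8, 9, 0),
    "pr_opened": datetime(2024, 1, 8, 9, 30),
    "first_review": datetime(2024, 1, 9, 15, 30),
    "final_approval": datetime(2024, 1, 10, 8, 0),
    "merge": datetime(2024, 1, 10, 8, 40),
    "deploy": datetime(2024, 1, 10, 9, 0),
}

durations = stage_durations_hours(example)
slowest = max(durations, key=durations.get)  # stage with the most elapsed time
```

In this made-up example the wait between PR open and first review accounts for 30 of the 48 hours, which is the kind of breakdown the DORA number alone does not show.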
The metrics that actually matter: a tiered view
Tier 1: Metrics worth tracking and acting on weekly
Time to first review (P50 and P90). The time between a PR being opened and the first substantive review activity: actual comments or a considered approval, not a drive-by "LGTM" stamp. This is the most sensitive leading indicator of review system health. P50 tells you typical behavior; P90 tells you what the long tail looks like. A P90 above 8 hours during business hours is a clear signal of a structural bottleneck.
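A minimal sketch of computing P50 and P90 from a list of wait times, using the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values). The sample data is invented, and the sketch uses raw elapsed hours; restricting to business hours would need a calendar-aware calculation that is not shown here.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical hours from PR open to first substantive review.
hours_to_first_review = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 9.0, 26.0]

p50 = percentile(hours_to_first_review, 50)
p90 = percentile(hours_to_first_review, 90)
long_tail_flag = p90 > 8  # threshold taken from the guidance above
```

Here the typical PR looks healthy while the long tail crosses the 8-hour threshold, which is why both numbers are worth tracking.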
PR cycle time. Total elapsed time from PR open to merge, including all review iterations. This is a component of DORA lead time. Segment it by PR size and by author to find patterns: are large PRs the problem? Are certain engineers consistently waiting longer?
Review iteration count. How many back-and-forth cycles does a PR require before merge? High iteration counts (above 3 on average) suggest either PRs are landing without sufficient context, or reviewers are doing partial reviews and requiring multiple passes. Both are fixable — but you need the data to distinguish them.
Tier 2: Metrics worth tracking monthly for trend direction
Defect escape rate. The ratio of bugs found post-merge (in staging or production) to total PRs merged in a period. This is the quality output metric that review is supposed to improve. If defect escape rate is stable or rising despite review investment, your review process isn't catching what it should — and you need to understand why.
Reviewer load distribution. The Gini coefficient of reviews per reviewer (or, more simply, the share of all reviews handled by the top N reviewers) across your team. Skewed distribution (say, two engineers handling 60% of reviews) is both a bottleneck risk and a fairness issue. It also tends to correlate with higher time-to-first-review because the heavily loaded reviewers become the critical path.
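Both measures are straightforward to compute from per-reviewer review counts. The sketch below uses the standard sorted-values formula for the Gini coefficient; the reviewer names and counts are made up to mirror the "two engineers handling 60%" case.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 means perfectly even load."""
    ordered = sorted(counts)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending
    weighted = sum((i + 1) * c for i, c in enumerate(ordered))
    return 2 * weighted / (n * total) - (n + 1) / n

def top_n_share(counts, n):
    """Fraction of all reviews handled by the n busiest reviewers."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(sorted(counts, reverse=True)[:n]) / total

# Hypothetical reviews completed per engineer over one month.
reviews = {"ana": 30, "ben": 30, "cy": 10, "dee": 10, "eli": 10, "fay": 10}

load_gini = gini(reviews.values())
top_two = top_n_share(reviews.values(), 2)
```

The top-N share is easier to explain to a team; the Gini coefficient is more useful for tracking a trend over time because it does not depend on choosing N.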
Review coverage by file/module. Which parts of the codebase get heavy review attention, and which parts are consistently merged with minimal review? This is particularly interesting when overlaid with your defect map: are low-review modules also your high-bug modules? They often are.
Tier 3: Metrics to interpret carefully (they can mislead)
Comment count per PR. More comments do not mean better review. Comment count correlates with PR size, code complexity, and reviewer thoroughness, but a reviewer who leaves five specific, high-value comments is doing better work than one who leaves twenty nitpicks. Track comment count for size-normalization purposes, but don't optimize for it.
Approval time. How quickly reviewers click "approve." Fast approvals can mean thorough reviewers who are focused and efficient, or rubber-stamp approvals that aren't actually reviewing anything. This metric requires additional context (what was the PR? who approved it?) before it's actionable.
Lines of code reviewed. Completely useless as a per-engineer metric. A 10-line security-critical change requires more careful review than a 500-line automated migration. Do not surface this metric to engineers.
The reviewer load problem: why it's harder to fix than it looks
Reviewer load imbalance is one of the most consistently damaging patterns in engineering team review processes, and it's also one of the most resistant to simple interventions.
The naive fix is round-robin assignment: distribute reviews evenly regardless of expertise. This does reduce load imbalance, but at the cost of review quality — engineers reviewing code in unfamiliar parts of the codebase leave worse comments and miss more issues. The implicit expertise routing that creates the imbalance exists for a reason.
The better model is informed load balancing: route reviews based on a combination of code ownership, recent familiarity with the changed files (who has touched this module in the last 60 days?), and current review load. This keeps expertise-based routing while preventing any one engineer from becoming the sole reviewer for an entire domain.
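One way to sketch informed load balancing is a scoring function that combines the three signals. Everything specific here is an assumption for illustration: the `Candidate` fields, the weights, and the capacity cap of five open reviews are not prescribed by any tool, and only the 60-day familiarity window comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    name: str
    owns_module: bool
    days_since_last_touch: Optional[int]  # None means never touched these files
    open_reviews: int

def routing_score(c, window_days=60, max_open_reviews=5):
    """Higher is better. Returns None when the candidate is at capacity."""
    if c.open_reviews >= max_open_reviews:
        return None
    ownership = 1.0 if c.owns_module else 0.0
    if c.days_since_last_touch is None or c.days_since_last_touch > window_days:
        familiarity = 0.0
    else:
        # Linear decay: touched today scores 1.0, touched window_days ago scores 0.0
        familiarity = 1.0 - c.days_since_last_touch / window_days
    load_penalty = 1.5 * c.open_reviews / max_open_reviews  # illustrative weight
    return ownership + familiarity - load_penalty

def pick_reviewer(candidates):
    """Return the name of the best-scoring candidate with capacity, or None."""
    scored = [(routing_score(c), c.name) for c in candidates]
    eligible = [(score, name) for score, name in scored if score is not None]
    return max(eligible)[1] if eligible else None

candidates = [
    Candidate("ana", owns_module=True, days_since_last_touch=5, open_reviews=5),
    Candidate("ben", owns_module=False, days_since_last_touch=10, open_reviews=1),
    Candidate("cy", owns_module=False, days_since_last_touch=None, open_reviews=0),
]
chosen = pick_reviewer(candidates)
```

In this example the module owner is skipped because she is at capacity, and the review goes to the engineer with recent familiarity rather than to the least-loaded one. The weights would need tuning against your own data.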
This is harder to implement than round-robin but meaningfully better in practice. The engineering investment is worth it once your team reaches the scale where load imbalance is consistently showing up in your metrics (typically around 8+ engineers on a single codebase).
Connecting review metrics to quality outcomes
The metric your leadership most likely cares about is defect escape rate — how often changes that pass review turn out to be buggy. But defect escape rate is a lagging metric: by the time it moves, you've already shipped the bugs.
The useful exercise is building a causal map from review process metrics to quality outcomes, so you can predict quality degradation before it happens. The relationships to test empirically:
- Does time-to-first-review above X hours correlate with higher defect escape rate in those PRs? (Often yes — PRs that sit in queue get rebased, the context gets stale, and reviewers review under more time pressure.)
- Do PRs with above-average iteration counts have lower post-merge defect rates, or higher? (This is non-obvious — more iterations can mean more thorough review, or it can mean unclear PRs that required more clarification without more quality.)
- Do PRs reviewed by engineers with familiarity in the changed module (recently committed to those files) have lower defect rates than those reviewed by less-familiar engineers? (Usually yes — but quantifying the gap helps you justify the routing investment.)
These correlations won't be linear or universal — they'll vary by codebase, team composition, and type of work. But testing them on your own data is how you build a review metrics program that's grounded in your team's actual behavior rather than generic industry benchmarks.
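A first pass at the time-to-first-review question can be as simple as comparing defect rates on either side of a threshold. The sketch below assumes each PR is recorded as a pair of (hours to first review, whether a defect was later traced to it); the data is invented, and a difference between buckets on a small sample is a prompt for further investigation, not evidence of causation.

```python
def defect_rate_by_bucket(prs, threshold_hours):
    """Split PRs at threshold_hours and return the defect rate of each bucket."""
    buckets = {"fast": [0, 0], "slow": [0, 0]}  # [defective, total]
    for hours, had_defect in prs:
        key = "slow" if hours > threshold_hours else "fast"
        buckets[key][0] += int(had_defect)
        buckets[key][1] += 1
    # None marks an empty bucket, where a rate is undefined
    return {
        key: (defective / total if total else None)
        for key, (defective, total) in buckets.items()
    }

# Hypothetical PRs: (hours to first review, defect found after merge)
prs = [
    (1, False), (2, False), (3, True), (4, False),
    (10, True), (12, False), (20, True), (30, True),
]
rates = defect_rate_by_bucket(prs, threshold_hours=8)
```

The same bucketing approach works for the other two questions by swapping the split criterion for iteration count or reviewer familiarity.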
What to present to leadership, and what not to
Engineering leaders often want to present code review metrics upward — to VPs, CTOs, boards. The temptation is to present activity metrics (PRs reviewed, review speed) because they look like evidence of productivity. Resist this.
Activity metrics presented to non-technical stakeholders create the wrong incentives: engineers optimize for throughput rather than quality, and the numbers go up while the actual outcomes — shipping velocity, defect rate — go sideways or worse.
The metrics worth presenting upward are the DORA outcomes (lead time trend, deployment frequency, change failure rate over the last quarter) with a brief annotation of which review process changes drove the trend. That framing — "we improved lead time by 30% by reducing P90 time-to-first-review, here's how" — is both honest and meaningful to a non-technical audience.
Internally, with the engineering team, the detailed metrics belong in a shared dashboard that engineers can see and interpret themselves. Metrics that are only visible to managers, in a room engineers don't have access to, create surveillance anxiety and don't change behavior. Metrics visible to the team, with clear interpretive context ("this spike in P90 review wait time was during the sprint crunch — we're watching it"), build shared ownership of the process.
Avoiding the Goodhart trap
Goodhart's Law states: when a measure becomes a target, it ceases to be a good measure. This is a genuine risk with any engineering metrics program, and code review metrics are particularly vulnerable to it.
The typical failure mode looks like this: an engineering manager starts tracking time-to-first-review and shares it with the team. Engineers, wanting to improve the number, start leaving quick comments on PRs immediately after they're opened — not substantive review, just "looks good, will review more carefully later" — which satisfies the metric without doing any of the underlying work.
To make metrics Goodhart-resistant, a few principles help. First, measure outcomes alongside process metrics: if time-to-first-review improves but defect escape rate rises, the process metric is being gamed. Second, avoid making process metrics individual performance signals — when they're team-level aggregates, the incentive is collective improvement rather than individual optimization. Third, rotate which metrics are actively tracked and discussed: when engineers don't know which metric is "the one that's being watched this month," they're less likely to optimize narrowly for any single one.
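The first principle can be turned into a simple automated check. This sketch compares two reporting periods and flags the case where the process metric improves while the outcome metric worsens; the dictionary keys and the numbers are hypothetical, and a real check would want a tolerance for normal period-to-period noise.

```python
def looks_gamed(previous, current):
    """True if time-to-first-review improved while defect escape rate got worse."""
    review_faster = (
        current["p90_first_review_hours"] < previous["p90_first_review_hours"]
    )
    quality_worse = current["defect_escape_rate"] > previous["defect_escape_rate"]
    return review_faster and quality_worse

# Made-up team-level aggregates for two consecutive months.
last_month = {"p90_first_review_hours": 10.0, "defect_escape_rate": 0.04}
this_month = {"p90_first_review_hours": 3.0, "defect_escape_rate": 0.07}

warning = looks_gamed(last_month, this_month)
```

A flag here is a reason to look at what the fast first reviews actually contain, not proof that anyone is gaming the number.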
The most Goodhart-resistant review metrics are the ones that require genuine behavior to move: defect escape rate (hard to fake), PR iteration count (reflects how PRs are actually written and reviewed, not just how quickly someone responds), and reviewer load distribution (hard to improve without actually distributing work). Use these as your primary quality checks on whether the process metrics are reflecting real improvement.
Building a baseline before changing anything
If you're starting this process from scratch, the most important first step is establishing a baseline before making any process changes. Changes without a baseline make it impossible to know whether interventions are working.
At minimum, capture three weeks of historical data on: PR cycle time, time to first review, and review iteration count. Segment by author, by PR size, and by affected module. This baseline will tell you where your biggest problems actually are — which may not be where you assumed.
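A minimal sketch of the segmentation step, assuming each PR has already been exported as a record with an author, a size, and a cycle time. The field names, the size thresholds, and the sample records are all illustrative choices, not a standard.

```python
from collections import defaultdict
from statistics import median

def size_bucket(lines_changed):
    """Bucket a PR by size; the cutoffs are arbitrary and worth tuning per team."""
    if lines_changed < 50:
        return "small"
    if lines_changed < 400:
        return "medium"
    return "large"

def median_cycle_time_by(prs, key):
    """Group PRs by key(pr) and return the median cycle time of each group."""
    groups = defaultdict(list)
    for pr in prs:
        groups[key(pr)].append(pr["cycle_hours"])
    return {group: median(hours) for group, hours in groups.items()}

# Hypothetical baseline records.
prs = [
    {"author": "ana", "lines_changed": 20, "cycle_hours": 4},
    {"author": "ana", "lines_changed": 30, "cycle_hours": 6},
    {"author": "ben", "lines_changed": 120, "cycle_hours": 20},
    {"author": "ben", "lines_changed": 800, "cycle_hours": 70},
    {"author": "cy", "lines_changed": 600, "cycle_hours": 50},
]

by_size = median_cycle_time_by(prs, lambda pr: size_bucket(pr["lines_changed"]))
by_author = median_cycle_time_by(prs, lambda pr: pr["author"])
```

Passing the grouping as a function keeps the same helper usable for the module segmentation as well, once each record carries the affected module.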
Many teams that go through this exercise find their biggest problem is not slow review but late review: PRs that get approved quickly but only after sitting for a long time because no reviewer picked them up. Others find the opposite: reviews start immediately but require many back-and-forth cycles. These require different interventions, and you can't know which you have without the data.
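To tell the two patterns apart in baseline data, one rough heuristic is to compare how much of a PR's cycle was spent waiting for a first review against how many iterations it needed. The 50% wait-share threshold is an assumption; the iteration threshold of 3 follows the guidance given earlier for review iteration count.

```python
def dominant_delay(wait_hours, active_review_hours, iterations,
                   wait_share_threshold=0.5, iteration_threshold=3):
    """Label a PR by where most of its delay came from."""
    cycle = wait_hours + active_review_hours
    if cycle > 0 and wait_hours / cycle >= wait_share_threshold:
        return "late review"       # sat in the queue, then moved quickly
    if iterations > iteration_threshold:
        return "iterative review"  # picked up promptly, many round trips
    return "neither"

# Hypothetical PRs: (hours waiting for first review, hours in active review, iterations)
labels = [
    dominant_delay(40, 2, 1),
    dominant_delay(1, 30, 5),
    dominant_delay(1, 3, 1),
]
```

Counting the labels across the baseline period gives a quick read on which intervention to prioritize.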
Review metrics work when they're honest about what they measure, specific enough to be actionable, and shared openly with the team. They fail when they measure activity instead of outcomes, when they're surfaced as individual performance indicators, or when they're collected but never acted on. The goal is not a beautiful dashboard — it's a review process that your engineers trust and that produces code you can ship confidently.