The diff looked fine. Three functions modified, one added. Types all checked out, tests passed, CI was green. The reviewer, a senior engineer who had been working in this codebase for two years, approved it after fifteen minutes. The PR merged on a Thursday afternoon.
The incident was on Monday. A specific cohort of users — accounts created before a schema migration eighteen months ago, with a particular flag combination in their profile — started seeing incorrect billing calculations. It took eleven hours to trace.
The root cause wasn't in any of the modified functions individually. It was in the interaction between changed accumulator logic in calculateUserCredit() and a calling convention that three different billing flows depended on. The function's return type hadn't changed. Its behavior had, for about 3% of inputs, all belonging to that one cohort.
This is semantic drift. And it is far more common than most engineering teams realize.
Defining the problem precisely
Semantic drift refers to code changes that are syntactically correct, type-safe, and locally consistent — but alter the behavioral contract of a function or module in ways that are invisible at the diff level.
It's distinct from ordinary bugs in an important way: ordinary bugs are usually detectable by the person writing the code, or by a reviewer looking at the changed lines. Semantic drift is undetectable at the diff level because the problematic consequence exists at the call site, not at the function definition. The function looks correct. The function is correct, in isolation. The problem is what the function used to mean to its callers — and what it now means.
The key word is "drift." A single isolated change to a helper function is usually fine. The problem accumulates when: (1) a function with multiple callers in different contexts changes its behavior subtly, (2) the callers were written with the previous behavior as a precondition, and (3) no test covers the specific combination that breaks.
Why existing tools miss this
The standard toolkit for catching bugs before production includes linters, type checkers, unit tests, and integration tests. Each of these is genuinely valuable. None of them is designed to catch semantic drift, and here's why:
Linters operate on the AST of individual files. They can catch style violations, obvious errors, and some classes of logical mistake. They have no model of how a function is used across the codebase, so they can't reason about whether a behavioral change in one file affects the assumptions in another.
Type checkers verify that types are consistent. Semantic drift, almost by definition, doesn't change types — it changes the meaning of values within a type. A function that previously returned a non-negative integer now sometimes returns a negative one. The type is still int. The type checker passes.
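A minimal sketch of that situation follows. The function names and the overdraft scenario are hypothetical, chosen only for illustration. Both versions share one signature, so a type checker accepts either, yet only the first guarantees a non-negative result.

```typescript
// Hypothetical example: both functions have the signature
// (balance: number, charge: number) => number.

// Before: the result is clamped, so callers can rely on it being >= 0.
function remainingCreditBefore(balance: number, charge: number): number {
  return Math.max(0, balance - charge);
}

// After: the clamp was removed (say, to support overdrafts).
// The return type is still `number`, so the type checker sees no difference.
function remainingCreditAfter(balance: number, charge: number): number {
  return balance - charge;
}

// A variable typed with the old signature accepts the new implementation.
const remainingCredit: (balance: number, charge: number) => number =
  remainingCreditAfter;

console.log(remainingCreditBefore(10, 25)); // 0
console.log(remainingCredit(10, 25)); // -15
```

For inputs where the balance covers the charge, the two versions agree, which is why fixtures built around the common case would not tell them apart.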
Unit tests catch a regression only if the specific changed behavior is covered. Most of the time, it isn't. Tests are written to cover the behavior that was intended at the time, not the behavior that callers in different contexts implicitly depend on. The gap between "what tests cover" and "what callers assume" is exactly where semantic drift lives.
Integration tests can catch the combination failures, but they're typically slow, often run only after the PR merges, and frequently lack the specific input combination that triggers the broken behavior (because that combination was rare or not anticipated).
The result: semantic drift tends to be caught in production, by users, under specific input conditions, hours or days after merge.
The AST is not enough — you need the call graph
Detecting semantic drift requires reasoning about the relationship between a changed function and the rest of the codebase. That relationship is encoded in the call graph: a directed representation of which functions call which other functions, under what conditions, and with what parameter shapes.
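As a rough sketch of the data structure, a call graph can be reduced to a map from each caller to its callees. The function names below are hypothetical, and this simplified form drops the call conditions and parameter shapes that a real analysis would also record. Inverting the map answers the question that matters at review time: who, directly or transitively, calls the function that changed?

```typescript
// Edges point from caller to callee. All function names are hypothetical.
type CallGraph = Map<string, string[]>;

const callGraph: CallGraph = new Map<string, string[]>([
  ["checkoutHandler", ["finalizeOrder"]],
  ["finalizeOrder", ["applyDiscount", "computeTax"]],
  ["previewCart", ["applyDiscount"]],
  ["applyDiscount", ["lookupLoyaltyTier"]],
]);

// Invert the graph so we can ask "who calls X?" instead of "what does X call?".
function invert(graph: CallGraph): CallGraph {
  const reversed: CallGraph = new Map<string, string[]>();
  graph.forEach((callees, caller) => {
    for (const callee of callees) {
      const callers = reversed.get(callee) ?? [];
      callers.push(caller);
      reversed.set(callee, callers);
    }
  });
  return reversed;
}

// Breadth-first walk over the inverted graph: direct and transitive callers.
function transitiveCallers(graph: CallGraph, changed: string): string[] {
  const reversed = invert(graph);
  const seen = new Set<string>();
  const queue: string[] = [changed];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const caller of reversed.get(current) ?? []) {
      if (!seen.has(caller)) {
        seen.add(caller);
        queue.push(caller);
      }
    }
  }
  return Array.from(seen).sort();
}

console.log(transitiveCallers(callGraph, "applyDiscount"));
// ["checkoutHandler", "finalizeOrder", "previewCart"]
```

The `seen` set also keeps the walk from looping forever on recursive or mutually recursive functions.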
Consider a simplified scenario. A function applyDiscount(cart, user) is changed to return a discount value that can now be negative for certain loyalty program edge cases. Before the change, the caller finalizeOrder() trusted that applyDiscount() would never return a value below zero, and used the result directly in a price calculation without clamping. After the change, finalizeOrder() is technically broken — but its code hasn't changed, its tests still pass (the test fixtures don't include those edge cases), and the reviewer who approved applyDiscount()'s modification never saw finalizeOrder().
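The scenario can be sketched in a few lines. The `Cart` and `User` shapes, the percentage rule, and the negative points balance are assumptions made up for this illustration. Passing `applyDiscount` in as a parameter is only a device for running the old and new versions side by side.

```typescript
// Hypothetical shapes; amounts are integer cents to avoid float noise.
interface Cart { subtotalCents: number; }
interface User { loyaltyPoints: number; }

// Before the change: the percentage is clamped to 0..50, so the
// discount is never below zero.
function applyDiscountBefore(cart: Cart, user: User): number {
  const percent = Math.min(Math.max(user.loyaltyPoints, 0), 50);
  return Math.round((cart.subtotalCents * percent) / 100);
}

// After the change: a negative points balance (an assumed loyalty edge
// case) now flows through and yields a negative discount.
function applyDiscountAfter(cart: Cart, user: User): number {
  const percent = Math.min(user.loyaltyPoints, 50);
  return Math.round((cart.subtotalCents * percent) / 100);
}

// The caller is untouched by the PR. It subtracts the discount without
// clamping, which was safe only while the discount could not go below zero.
function finalizeOrder(
  cart: Cart,
  user: User,
  applyDiscount: (cart: Cart, user: User) => number,
): number {
  return cart.subtotalCents - applyDiscount(cart, user);
}

const cart: Cart = { subtotalCents: 10000 };
const typicalUser: User = { loyaltyPoints: 10 };
const edgeCaseUser: User = { loyaltyPoints: -20 };

// Existing fixtures only cover the typical user, so both versions agree.
console.log(finalizeOrder(cart, typicalUser, applyDiscountBefore)); // 9000
console.log(finalizeOrder(cart, typicalUser, applyDiscountAfter)); // 9000

// The edge case is where the total silently exceeds the subtotal.
console.log(finalizeOrder(cart, edgeCaseUser, applyDiscountBefore)); // 10000
console.log(finalizeOrder(cart, edgeCaseUser, applyDiscountAfter)); // 12000
```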
To detect this class of issue at PR time, you need to:
- Parse the changed functions and model what behavioral assumptions they previously provided to callers (implicit range constraints, null safety guarantees, ordering guarantees).
- Traverse the call graph to identify all callers that depended on those assumptions.
- Compare the new behavior against what callers expect.
This is not something a human reviewer can reliably do for a codebase of any real scale. A 500,000-line TypeScript monorepo has a call graph with hundreds of thousands of edges. A reviewer can hold maybe 10–20 related function relationships in working memory at once. The analysis needs to be automated.
What makes a behavioral change "semantic" vs. just a bug
There's a reasonable objection here: isn't all of this just "write better tests"? If the callers had comprehensive tests, they'd catch the regression.
This is partially true, and better test coverage clearly matters. Our claim is that semantic drift is a distinct category of problem that requires a distinct detection approach, for two reasons.
First, comprehensive test coverage that accounts for all cross-function behavioral assumptions is not a realistic target for most codebases. The combinatorial space of "function A relies on property P of function B, which was written by a different engineer three years ago" is enormous. Teams make pragmatic choices about coverage, and those choices will always leave gaps.
Second, semantic drift detection at the call-graph level can identify gaps that test authors didn't know to fill. If the analysis tells you "this changed function is called from a payment flow that doesn't test the negative-value case you just introduced," you can add that test before the PR merges — not because you reviewed the caller, but because the analysis traversed the graph for you.
A concrete detection pattern
The way we approach semantic drift detection in practice involves three layers of analysis on every PR diff:
1. Invariant extraction. Before and after the change, we parse the function body to extract implicit invariants — return value constraints, precondition checks, exception conditions. This uses a combination of AST analysis and lightweight dataflow analysis to model what the function's callers could have safely assumed.
2. Call graph traversal. We traverse the repository's call graph (built on the default branch baseline) to identify all direct and transitive callers of the changed function. For each caller, we check whether its usage of the changed function's return value or side effects appears to depend on an invariant that the PR has modified.
3. Test coverage intersection. We cross-reference the changed behavior with the existing test suite for the calling functions. If a caller's tests don't exercise the new behavioral branch, that's a specific, actionable flag: "caller X depends on a changed invariant and has no test covering the affected path."
This three-step process is computationally heavier than a linter pass, but well within the latency budget for PR-time analysis (typically completing in 15–45 seconds for medium-sized repos). The signal it produces is qualitatively different from what lint or type checking can provide.
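The three layers can be illustrated with a deliberately small toy model. It is not the actual analysis described above. It assumes an invariant is just a numeric return range, that each caller's assumption and tested values are already known, and that the caller list is the output of a prior call graph traversal. All names and numbers are hypothetical.

```typescript
// Toy model: an "invariant" is a closed interval the return value stays in.
// A real tool would derive this from AST and dataflow analysis instead.
interface ReturnRange { min: number; max: number; }

interface Caller {
  name: string;
  // The return range this caller's code is safe for (assumed known here).
  assumes: ReturnRange;
  // Return values of the changed function that this caller's tests exercise.
  testedValues: number[];
}

interface Flag { caller: string; reason: string; }

// Layer 1: did the change widen the range callers could previously rely on?
function invariantWeakened(before: ReturnRange, after: ReturnRange): boolean {
  return after.min < before.min || after.max > before.max;
}

// Layer 2 (reduced): keep callers whose assumption the new range violates.
function affectedCallers(callers: Caller[], after: ReturnRange): Caller[] {
  return callers.filter(
    (c) => after.min < c.assumes.min || after.max > c.assumes.max,
  );
}

// Layer 3: flag affected callers with no test value outside the old range.
function analyze(before: ReturnRange, after: ReturnRange, callers: Caller[]): Flag[] {
  if (!invariantWeakened(before, after)) return [];
  return affectedCallers(callers, after)
    .filter((c) => !c.testedValues.some((v) => v < before.min || v > before.max))
    .map((c) => ({
      caller: c.name,
      reason: "depends on a changed invariant and has no test covering the affected path",
    }));
}

const flags = analyze(
  { min: 0, max: 100 }, // applyDiscount before the PR
  { min: -50, max: 100 }, // applyDiscount after the PR
  [
    { name: "finalizeOrder", assumes: { min: 0, max: 100 }, testedValues: [0, 10] },
    { name: "refundOrder", assumes: { min: 0, max: 100 }, testedValues: [-5, 20] },
    { name: "previewCart", assumes: { min: -Infinity, max: Infinity }, testedValues: [5] },
  ],
);

console.log(flags.map((f) => f.caller)); // ["finalizeOrder"]
```

In this example `refundOrder` depends on the old range but already has a test with a negative value, and `previewCart` makes no assumption about the range, so only `finalizeOrder` is flagged.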
The limits of automated semantic analysis
It's worth being direct about what this approach can and can't catch. Automated call-graph analysis is effective at finding violations of structural invariants — things that can be modeled by examining the code itself. It is less effective at finding violations of domain invariants — behavioral contracts that exist only in the minds of the engineers who built a system and are not encoded in code or tests.
Example: "orders for wholesale accounts should never be processed through the retail discount pipeline" is a domain invariant. It might or might not be enforced in code. If it's only in documentation or institutional knowledge, static analysis won't find it. A senior engineer who knows the domain will. This is one of several reasons human code review remains valuable even with comprehensive automated analysis — but it suggests where human attention should be focused: on domain semantics, not on structural invariants the tool can already surface.
The drift accumulation problem
One more dimension worth naming: semantic drift is not always introduced in a single PR. It accumulates. Each small change to a function that subtly shifts its behavioral contract widens the gap between "what callers assume" and "what the function actually does." The system doesn't break until the gap is wide enough that a real input lands in it, and by then many PRs have contributed to it, making root cause analysis genuinely difficult.
This means that PR-time detection has a preventive value beyond just catching individual problematic changes. If every PR that modifies a widely-used function flags the callers that may be affected, engineers build a habit of explicitly validating or updating those callers. Over time, implicit behavioral contracts become explicit: encoded in tests, in assertions, or at minimum in review comments. The codebase becomes more legible, not just because the current PR is safer, but because the act of surfacing the analysis repeatedly builds shared understanding of the system's structure.
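One way an implicit contract becomes explicit is an assertion at the call site. The helper and function names below are hypothetical, and this is a sketch of the idea rather than a recommended pattern for every caller.

```typescript
// Hypothetical helper: turns an implicit assumption into a checked one.
// The `!(value >= 0)` form also rejects NaN.
function assertNonNegative(value: number, label: string): number {
  if (!(value >= 0)) {
    throw new RangeError(`${label} must be non-negative, got ${value}`);
  }
  return value;
}

// The caller now states what it relies on instead of silently trusting it.
function finalizeOrderTotal(subtotalCents: number, discountCents: number): number {
  return subtotalCents - assertNonNegative(discountCents, "discount");
}

console.log(finalizeOrderTotal(10000, 1000)); // 9000

try {
  finalizeOrderTotal(10000, -2000);
} catch (err) {
  console.log((err as Error).message); // discount must be non-negative, got -2000
}
```

With the check in place, a later change that lets the discount go negative fails loudly at the caller instead of quietly producing a wrong total.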
Semantic drift is the bug class you only notice when it costs you. The analysis infrastructure to catch it exists — it's not novel computer science. The gap is that most review processes don't invoke it at the right moment: while the PR is still open and the author is still looking at the code.