Case Study · Tooling

Code Review by Consensus

McVroom, N. (2026). Code Review by Consensus. TPC Technical Reports, TPC-2026-001. nickmcvroom.com/work/tribunal-ai-code-review
Tags: GitHub Actions · AI · TypeScript · Code Review

Abstract

Tribunal orchestrates multi-model AI code review through a three-tier escalation cascade with cross-provider consensus. It renders verdicts, not comments.

Problem

AI code review that is advisory, not authoritative

Approach

Multi-model consensus with tiered escalation

Outcome

Automated PR approval that branch protection trusts

If you’re a solo developer with branch protection enabled, you have a problem. GitHub won’t let you approve your own pull requests. So you either merge with admin bypass, which defeats the entire purpose of having branch protection, or you accept that your review process is decorative.

Every automated reviewer I tried offered the same half-solution: read the diff, post some comments, move on. Comments are advisory. They don’t set the approval state on the PR. You still can’t merge without bypassing protection. The review bot is technically present but functionally toothless.

That’s why I built Tribunal. It doesn’t just comment; it renders a verdict. APPROVE or REQUEST_CHANGES, submitted as a proper GitHub review status. Branch protection sees a legitimate approval from a reviewer that isn’t you. No admin bypass. No ceremony.
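The mechanics of "a verdict, not a comment" come down to one field on the GitHub review API. A minimal sketch, assuming the Action uses Octokit; the `toReviewEvent` helper and the commented call are illustrative, not Tribunal's actual code:

```typescript
// Map the consensus outcome onto the GitHub review API's `event` field.
// 'COMMENT' is deliberately absent: Tribunal always renders a verdict.
type Verdict = 'APPROVE' | 'REQUEST_CHANGES';

function toReviewEvent(blockingFindings: number): Verdict {
  return blockingFindings > 0 ? 'REQUEST_CHANGES' : 'APPROVE';
}

// Submitting the verdict as a real review (not an issue comment) is what
// lets branch protection count it. Sketch using @octokit/rest:
//
//   await octokit.pulls.createReview({
//     owner, repo, pull_number,
//     event: toReviewEvent(blockingCount),
//     body: summaryMarkdown,
//   });
```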

But if you’re going to trust an automated reviewer to gate your merges, “one model, one opinion” isn’t good enough. A single model hallucinating a false positive blocks your PR. A single model missing a real bug lets it through. So Tribunal assembles a panel of models from different providers, has them review independently, then synthesises their findings through a weighted consensus process. A finding that two models from different capability tiers flag independently is a stronger signal than three models from the same tier agreeing. Cross-validation, not majority vote.

It runs as a GitHub Action. Bring your own API keys. Your code never touches my infrastructure.

The Escalation Cascade

Tribunal organises models into three tiers, each named for increasing authority. Every PR goes through the first tier. Higher tiers activate only when the evidence warrants it.

Tribunal, the first tier, runs on every PR. The panel is fast, cheap models: Haiku, GPT-5-mini, Grok-3-mini, Gemini 3 Flash (preview). All four review the same categories: bugs, security issues, convention violations, and suggestions. The goal isn’t depth; it’s breadth: four models from four providers catching low-hanging fruit fast.

Conclave activates when Tribunal surfaces medium or high confidence issues, when the PR is large, or when changes touch sensitive file paths. The panel steps up to mid-tier models: Sonnet, GPT-5, Grok-4, Gemini 2.5 Pro. Each receives the prior tier’s findings as context, so they’re not re-discovering what Tribunal already found. They’re verifying, deepening, and catching what the fast models missed.

Sanctum activates only when Conclave surfaces critical or high-severity security or bug findings. The panel is frontier models: Opus, GPT-5.2, Gemini 3 Pro (preview), GLM-5. This tier exists for the rare PR where something genuinely dangerous might be shipping. The cost of running frontier models on every PR is unjustifiable. The cost of missing a critical security issue is worse.

Each tier receives the full context of every tier below it. Sanctum knows what both Conclave and Tribunal found. This is “increasing certainty”: the same concerns examined with progressively more capable reasoning, not “different concerns” examined by specialised models.
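The escalation rules above can be sketched as pure decision functions. The thresholds and sensitive-path patterns below are illustrative assumptions, not Tribunal's actual configuration:

```typescript
interface TierResult {
  maxConfidence: 'low' | 'medium' | 'high';
  hasCriticalOrHighSecurityOrBug: boolean;
}

interface PrContext {
  changedFiles: string[];
  diffLines: number;
}

// Illustrative values; in Tribunal these would be configuration.
const LARGE_DIFF_LINES = 400;
const SENSITIVE_PATHS = [/^src\/auth\//, /\.github\/workflows\//];

// Conclave: medium+ confidence findings, large diffs, or sensitive paths.
function shouldEscalateToConclave(tribunal: TierResult, pr: PrContext): boolean {
  return (
    tribunal.maxConfidence !== 'low' ||
    pr.diffLines > LARGE_DIFF_LINES ||
    pr.changedFiles.some((f) => SENSITIVE_PATHS.some((p) => p.test(f)))
  );
}

// Sanctum: only critical/high-severity security or bug findings.
function shouldEscalateToSanctum(conclave: TierResult): boolean {
  return conclave.hasCriticalOrHighSecurityOrBug;
}
```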

Fig. 1 — Three-tier cascade. Each tier receives findings from all tiers below it. Tribunal (Haiku, GPT-5-mini, Grok-3-mini, Gemini 3 Flash) runs on every PR; Conclave (Sonnet, GPT-5, Grok-4, Gemini 2.5 Pro) activates on medium+ confidence, large diffs, or sensitive paths; Sanctum (Opus, GPT-5.2, Gemini 3 Pro, GLM-5) activates rarely, on critical or high-severity security or bug findings.

Consensus Through Cross-Validation

Raw findings from multiple models are noisy. Four models reviewing the same diff will flag overlapping issues with different wording, different line references, and different severity assessments. Posting all of them as PR comments would be worse than posting none.

The consensus pipeline turns noise into signal through five steps.

Matching groups findings across models by file, type, and location. Two findings pointing at the same file within five lines of each other, flagging the same category of issue, are treated as the same finding from independent sources.
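The matching rule (same file, same finding type, within five lines) might look like this; the grouping is as described, while the `Finding` shape and greedy strategy are illustrative:

```typescript
interface Finding {
  model: string;
  tier: 'tribunal' | 'conclave' | 'sanctum';
  file: string;
  line: number;
  type: 'bug' | 'security' | 'convention' | 'suggestion';
}

const LINE_WINDOW = 5;

// Greedy grouping: a finding joins an existing group if it shares
// file + type with the group's anchor and sits within the line window.
function matchFindings(findings: Finding[]): Finding[][] {
  const groups: Finding[][] = [];
  for (const f of findings) {
    const group = groups.find((g) => {
      const anchor = g[0];
      return (
        anchor.file === f.file &&
        anchor.type === f.type &&
        Math.abs(anchor.line - f.line) <= LINE_WINDOW
      );
    });
    if (group) group.push(f);
    else groups.push([f]);
  }
  return groups;
}
```

Each resulting group represents one issue flagged by one or more independent sources.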

Confidence scoring weights matched findings by tier diversity. A bug flagged by one Tribunal model and one Conclave model scores higher than the same bug flagged by three Tribunal models. The reasoning is straightforward: if a fast model and a more capable model independently identify the same problem, that’s genuine cross-validation. If three fast models agree, they might be making the same mistake: similar architectures, similar training data, similar blind spots.
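Tier-diversity weighting can be sketched as counting distinct tiers in a matched group rather than distinct models. The specific weights are illustrative assumptions:

```typescript
// Each matched group is one issue; each member records which tier found it.
interface GroupedFinding {
  tier: 'tribunal' | 'conclave' | 'sanctum';
}

// Illustrative weights: distinct tiers dominate the score, while extra
// same-tier agreement adds only a small bonus, since same-tier models
// may share blind spots.
function confidenceScore(group: GroupedFinding[]): number {
  const tiers = new Set(group.map((f) => f.tier)).size;
  const sameTierExtras = group.length - tiers;
  return tiers * 1.0 + sameTierExtras * 0.25;
}
```

Under these weights, one Tribunal model plus one Conclave model (2.0) outscores three Tribunal models (1.5), matching the cross-validation intuition above.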

Severity matrix maps confidence against finding type on a four-by-four grid. Security findings are amplified: even a low-confidence security finding escalates to medium severity. A high-confidence bug is critical. A low-confidence style suggestion stays low. The matrix encodes the asymmetry between “we missed a real bug” and “we flagged a style nit”; the cost of false negatives isn’t symmetric across categories.
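One way to encode the four-by-four matrix, with the security amplification described above. The exact cell values and the fourth confidence level (`certain`) are assumptions for illustration:

```typescript
type Confidence = 'low' | 'medium' | 'high' | 'certain';
type FindingType = 'security' | 'bug' | 'convention' | 'suggestion';
type Severity = 'low' | 'medium' | 'high' | 'critical';

// Rows: finding type. Columns: confidence, low through certain.
// Security is amplified a full step; suggestions are capped low.
const SEVERITY_MATRIX: Record<FindingType, [Severity, Severity, Severity, Severity]> = {
  security:   ['medium', 'high',   'critical', 'critical'],
  bug:        ['low',    'medium', 'critical', 'critical'],
  convention: ['low',    'low',    'medium',   'medium'],
  suggestion: ['low',    'low',    'low',      'low'],
};

const CONFIDENCE_INDEX: Record<Confidence, number> = {
  low: 0, medium: 1, high: 2, certain: 3,
};

function severity(type: FindingType, confidence: Confidence): Severity {
  return SEVERITY_MATRIX[type][CONFIDENCE_INDEX[confidence]];
}
```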

Verdict is binary. If any finding reaches critical or high severity after the matrix, the PR gets REQUEST_CHANGES. Otherwise, APPROVE. No “maybe” state. No “looks mostly fine.” The review either found something worth blocking on, or it didn’t.
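The binary verdict is then a one-liner over the post-matrix findings (a sketch; the `ScoredFinding` shape is illustrative):

```typescript
type Severity = 'low' | 'medium' | 'high' | 'critical';

// A finding after the severity matrix has been applied.
interface ScoredFinding { severity: Severity }

// Binary verdict: any critical or high finding blocks the PR.
function verdict(findings: ScoredFinding[]): 'APPROVE' | 'REQUEST_CHANGES' {
  const blocking = findings.some(
    (f) => f.severity === 'critical' || f.severity === 'high',
  );
  return blocking ? 'REQUEST_CHANGES' : 'APPROVE';
}
```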

Verification is the final pass. A Haiku instance reads the actual file context, not just the diff, around every flagged line and attempts to refute each finding. This catches the most common failure mode of AI code review: flagging something that looks wrong in the diff but is correct when you see the surrounding code. False positives that survive verification have earned their place in the review.
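The verification pass itself is an LLM call, but the bookkeeping around it is simple: refuted findings are dropped, survivors keep their place. A sketch, where `refute` stands in for the Haiku call and both callback shapes are assumptions:

```typescript
interface VerifiedFinding {
  id: string;
  file: string;
  line: number;
}

// `readContext` loads the surrounding file content (not just the diff)
// around the flagged line; `refute` is a stand-in for the Haiku call
// that tries to disprove the finding against that context.
function verify(
  findings: VerifiedFinding[],
  readContext: (file: string, line: number) => string,
  refute: (finding: VerifiedFinding, fileContext: string) => boolean,
): VerifiedFinding[] {
  return findings.filter((f) => !refute(f, readContext(f.file, f.line)));
}
```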

Fig. 2 — Raw findings become actionable verdicts through five stages: match (file + line + type), confidence (tier diversity), severity (4×4 matrix), verdict (APPROVE / REQUEST_CHANGES), verify (refute false positives; refuted findings removed).

In Practice

A PR adding a time-pressure system demonstrated the full cascade. In the Tribunal tier, GPT-5-mini flagged a bug: resume() in a timer hook unconditionally sets startTimeRef even when the mode is zen, violating the hook’s contract that zen mode produces no timing data. The other three Tribunal models returned clean.

One model with a medium-confidence finding was enough to trigger Conclave. GPT-5 independently flagged the same bug at the same location. Grok-4, Sonnet, and Gemini 2.5 Pro found nothing. Two models in two different tiers, arriving at the same conclusion without seeing each other’s work; the consensus engine scored this as high confidence and the severity matrix elevated it to HIGH.

HIGH severity triggered Sanctum. GPT-5.2 confirmed the zen-mode bug and surfaced a second issue the lower tiers missed entirely: pause() doesn’t cancel the active requestAnimationFrame tick, so an extra frame fires after every pause. Opus, Gemini 3 Pro, and GLM-5 found nothing.

Final verdict: REQUEST_CHANGES. Three models across all three tiers converged on the same bug. Twelve models ran in total. Cost: $0.37.

Compare that to a refactoring PR that extracted duplicated pressure-zone mappings. Tribunal’s four models ran; only GPT-5-mini found suggestions. Conclave activated on borderline confidence; GPT-5 agreed on one finding. Nothing reached critical severity, so Sanctum never activated. Verdict: APPROVE with three low-severity suggestions. Eight models, two tiers, $0.10.

The cascade paid for itself in the first case: a real bug caught by cross-tier validation that no single model flagged with enough confidence to act on alone. It stayed cheap in the second, where the evidence didn’t warrant frontier models.

BYOK and the Trust Problem

Tribunal is bring-your-own-keys by design. Users configure API keys for each provider they want to use in their GitHub Actions secrets. The models are called directly from the Action runner; diffs and findings flow between the runner and the provider APIs. Nothing passes through any intermediary.

This solves two problems at once. The cost problem: users control their own API spend, and the tiered escalation means most PRs only incur the cost of four fast-model calls. The trust problem: sending proprietary code to a third-party review service is a non-starter for many teams. With BYOK, the code goes to the same model providers the team likely already has agreements with. No new vendor, no new data processing terms, no new attack surface.

The trade-off is that users need API keys from multiple providers to get the full benefit. A single-provider configuration still works (Tribunal runs with whatever models are available), but the cross-provider diversity that makes the consensus mechanism valuable is reduced.

Infrastructure Integration

Tribunal doesn’t exist in isolation. It’s configured as a reusable GitHub Action in a shared utilities repository and referenced by CI workflows generated by LATHE, a project scaffolding tool. Every project scaffolded by LATHE gets Tribunal review on its first PR without any additional configuration.

This creates a self-reinforcing pipeline: LATHE generates the project structure, the CI configuration, and the infrastructure tests; Tribunal reviews every PR against the generated codebase. The tooling’s opinions (naming conventions, file structure, test patterns) are enforced by the same AI review that checks for bugs and security issues. The standards are embedded in the automation, not documented in a wiki that nobody reads.

What I’d Build Next

The verification pass currently uses a single Haiku instance. It should use the same cross-validation approach as the review itself: multiple verifiers, with consensus on which findings to keep and which to refute. A finding that one verifier refutes and another confirms is genuinely ambiguous and should be flagged as such rather than silently dropped.

The escalation triggers are rule-based: file count thresholds, path pattern matches, confidence levels. They should learn from historical review accuracy. If Conclave consistently confirms Tribunal’s findings for a particular repository without adding new ones, the trigger threshold for that repo should rise. If Conclave frequently overturns Tribunal findings, the threshold should drop.

There’s no feedback loop from human reviewers. When a developer dismisses a Tribunal comment or resolves it without changes, that signal is lost. Feeding human review decisions back into the confidence scoring would improve accuracy over time; findings that humans consistently dismiss should lose weight, findings they consistently act on should gain it.

The action has no telemetry. It doesn’t log wall-clock timing, escalation rates, or false positive rates. Writing this case study exposed the gap: I couldn’t back claims about review speed or escalation frequency with data because the data doesn’t exist. Adding lightweight instrumentation (tier durations, escalation triggers, verdict distribution, finding survival rate through verification) would turn Tribunal from a tool I trust by feel into one I can tune by evidence.

Outcome

Tribunal runs on every PR across the CRACKPOINT, AFTERTOUCH, and LATHE repositories. The three-tier architecture means most reviews only incur the cost of Tribunal’s fast models: escalation to Conclave or Sanctum is the exception, not the norm. Tribunal-only reviews run all models in parallel, so they’re bounded by the slowest model’s response time. Escalated reviews take longer since tiers run sequentially, but the expensive models only activate when the evidence demands it.

The design principle is simple: no single model is reliable enough to trust with code review. But a panel of models from different providers, at different capability tiers, cross-validated through weighted consensus and verified against full file context, is a review process worth paying attention to.

References

McVroom, N. (2026). Infrastructure That Tests Itself. TPC Technical Reports, TPC-2026-002. nickmcvroom.com/work/lathe-infrastructure-testing