RubricEval
Define a versioned rubric of weighted, gated criteria. An LLM grades a submitted GitHub repo or ZIP against the real code, but a deterministic policy in code makes the final accept / review / reject call, so every decision is reproducible and auditable.
The problem
Most 'AI code reviewers' tangle three things together: what to evaluate, how to judge it, and how to decide. RubricEval separates them — the rubric is versioned data, the LLM only grades each criterion against the real files, and a pure policy function makes the decision — so one prompt tweak can't silently flip every result, and every decision is reproducible.
My role
Designed and built the full-stack platform — a FastAPI + async SQLAlchemy backend with a durable job queue, behind a Next.js / TypeScript frontend.
How it works
Engineering highlights
The model grades, code decides
An LLM scores each criterion, but a pure, exhaustively-tested policy function makes the accept / review / reject call. The same scores always produce the same decision, so the decision logic itself is never at the mercy of LLM nondeterminism.
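As an illustration of what "code decides" can look like, here is a minimal sketch of a pure policy function. The names (`CriterionScore`, `decide`), the thresholds, and the rule that a failed gate forces a reject are all assumptions made for this example, not RubricEval's actual policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Decision(str, Enum):
    ACCEPT = "accept"
    REVIEW = "review"
    REJECT = "reject"


@dataclass(frozen=True)
class CriterionScore:
    weight: float  # relative importance of the criterion
    score: float   # LLM-assigned grade, assumed to lie in [0, 1]
    gate: bool     # True if failing this criterion blocks acceptance
    passed: bool   # whether the grade cleared the criterion's own bar


def decide(
    scores: Sequence[CriterionScore],
    accept_at: float = 0.8,  # hypothetical threshold
    review_at: float = 0.5,  # hypothetical threshold
) -> Decision:
    """Pure function: no I/O, no randomness, same input gives same output."""
    # Assumption: any failed gated criterion rejects outright.
    if any(s.gate and not s.passed for s in scores):
        return Decision.REJECT

    total_weight = sum(s.weight for s in scores)
    if total_weight <= 0:
        # Nothing to weigh, so defer to a human.
        return Decision.REVIEW

    weighted = sum(s.weight * s.score for s in scores) / total_weight
    if weighted >= accept_at:
        return Decision.ACCEPT
    if weighted >= review_at:
        return Decision.REVIEW
    return Decision.REJECT
```

Because the function depends only on its arguments, it can be covered by a plain table of input/output cases without ever calling a model.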
Evidence verified against real files
Every citation the model makes (path, line range, quote) is checked against the actual code and flagged if it can't be verified — fabricated evidence can't masquerade as proof.
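The check described above could be sketched roughly as follows. The `Citation` shape, the whitespace-insensitive matching, and the in-memory `files` mapping are assumptions chosen for illustration; the real verifier may match more strictly or more loosely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    path: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    quote: str


def _normalize(text: str) -> str:
    # Collapse all runs of whitespace so indentation differences don't matter.
    return " ".join(text.split())


def verify_citation(citation: Citation, files: dict[str, str]) -> bool:
    """Return True only if the quote really appears in the cited line range."""
    source = files.get(citation.path)
    if source is None:
        return False  # the cited file does not exist in the submission

    lines = source.splitlines()
    if not (1 <= citation.start_line <= citation.end_line <= len(lines)):
        return False  # the line range falls outside the file

    if not citation.quote.strip():
        return False  # an empty quote proves nothing

    cited = "\n".join(lines[citation.start_line - 1 : citation.end_line])
    return _normalize(citation.quote) in _normalize(cited)
```

A citation that fails this check would be flagged as unverified rather than silently trusted.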
Rubrics are versioned data
Each rubric is canonicalized and content-hashed; every review records the rubric hash, model, and prompt version, so a published rubric can't silently change and any result is reproducible.
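One common way to canonicalize and content-hash structured data is to serialize it as JSON with sorted keys and fixed separators, then hash the bytes. The sketch below assumes the rubric is a JSON-compatible dict and uses SHA-256; the function name `rubric_hash` and the example rubric fields are hypothetical.

```python
import hashlib
import json


def rubric_hash(rubric: dict) -> str:
    """Hash a rubric so that key order and whitespace do not affect the result."""
    canonical = json.dumps(
        rubric,
        sort_keys=True,         # stable key order
        separators=(",", ":"),  # no incidental whitespace
        ensure_ascii=False,     # keep non-ASCII text as-is before encoding
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical rubric content, for illustration only.
v1 = {"name": "api-quality", "criteria": [{"id": "tests", "weight": 2, "gate": True}]}
same_v1 = {"criteria": [{"gate": True, "weight": 2, "id": "tests"}], "name": "api-quality"}
v2 = {"name": "api-quality", "criteria": [{"id": "tests", "weight": 3, "gate": True}]}

print(rubric_hash(v1) == rubric_hash(same_v1))  # key order is irrelevant
print(rubric_hash(v1) == rubric_hash(v2))       # any content change alters the hash
```

Storing this hash alongside the model and prompt version on each review makes it possible to tell exactly which rubric content produced a given result.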
Built to run reliably
The platform includes a durable, leased job queue that survives crashes and scales across workers, live replayable SSE streaming, a FakeLLM port for offline/CI runs, and a golden-set regression harness.
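To show why leasing lets a queue survive worker crashes, here is a small in-memory sketch of lease semantics. It is not the project's queue, which is described as durable; the class names, the 30-second lease, and the explicit `now` argument (used instead of a real clock to keep the example deterministic) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    id: int
    status: str = "queued"  # queued | leased | done
    leased_by: Optional[str] = None
    lease_expires_at: float = 0.0


class LeasedQueue:
    def __init__(self, lease_seconds: float = 30.0) -> None:
        self.lease_seconds = lease_seconds
        self.jobs: dict[int, Job] = {}

    def enqueue(self, job_id: int) -> None:
        self.jobs[job_id] = Job(id=job_id)

    def claim(self, worker: str, now: float) -> Optional[Job]:
        """Lease the first job that is queued or whose lease has expired."""
        for job in self.jobs.values():
            expired = job.status == "leased" and job.lease_expires_at <= now
            if job.status == "queued" or expired:
                job.status = "leased"
                job.leased_by = worker
                job.lease_expires_at = now + self.lease_seconds
                return job
        return None

    def complete(self, job_id: int, worker: str, now: float) -> bool:
        """Only the current, unexpired lease holder may finish a job."""
        job = self.jobs[job_id]
        if (
            job.status != "leased"
            or job.leased_by != worker
            or job.lease_expires_at <= now
        ):
            return False
        job.status = "done"
        return True
```

If a worker crashes mid-job, its lease simply expires and another worker can claim the job. In a database-backed version the claim step would need to be a single atomic update so that two workers cannot lease the same job.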
Outcomes
- Turns code review into a repeatable, auditable, rubric-driven decision — not a one-off LLM opinion.
- Runs the whole engine deterministically offline (FakeLLM) and measures every prompt/model change against a golden set.