2025
A rubric-driven code-evaluation platform

RubricEval

Define a versioned rubric of weighted, gated criteria. An LLM grades a submitted GitHub repo or ZIP criterion by criterion against the real code, but a deterministic policy implemented in code makes the final accept / review / reject call, so every decision is reproducible and auditable.

The problem

Most 'AI code reviewers' tangle three things together: what to evaluate, how to judge it, and how to decide. RubricEval separates them — the rubric is versioned data, the LLM only grades each criterion against the real files, and a pure policy function makes the decision — so one prompt tweak can't silently flip every result, and every decision is reproducible.

My role

Designed and built the full-stack platform — a FastAPI + async SQLAlchemy backend with a durable job queue, behind a Next.js / TypeScript frontend.

How it works

01 Repo or ZIP
02 Ingest + normalize
03 Grade per criterion
04 Verify evidence
05 Deterministic policy
06 Streamed report

Engineering highlights

The model grades, code decides

An LLM scores each criterion, but a pure, exhaustively tested policy function makes the accept / review / reject call, so the same scores always produce the same decision and the final call is never left to LLM nondeterminism.
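
To make the split concrete, here is a minimal sketch of what such a policy function could look like. It is not the project's actual code: the names `CriterionResult`, `decide`, `gate_min`, and the thresholds `accept_at` / `review_at` are all assumptions chosen for illustration. The point it shows is that the decision depends only on the graded results passed in, with no LLM call, clock, or I/O.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    ACCEPT = "accept"
    REVIEW = "review"
    REJECT = "reject"


@dataclass(frozen=True)
class CriterionResult:
    """One graded criterion (hypothetical shape, for illustration only)."""
    criterion_id: str
    weight: float           # relative importance within the rubric
    score: float            # LLM-assigned score, assumed to lie in [0, 1]
    gate: bool              # True if failing this criterion blocks acceptance
    gate_min: float = 0.5   # assumed minimum score for a gated criterion


def decide(results: list[CriterionResult],
           accept_at: float = 0.8,
           review_at: float = 0.5) -> Decision:
    """Pure policy: the same graded results always give the same decision."""
    if not results:
        # Nothing was graded, so escalate to a human rather than guess.
        return Decision.REVIEW
    # A failed gate rejects regardless of the weighted total.
    if any(r.gate and r.score < r.gate_min for r in results):
        return Decision.REJECT
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        return Decision.REVIEW
    weighted = sum(r.weight * r.score for r in results) / total_weight
    if weighted >= accept_at:
        return Decision.ACCEPT
    if weighted >= review_at:
        return Decision.REVIEW
    return Decision.REJECT
```

Because `decide` has no side effects, its behavior can be pinned down with ordinary table-driven unit tests over score combinations, which is what makes exhaustive testing practical.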

Evidence verified against real files

Every citation the model makes (path, line range, quote) is checked against the actual code and flagged if it can't be verified — fabricated evidence can't masquerade as proof.
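
A check like this can be sketched in a few lines. The version below is an assumption about how verification might work, not the platform's implementation: `Citation`, `verify_citation`, and the whitespace-normalizing comparison are hypothetical, and `files` stands in for the ingested repository as a mapping from path to file text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """A model-produced reference to code (hypothetical shape)."""
    path: str
    start_line: int   # 1-indexed, inclusive
    end_line: int     # 1-indexed, inclusive
    quote: str


def _normalize(text: str) -> str:
    # Collapse all whitespace so indentation and line wrapping
    # do not cause false mismatches.
    return " ".join(text.split())


def verify_citation(citation: Citation, files: dict[str, str]) -> bool:
    """Return True only if the quote appears in the cited line range."""
    source = files.get(citation.path)
    if source is None:
        return False  # the cited file does not exist
    lines = source.splitlines()
    if not (1 <= citation.start_line <= citation.end_line <= len(lines)):
        return False  # the line range falls outside the file
    cited = "\n".join(lines[citation.start_line - 1:citation.end_line])
    quote = _normalize(citation.quote)
    # An empty quote proves nothing, so treat it as unverified.
    return bool(quote) and quote in _normalize(cited)
```

A citation that fails this check would be flagged rather than silently dropped, so the report still shows what the model claimed and that the claim could not be confirmed.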

Rubrics are versioned data

Each rubric is canonicalized and content-hashed; every review records the rubric hash, model, and prompt version, so a published rubric can't silently change and any result is reproducible.
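
One common way to canonicalize and content-hash structured data is to serialize it as JSON with sorted keys and fixed separators, then hash the bytes. The sketch below assumes that approach and SHA-256; the source does not say which canonical form or hash the platform uses, and the names `rubric_hash` and `review_record` are hypothetical.

```python
import hashlib
import json


def rubric_hash(rubric: dict) -> str:
    """Hash a rubric so that key order and whitespace do not matter."""
    canonical = json.dumps(
        rubric,
        sort_keys=True,           # key order no longer affects the bytes
        separators=(",", ":"),    # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def review_record(rubric: dict, model: str, prompt_version: str) -> dict:
    """Provenance stored with each review (field names are assumptions)."""
    return {
        "rubric_hash": rubric_hash(rubric),
        "model": model,
        "prompt_version": prompt_version,
    }
```

Any edit to a criterion, weight, or gate changes the hash, so a stored review always points at the exact rubric content it was graded against. List order is preserved by this scheme, so reordering criteria also counts as a change.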

Built to run reliably

A durable leased job queue (survives crashes, scales across workers), live replayable SSE streaming, a FakeLLM port for offline/CI runs, and a golden-set regression harness.
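
The leasing idea can be illustrated with a small in-memory sketch. This is only a model of the mechanism under stated assumptions: a real durable queue would keep this state in the database and claim jobs atomically, and `LeasedQueue`, `Job`, `claim`, and `complete` are hypothetical names. A worker holds a job for a limited time; if it crashes and the lease expires, another worker can claim the job again.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    id: str
    status: str = "queued"            # queued -> leased -> done
    leased_by: Optional[str] = None
    lease_expires_at: float = 0.0


class LeasedQueue:
    """In-memory illustration of lease semantics (not durable)."""

    def __init__(self, lease_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._jobs: dict[str, Job] = {}
        self._lease_seconds = lease_seconds
        self._clock = clock   # injectable so tests can control time

    def enqueue(self, job_id: str) -> None:
        self._jobs[job_id] = Job(id=job_id)

    def claim(self, worker_id: str) -> Optional[Job]:
        """Lease the first job that is queued or whose lease has expired."""
        now = self._clock()
        for job in self._jobs.values():
            expired = job.status == "leased" and job.lease_expires_at <= now
            if job.status == "queued" or expired:
                job.status = "leased"
                job.leased_by = worker_id
                job.lease_expires_at = now + self._lease_seconds
                return job
        return None

    def complete(self, job_id: str, worker_id: str) -> bool:
        """Only the current lease holder may mark a job as done."""
        job = self._jobs.get(job_id)
        if job is None or job.status != "leased" or job.leased_by != worker_id:
            return False
        job.status = "done"
        return True
```

Checking the lease holder in `complete` means a worker that lost its lease cannot overwrite the result of the worker that took over.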

Outcomes

  • Turns code review into a repeatable, auditable, rubric-driven decision — not a one-off LLM opinion.
  • Runs the whole engine deterministically offline (FakeLLM) and measures every prompt/model change against a golden set.