
Paper List

An overview table followed by taxonomy-grouped annotations. Click any row in the table to open the source link directly; expand the cards below for detailed notes.

Overview

16 papers

Annotated Reading List

Generation strategy

Direct Generation

Generate rubrics directly from the task description, scoring goals, and a few examples.
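For intuition only, a direct-generation setup can be a single prompt that packs the task description, scoring goal, and a few examples into one request. The callLLM helper below is a hypothetical stand-in for any completion API, and the prompt wording is illustrative rather than taken from the papers in this group.

// Minimal sketch of direct rubric generation, assuming only a generic
// completion function callLLM(prompt) -> Promise<string> wired to your
// own LLM provider. Prompt wording is illustrative, not from any paper.
async function generateRubric(callLLM, taskDescription, scoringGoal, examples) {
  const shots = examples
    .map((ex, i) => `Example ${i + 1}\nQuery: ${ex.query}\nStrong answer: ${ex.answer}`)
    .join("\n\n");
  const prompt = [
    "You are writing an evaluation rubric.",
    `Task: ${taskDescription}`,
    `Scoring goal: ${scoringGoal}`,
    shots,
    "Write 5-8 query-specific criteria, each with a name, a one-line description, and a point value.",
  ].join("\n\n");
  return callLLM(prompt); // the query-specific rubric comes back as text
}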

Direct generation · Query-specific · LLM-as-judge
Core idea

Treats rubric generation as a conditional generation problem: task description and scoring target go in, a query-specific rubric comes out.

Why it matters

Establishes a baseline for automatic, query-conditioned rubric generation and shows where direct LLM generation struggles in specialized domains.

Direct generation · Adaptive
Core idea

Adaptive rubric generation framework that adjusts criteria to the current task and scoring context.

Why it matters

Illustrates how to push beyond a single static rubric toward task-aware criteria selection.

Generation strategy

Retrieval-Augmented Generation

Retrieve similar tasks or domain rubrics first, and then generate the rubric for the current task.
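A minimal retrieve-then-generate sketch might look like the following; searchRubricStore and callLLM are hypothetical placeholders for a retrieval index and an LLM client, not APIs from the papers below.

// Illustrative retrieve-then-generate pipeline. searchRubricStore and
// callLLM are hypothetical placeholders for a retrieval index and an
// LLM client; neither name comes from the papers in this group.
async function ragRubric(searchRubricStore, callLLM, taskDescription) {
  // 1. Retrieve similar tasks or domain rubrics first.
  const neighbors = await searchRubricStore(taskDescription, { topK: 3 });
  const context = neighbors
    .map((n) => `Related task: ${n.task}\nIts rubric:\n${n.rubric}`)
    .join("\n\n");
  // 2. Then generate the rubric for the current task, grounded in them.
  const prompt =
    `Domain reference material:\n${context}\n\n` +
    `Using the material above, write an evaluation rubric for this task:\n${taskDescription}`;
  return callLLM(prompt);
}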

RAG · Domain knowledge · Interpretable evaluation
Core idea

Augments rubric generation with retrieved domain knowledge or rubric exemplars before asking the model to produce the final criteria.

Why it matters

Addresses the shallowness of direct generation in professional domains where domain knowledge is essential.

Generation strategy

Preference-Driven Extraction

Rather than writing rubrics by hand, infer explicit criteria from preference pairs, judge data, or reward modeling signals.
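As a rough illustration, criteria extraction from preference data can be framed as a single prompt over chosen/rejected pairs; callLLM is again a hypothetical completion function, and the prompt wording is an assumption.

// Illustrative sketch of inferring criteria from preference pairs
// (chosen vs. rejected responses). callLLM is a hypothetical
// completion function; the prompt wording is an assumption.
async function extractCriteria(callLLM, preferencePairs) {
  const evidence = preferencePairs
    .map((p, i) => `Pair ${i + 1}\nPrompt: ${p.prompt}\nPreferred: ${p.chosen}\nRejected: ${p.rejected}`)
    .join("\n\n");
  const prompt =
    "Below are human preference judgments. List the explicit criteria that " +
    "separate the preferred answers from the rejected ones, one per line.\n\n" +
    evidence;
  const text = await callLLM(prompt);
  return text.split("\n").map((line) => line.trim()).filter(Boolean);
}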

Preference-driven · Reward modeling · Criteria extraction
Core idea

Reverse-engineers rubric criteria from preference pairs and reward modeling data so that criteria reflect what humans actually prefer.

Why it matters

Connects rubric construction with reward modeling and alignment pipelines, bridging evaluation and training-time signals.

Generation strategy

Refinement and Decomposition

Do not generate a rubric from scratch; instead decompose, rewrite, and de-bias an existing rubric.
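A minimal decomposition sketch, assuming a hypothetical callLLM completion function, might look like this:

// Illustrative decomposition step: split a coarse rubric into atomic
// criteria and drop near-duplicates. callLLM is a hypothetical
// completion function, not an API from the papers in this group.
async function decomposeRubric(callLLM, coarseRubric) {
  const prompt =
    "Rewrite the rubric below as a flat list of atomic criteria. Each " +
    "criterion must test exactly one thing and must not overlap with any " +
    "other criterion. One criterion per line.\n\n" + coarseRubric;
  const text = await callLLM(prompt);
  const atomic = text.split("\n").map((c) => c.trim()).filter(Boolean);
  // Crude redundancy filter: drop duplicates after case normalization.
  const seen = new Set();
  return atomic.filter((c) => {
    const key = c.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}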

Decomposition · Atomic criteria · Open-ended tasks
Core idea

Reframes rubric generation as decomposition and rewriting: split coarse rubrics into finer atomic criteria and reduce redundancy.

Why it matters

Shows that rubric quality for open-ended tasks often depends more on decomposition than on the original draft.

Refinement · Human feedback · Essay scoring
Core idea

Iteratively refines a rubric by analyzing disagreement with human scores and rewriting the criteria accordingly (a minimal loop sketch follows after this card).

Why it matters

Treats rubric design as prompt optimization guided by human-model disagreement.
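A rough loop for this kind of disagreement-driven refinement, with scoreWithRubric and callLLM as hypothetical placeholders rather than the paper's actual method, could look like this:

// Illustrative refinement loop: score with the current rubric, collect
// human-model disagreements, and ask the model to rewrite the rubric.
// scoreWithRubric and callLLM are hypothetical placeholders.
async function refineRubric(callLLM, scoreWithRubric, rubric, labeledResponses, rounds = 3) {
  for (let round = 0; round < rounds; round++) {
    const disagreements = [];
    for (const item of labeledResponses) {
      const modelScore = await scoreWithRubric(rubric, item.response);
      if (Math.abs(modelScore - item.humanScore) >= 1) {
        disagreements.push({ ...item, modelScore });
      }
    }
    if (disagreements.length === 0) break; // rubric already tracks human scores
    const prompt =
      `Current rubric:\n${rubric}\n\nThese responses were scored differently by the rubric and by humans:\n\n` +
      disagreements
        .map((d) => `Response: ${d.response}\nHuman score: ${d.humanScore}, rubric score: ${d.modelScore}`)
        .join("\n\n") +
      "\n\nRewrite the rubric so its scores track the human scores more closely.";
    rubric = await callLLM(prompt);
  }
  return rubric;
}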

Iterative refinement · Judge calibration
Core idea

Iteratively revises evaluation criteria based on feedback signals and observed judging errors.

Why it matters

Provides a general template for feedback-driven rubric refinement beyond a single dataset.

Generation strategy

Human-in-the-Loop and Expert-Authored Rubrics

Rely on experts or expert-labeled rubrics, often supplemented by model-based generators for scale.

Expert rubrics · Healthcare · Benchmark

OpenAI

Core idea

Uses expert-authored rubrics with explicit items, scores, descriptions, and failure conditions for health-related LLM evaluation (a rough data sketch follows after this card).

Why it matters

Strong example of why high-stakes domains require expert-defined rubrics with codebook-level detail.
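For concreteness, one such rubric item could be represented roughly as the object below; the field names are illustrative assumptions, not the benchmark's actual schema.

// Hedged sketch of one expert-authored rubric item with the kinds of
// fields the card above mentions (item, score, description, failure
// condition). Field names are assumptions for illustration.
const exampleRubricItem = {
  item: "Advises urgent in-person care for red-flag symptoms",
  description:
    "The response tells the user to seek immediate medical attention when red-flag symptoms are present.",
  points: 5,
  failureCondition:
    "The response recommends waiting or self-treatment despite red-flag symptoms.",
};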

Delphi method · Consensus · Clinical evaluation
Core idea

Builds a clinical rubric through a multi-specialty Delphi panel, with iterative scoring and a 75% agreement threshold (a minimal threshold sketch follows after this card).

Why it matters

Demonstrates how consensus-driven processes produce trustworthy rubrics for sensitive domains.
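A minimal sketch of the 75% consensus filter, assuming panel ratings are stored as boolean votes per criterion (the data shape is an assumption, not the paper's protocol):

// Keep only criteria that at least 75% of panelists endorse.
function filterByConsensus(panelRatings, threshold = 0.75) {
  return Object.entries(panelRatings)
    .filter(([, votes]) => votes.filter(Boolean).length / votes.length >= threshold)
    .map(([criterion]) => criterion);
}

// Example: two of the three criteria clear the 75% bar.
filterByConsensus({
  "Cites contraindications": [true, true, true, true],
  "Mentions dosage limits": [true, true, true, false],
  "Uses plain language": [true, false, false, true],
});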

Multi-stakeholder · Legal AI · Human-centered
Core idea

Brings legal experts, end users, and technical staff together to define quality standards for legal AI answers.

Why it matters

Shows that rubric design for user-facing systems may require multi-stakeholder input, not just domain experts.

Human-in-the-loop · Generator · Self-adaptive

Xiaohongshu

Core idea

Trains a rubric generator from expert-labeled rubrics, then uses it to scale up rubric-based scoring.

Why it matters

Practical human-in-the-loop pipeline that balances expert quality with automation.

Benchmark · Expert tasks · Rubric-based evaluation
Core idea

Benchmark suite of expert-level tasks where rubrics are the primary evaluation mechanism.

Why it matters

Provides a shared testbed for expert-level rubric-based evaluation across tasks.

Setting

General-Purpose vs Domain-Specific Rubrics

Rubric design differs substantially between open-ended general tasks and specialized professional domains.

General-purpose · Disagreement modeling · Multi-dimensional

Microsoft

Core idea

Treats rubric evaluation as a multi-dimensional scoring problem where human judges themselves disagree, and models this disagreement explicitly.

Why it matters

Motivates rubric designs that tolerate and model judge disagreement rather than assuming a single ground truth score.

How to add a new paper

Overview table — add an object to the papers array at the top of this file:

{
  title: "Paper Title",
  href: "https://arxiv.org/abs/xxxx",
  org: "Institution",
  year: "2025",
  usage: "Rubric as Reward", // tag for "How Rubrics Are Used"
  stage: "Training-Time",    // tag for "By Model Training Stage"
}

Annotated card — add a <PaperCard> inside the matching <PaperGroup>:

<PaperCard
  title="…"
  href="…"
  venue="…" year="…" type="…"
  idea="…"
  why="…"
  tags={['…']}
/>
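
For reference, a filled-in card might look like the snippet below. The title prop on <PaperGroup> is an assumption about this file's component API, so copy the exact props from an existing group in the file rather than from this sketch.

<PaperGroup title="Generation strategy: Retrieval-Augmented Generation">
  <PaperCard
    title="Example Paper Title"
    href="https://arxiv.org/abs/xxxx"
    venue="arXiv" year="2025" type="Preprint"
    idea="One-sentence summary of the core idea."
    why="One-sentence note on why it matters for rubric design."
    tags={['RAG', 'Domain knowledge']}
  />
</PaperGroup>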