
Paper List

An overview table followed by taxonomy-grouped annotations. Click any row in the table to open the source link directly; expand the cards below for detailed notes.

Overview

16 papers

Annotated Reading List

Generation strategy

Direct Generation

Generate rubrics directly from the task description, scoring goals, and a few examples.
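For intuition only, a direct-generation setup can be a single prompt that packs the task description, scoring goal, and a few examples into one request. The callLLM helper below is a hypothetical stand-in for any completion API, and the prompt wording is illustrative rather than taken from the papers in this group.

// Minimal sketch of direct rubric generation, assuming only a generic
// completion function callLLM(prompt) -> Promise<string> wired to your
// own LLM provider. Prompt wording is illustrative, not from any paper.
async function generateRubric(callLLM, taskDescription, scoringGoal, examples) {
  const shots = examples
    .map((ex, i) => `Example ${i + 1}\nQuery: ${ex.query}\nStrong answer: ${ex.answer}`)
    .join("\n\n");
  const prompt = [
    "You are writing an evaluation rubric.",
    `Task: ${taskDescription}`,
    `Scoring goal: ${scoringGoal}`,
    shots,
    "Write 5-8 query-specific criteria, each with a name, a one-line description, and a point value.",
  ].join("\n\n");
  return callLLM(prompt); // the query-specific rubric comes back as text
}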

Direct generation · Query-specific · LLM-as-judge
Core idea

Treats rubric generation as a conditional generation problem: task description and scoring target go in, a query-specific rubric comes out.

Why it matters

Establishes a baseline for automatic, query-conditioned rubric generation and shows where direct LLM generation struggles in specialized domains.

Direct generation · Adaptive
Core idea

Adaptive rubric generation framework that adjusts criteria to the current task and scoring context.

Why it matters

Illustrates how to push beyond a single static rubric toward task-aware criteria selection.

Generation strategy

Retrieval-Augmented Generation

Retrieve similar tasks or domain rubrics first, and then generate the rubric for the current task.
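A minimal retrieve-then-generate sketch might look like the following; searchRubricStore and callLLM are hypothetical placeholders for a retrieval index and an LLM client, not APIs from the papers below.

// Illustrative retrieve-then-generate pipeline. searchRubricStore and
// callLLM are hypothetical placeholders for a retrieval index and an
// LLM client; neither name comes from the papers in this group.
async function ragRubric(searchRubricStore, callLLM, taskDescription) {
  // 1. Retrieve similar tasks or domain rubrics first.
  const neighbors = await searchRubricStore(taskDescription, { topK: 3 });
  const context = neighbors
    .map((n) => `Related task: ${n.task}\nIts rubric:\n${n.rubric}`)
    .join("\n\n");
  // 2. Then generate the rubric for the current task, grounded in them.
  const prompt =
    `Domain reference material:\n${context}\n\n` +
    `Using the material above, write an evaluation rubric for this task:\n${taskDescription}`;
  return callLLM(prompt);
}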

RAG · Domain knowledge · Interpretable evaluation
Core idea

Augments rubric generation with retrieved domain knowledge or rubric exemplars before asking the model to produce the final criteria.

Why it matters

Addresses the shallowness of direct generation in professional domains where domain knowledge is essential.

Generation strategy

Preference-Driven Extraction

Rather than writing rubrics by hand, infer explicit criteria from preference pairs, judge data, or reward modeling signals.
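As a rough illustration, criteria extraction from preference data can be framed as a single prompt over chosen/rejected pairs; callLLM is again a hypothetical completion function, and the prompt wording is an assumption.

// Illustrative sketch of inferring criteria from preference pairs
// (chosen vs. rejected responses). callLLM is a hypothetical
// completion function; the prompt wording is an assumption.
async function extractCriteria(callLLM, preferencePairs) {
  const evidence = preferencePairs
    .map((p, i) => `Pair ${i + 1}\nPrompt: ${p.prompt}\nPreferred: ${p.chosen}\nRejected: ${p.rejected}`)
    .join("\n\n");
  const prompt =
    "Below are human preference judgments. List the explicit criteria that " +
    "separate the preferred answers from the rejected ones, one per line.\n\n" +
    evidence;
  const text = await callLLM(prompt);
  return text.split("\n").map((line) => line.trim()).filter(Boolean);
}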

Preference-driven · Reward modeling · Criteria extraction
Core idea

Reverse-engineers rubric criteria from preference pairs and reward modeling data so that criteria reflect what humans actually prefer.

Why it matters

Connects rubric construction with reward modeling and alignment pipelines, bridging evaluation and training-time signals.

Generation strategy

Refinement and Decomposition

Do not generate a rubric from scratch; instead decompose, rewrite, and de-bias an existing rubric.
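A minimal decomposition sketch, assuming a hypothetical callLLM completion function, might look like this:

// Illustrative decomposition step: split a coarse rubric into atomic
// criteria and drop near-duplicates. callLLM is a hypothetical
// completion function, not an API from the papers in this group.
async function decomposeRubric(callLLM, coarseRubric) {
  const prompt =
    "Rewrite the rubric below as a flat list of atomic criteria. Each " +
    "criterion must test exactly one thing and must not overlap with any " +
    "other criterion. One criterion per line.\n\n" + coarseRubric;
  const text = await callLLM(prompt);
  const atomic = text.split("\n").map((c) => c.trim()).filter(Boolean);
  // Crude redundancy filter: drop duplicates after case normalization.
  const seen = new Set();
  return atomic.filter((c) => {
    const key = c.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}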

Decomposition · Atomic criteria · Open-ended tasks
Core idea

Reframes rubric generation as decomposition and rewriting: split coarse rubrics into finer atomic criteria and reduce redundancy.

Why it matters

Shows that rubric quality for open-ended tasks often depends more on decomposition than on the original draft.

Refinement · Human feedback · Essay scoring
Core idea

Iteratively refines a rubric by analyzing disagreement with human scores and rewriting the criteria accordingly (a minimal loop sketch follows after this card).

Why it matters

Treats rubric design as prompt optimization guided by human-model disagreement.
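A rough loop for this kind of disagreement-driven refinement, with scoreWithRubric and callLLM as hypothetical placeholders rather than the paper's actual method, could look like this:

// Illustrative refinement loop: score with the current rubric, collect
// human-model disagreements, and ask the model to rewrite the rubric.
// scoreWithRubric and callLLM are hypothetical placeholders.
async function refineRubric(callLLM, scoreWithRubric, rubric, labeledResponses, rounds = 3) {
  for (let round = 0; round < rounds; round++) {
    const disagreements = [];
    for (const item of labeledResponses) {
      const modelScore = await scoreWithRubric(rubric, item.response);
      if (Math.abs(modelScore - item.humanScore) >= 1) {
        disagreements.push({ ...item, modelScore });
      }
    }
    if (disagreements.length === 0) break; // rubric already tracks human scores
    const prompt =
      `Current rubric:\n${rubric}\n\nThese responses were scored differently by the rubric and by humans:\n\n` +
      disagreements
        .map((d) => `Response: ${d.response}\nHuman score: ${d.humanScore}, rubric score: ${d.modelScore}`)
        .join("\n\n") +
      "\n\nRewrite the rubric so its scores track the human scores more closely.";
    rubric = await callLLM(prompt);
  }
  return rubric;
}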

Iterative refinement · Judge calibration
Core idea

Iteratively revises evaluation criteria based on feedback signals and observed judging errors.

Why it matters

Provides a general template for feedback-driven rubric refinement beyond a single dataset.

Generation strategy

Human-in-the-Loop and Expert-Authored Rubrics

Rely on experts or expert-labeled rubrics, often supplemented by model-based generators for scale.

Expert rubrics · Healthcare · Benchmark

OpenAI

Core idea

Uses expert-authored rubrics with explicit items, scores, descriptions, and failure conditions for health-related LLM evaluation (a rough data sketch follows after this card).

Why it matters

Strong example of why high-stakes domains require expert-defined rubrics with codebook-level detail.
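For concreteness, one such rubric item could be represented roughly as the object below; the field names are illustrative assumptions, not the benchmark's actual schema.

// Hedged sketch of one expert-authored rubric item with the kinds of
// fields the card above mentions (item, score, description, failure
// condition). Field names are assumptions for illustration.
const exampleRubricItem = {
  item: "Advises urgent in-person care for red-flag symptoms",
  description:
    "The response tells the user to seek immediate medical attention when red-flag symptoms are present.",
  points: 5,
  failureCondition:
    "The response recommends waiting or self-treatment despite red-flag symptoms.",
};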

Delphi method · Consensus · Clinical evaluation
Core idea

Builds a clinical rubric through a multi-specialty Delphi panel, with iterative scoring and a 75% agreement threshold (a minimal threshold sketch follows after this card).

Why it matters

Demonstrates how consensus-driven processes produce trustworthy rubrics for sensitive domains.
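A minimal sketch of the 75% consensus filter, assuming panel ratings are stored as boolean votes per criterion (the data shape is an assumption, not the paper's protocol):

// Keep only criteria that at least 75% of panelists endorse.
function filterByConsensus(panelRatings, threshold = 0.75) {
  return Object.entries(panelRatings)
    .filter(([, votes]) => votes.filter(Boolean).length / votes.length >= threshold)
    .map(([criterion]) => criterion);
}

// Example: two of the three criteria clear the 75% bar.
filterByConsensus({
  "Cites contraindications": [true, true, true, true],
  "Mentions dosage limits": [true, true, true, false],
  "Uses plain language": [true, false, false, true],
});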

Multi-stakeholder · Legal AI · Human-centered
Core idea

Brings legal experts, end users, and technical staff together to define quality standards for legal AI answers.

Why it matters

Shows that rubric design for user-facing systems may require multi-stakeholder input, not just domain experts.

Human-in-the-loop · Generator · Self-adaptive

Xiaohongshu

Core idea

Trains a rubric generator from expert-labeled rubrics, then uses it to scale up rubric-based scoring.

Why it matters

Practical human-in-the-loop pipeline that balances expert quality with automation.

Benchmark · Expert tasks · Rubric-based evaluation
Core idea

Benchmark suite of expert-level tasks where rubrics are the primary evaluation mechanism.

Why it matters

Provides a shared testbed for expert-level rubric-based evaluation across tasks.

Setting

General-Purpose vs Domain-Specific Rubrics

Rubric design differs substantially between open-ended general tasks and specialized professional domains.

General-purpose · Disagreement modeling · Multi-dimensional

Microsoft

Core idea

Treats rubric evaluation as a multi-dimensional scoring problem where human judges themselves disagree, and models this disagreement explicitly.

Why it matters

Motivates rubric designs that tolerate and model judge disagreement rather than assuming a single ground truth score.

How to add a new paper

Overview table — add an object to the papers array at the top of this file:

{
  title: "Paper Title",
  href: "https://arxiv.org/abs/xxxx",
  org: "Institution",
  year: "2025",
  usage: "Rubric as Reward", // tag for "How Rubrics Are Used"
  stage: "Training-Time",    // tag for "By Model Training Stage"
}

Annotated card — add a <PaperCard> inside the matching <PaperGroup>:

<PaperCard
  title="…"
  href="…"
  venue="…" year="…" type="…"
  idea="…"
  why="…"
  tags={['…']}
/>
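
For reference, a filled-in card might look like the snippet below. The title prop on <PaperGroup> is an assumption about this file's component API, so copy the exact props from an existing group in the file rather than from this sketch.

<PaperGroup title="Generation strategy: Retrieval-Augmented Generation">
  <PaperCard
    title="Example Paper Title"
    href="https://arxiv.org/abs/xxxx"
    venue="arXiv" year="2025" type="Preprint"
    idea="One-sentence summary of the core idea."
    why="One-sentence note on why it matters for rubric design."
    tags={['RAG', 'Domain knowledge']}
  />
</PaperGroup>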