Paper List
An overview table followed by taxonomy-grouped annotations. Click any row in the table to open the source link directly; expand the cards below for detailed notes.
Overview
Annotated Reading List
Generation strategy
Direct Generation
Generate rubrics directly from the task description, scoring goals, and a few examples.
Treats rubric generation as a conditional generation problem: task description and scoring target go in, a query-specific rubric comes out.
Establishes a baseline for automatic, query-conditioned rubric generation and shows where direct LLM generation struggles in specialized domains.
Adaptive rubric generation framework that adjusts criteria to the current task and scoring context.
Illustrates how to push beyond a single static rubric toward task-aware criteria selection.
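Below is a minimal sketch of the direct approach, assuming a generic generate callback that wraps whatever LLM API is in use; the prompt fields and the JSON output shape are illustrative, not taken from any specific paper above.

// Direct rubric generation: condition the model only on the task itself.
// `generate` is an assumed callback around an LLM call, not any paper's API.
type Criterion = { name: string; description: string; weight: number };

async function generateRubricDirect(
  taskDescription: string,
  scoringGoal: string,
  examples: string[],
  generate: (prompt: string) => Promise<string>
): Promise<Criterion[]> {
  const prompt = [
    "Write an evaluation rubric as a JSON array of criteria.",
    "Each criterion needs: name, description, weight (weights sum to 1).",
    `Task: ${taskDescription}`,
    `Scoring goal: ${scoringGoal}`,
    `Example responses:\n${examples.map((e, i) => `${i + 1}. ${e}`).join("\n")}`,
  ].join("\n\n");
  return JSON.parse(await generate(prompt)) as Criterion[];
}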
Generation strategy
Retrieval-Augmented Generation
Retrieve similar tasks or domain rubrics first, and then generate the rubric for the current task.
Augments rubric generation with retrieved domain knowledge or rubric exemplars before asking the model to produce the final criteria.
Addresses the shallowness of direct generation in professional domains where domain knowledge is essential.
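A hedged sketch of the retrieve-then-generate pattern: the word-overlap similarity below is only a stand-in for a real embedding retriever, and the exemplar store and generate callback are assumptions rather than any specific system's pipeline.

// Retrieval-augmented rubric generation: fetch similar exemplar rubrics,
// then condition generation on them.
type Exemplar = { task: string; rubric: string };

function overlap(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\W+/));
  const wb = new Set(b.toLowerCase().split(/\W+/));
  const shared = [...wa].filter((w) => wb.has(w)).length;
  return shared / Math.max(wa.size, wb.size);
}

async function generateRubricRAG(
  taskDescription: string,
  store: Exemplar[],
  generate: (prompt: string) => Promise<string>,
  k = 3
): Promise<string> {
  const retrieved = [...store]
    .sort((x, y) => overlap(y.task, taskDescription) - overlap(x.task, taskDescription))
    .slice(0, k);
  const prompt = [
    "Using the exemplar rubrics below as domain references, write a rubric tailored to the new task.",
    ...retrieved.map((e) => `Exemplar task: ${e.task}\nExemplar rubric: ${e.rubric}`),
    `New task: ${taskDescription}`,
  ].join("\n\n");
  return generate(prompt);
}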
Generation strategy
Preference-Driven Extraction
Rather than writing rubrics by hand, infer explicit criteria from preference pairs, judge data, or reward modeling signals.
Reverse-engineers rubric criteria from preference pairs and reward modeling data so that criteria reflect what humans actually prefer.
Connects rubric construction with reward modeling and alignment pipelines, bridging evaluation and training-time signals.
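A minimal sketch of extracting criteria from preference data; the pair format and the contrastive instruction are illustrative assumptions, not a particular paper's method.

// Preference-driven extraction: ask the model to state, as explicit criteria,
// what distinguishes preferred answers from rejected ones.
type PreferencePair = { prompt: string; chosen: string; rejected: string };

async function extractCriteriaFromPreferences(
  pairs: PreferencePair[],
  generate: (prompt: string) => Promise<string>
): Promise<string[]> {
  const evidence = pairs
    .map((p, i) => `Pair ${i + 1}\nPrompt: ${p.prompt}\nPreferred: ${p.chosen}\nRejected: ${p.rejected}`)
    .join("\n\n");
  const prompt =
    "From the preference pairs below, list the recurring qualities that make " +
    "the preferred answers better. Return one criterion per line.\n\n" + evidence;
  const raw = await generate(prompt);
  return raw.split("\n").map((l) => l.trim()).filter(Boolean);
}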
Generation strategy
Refinement and Decomposition
Do not generate a rubric from scratch; instead decompose, rewrite, and de-bias an existing rubric.
Reframes rubric generation as decomposition and rewriting: split coarse rubrics into finer atomic criteria and reduce redundancy.
Shows that rubric quality for open-ended tasks often depends more on decomposition than on the original draft.
Iteratively refines a rubric by analyzing disagreement with human scores and rewriting the criteria accordingly.
Treats rubric design as prompt optimization guided by human-model disagreement.
Iteratively revises evaluation criteria based on feedback signals and observed judging errors.
Provides a general template for feedback-driven rubric refinement beyond a single dataset.
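A sketch of the disagreement-driven refinement loop described above, assuming hypothetical scoreWithRubric and rewrite callbacks for rubric-based judging and LLM rewriting; the tolerance and round limit are arbitrary.

// Feedback-driven refinement: rewrite the rubric whenever its scores
// disagree too much with human scores.
async function refineRubric(
  rubric: string,
  humanScores: { response: string; score: number }[],
  scoreWithRubric: (rubric: string, response: string) => Promise<number>,
  rewrite: (rubric: string, disagreements: string) => Promise<string>,
  maxRounds = 3,
  tolerance = 0.5
): Promise<string> {
  let current = rubric;
  for (let round = 0; round < maxRounds; round++) {
    const disagreements: string[] = [];
    for (const { response, score } of humanScores) {
      const modelScore = await scoreWithRubric(current, response);
      if (Math.abs(modelScore - score) > tolerance) {
        disagreements.push(`Response: ${response}\nHuman: ${score}, Rubric-based: ${modelScore}`);
      }
    }
    if (disagreements.length === 0) break; // criteria already track human judgment
    current = await rewrite(current, disagreements.join("\n\n"));
  }
  return current;
}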
Generation strategy
Human-in-the-Loop and Expert-Authored Rubrics
Rely on experts or expert-labeled rubrics, often supplemented by model-based generators for scale.
Uses expert-authored rubrics with explicit items, scores, descriptions, and failure conditions for health-related LLM evaluation.
Strong example of why high-stakes domains require expert-defined rubrics with codebook-level detail.
Builds a clinical rubric through a multi-specialty Delphi panel, with iterative scoring and a 75% agreement threshold.
Demonstrates how consensus-driven processes produce trustworthy rubrics for sensitive domains.
Involves legal experts, end users, and technical staff together when defining quality standards for legal AI answers.
Shows that rubric design for user-facing systems may require multi-stakeholder input, not just domain experts.
Trains a rubric generator from expert-labeled rubrics, then uses it to scale up rubric-based scoring.
Practical human-in-the-loop pipeline that balances expert quality with automation.
Benchmark suite of expert-level tasks where rubrics are the primary evaluation mechanism.
Provides a shared testbed for expert-level rubric-based evaluation across tasks.
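For the Delphi-style process above, a tiny sketch of the consensus filter it implies: keep a candidate criterion only when at least 75% of panelists vote to include it. The vote encoding is an assumption for illustration, not the panel's actual protocol.

// Consensus filtering in a Delphi-style round: keep criteria that reach
// the agreement threshold (e.g. 75% of panelists voting "include").
type PanelVotes = { criterion: string; votes: boolean[] }; // true = include

function filterByConsensus(panel: PanelVotes[], threshold = 0.75): string[] {
  return panel
    .filter(({ votes }) => votes.filter(Boolean).length / votes.length >= threshold)
    .map(({ criterion }) => criterion);
}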
Setting
General-Purpose vs Domain-Specific Rubrics
Rubric design differs substantially between open-ended general tasks and specialized professional domains.
Treats rubric evaluation as a multi-dimensional scoring problem where human judges themselves disagree, and models this disagreement explicitly.
Motivates rubric designs that tolerate and model judge disagreement rather than assuming a single ground truth score.
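One simple, illustrative way to surface judge disagreement rather than collapsing it into a single score: report the per-criterion spread across judges. This is a toy measure, not the modeling approach used in the paper above.

// Per-criterion score spread across judges as a disagreement signal.
type JudgeScores = Record<string, number[]>; // criterion -> one score per judge

function disagreementByCriterion(scores: JudgeScores): Record<string, number> {
  const out: Record<string, number> = {};
  for (const [criterion, values] of Object.entries(scores)) {
    const mean = values.reduce((s, v) => s + v, 0) / values.length;
    const variance = values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length;
    out[criterion] = Math.sqrt(variance); // standard deviation per criterion
  }
  return out;
}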
How to add a new paper
Overview table — add an object to the papers array at the top of this file:
{
  title: "Paper Title",
  href: "https://arxiv.org/abs/xxxx",
  org: "Institution",
  year: "2025",
  usage: "Rubric as Reward", // tag for how the rubric is used
  stage: "Training-Time",    // tag for the model training stage
}
Annotated card — add a <PaperCard> inside the matching <PaperGroup>:
<PaperCard
  title="…"
  href="…"
  venue="…" year="…" type="…"
  idea="…"
  why="…"
  tags={['…']}
/>