VideoGen-RewardBench: Evaluating Reward Models for Video Generation

Evaluating the capabilities of reward models for video generation.

Code | Project | Eval. Dataset | Paper | Total models: 5 | * Unverified models | ⚠️ Dataset Contamination | Last restart: 20:15 PST, 09 Feb 2025

| Model | Model Type | Avg. | Avg. (w/o Ties) | Avg. (w/ Ties) | Overall (w/o Ties) | VQ (w/o Ties) | MQ (w/o Ties) | TA (w/o Ties) | Overall (w/ Ties) | VQ (w/ Ties) | MQ (w/ Ties) | TA (w/ Ties) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Seq. Classifiers | 67.12 | 74.04 | 60.19 | 73.59 | 75.66 | 60.98 | 61.15 | 61.26 | 59.68 | 66.03 | 46.56 |

VQ, MQ, and TA denote Visual Quality, Motion Quality, and Text Alignment, respectively.

Overview

The prompt-video pairs are sourced from VideoGen-Eval, a dataset featuring a diverse range of prompts and videos generated by state-of-the-art video diffusion models (VDMs). Our benchmark comprises 26.5k video pairs, each annotated with human preference labels.
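
For concreteness, a single annotated comparison could be represented as in the sketch below. The field names, file paths, and prompt here are hypothetical placeholders, not the dataset's actual schema.

```python
# One hypothetical annotated comparison; field names and paths are
# illustrative, not VideoGen-RewardBench's actual schema.
example_pair = {
    "prompt": "A corgi surfing a wave at sunset",  # generation prompt
    "video_A": "videos/model_x/0001.mp4",          # candidate from one VDM
    "video_B": "videos/model_y/0001.mp4",          # candidate from another VDM
    # Preference labels ("A wins", "B wins", or "ties"), one per dimension
    "Overall": "A wins",
    "VQ": "A wins",   # Visual Quality
    "MQ": "ties",     # Motion Quality
    "TA": "B wins",   # Text Alignment
}
```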

We report two accuracy metrics: ties-included accuracy (w/ Ties) and ties-excluded accuracy (w/o Ties).

  • For ties-excluded accuracy, we exclude all data labeled as "ties" and use only data labeled as "A wins" or "B wins". We compute the reward for each prompt-video pair, convert the relative reward relationship within each pair into a binary label, and calculate classification accuracy.
  • For ties-included accuracy, we adopt Algorithm 1 proposed by Ties Matter. This method traverses all possible tie thresholds, calculates three-class accuracy at each threshold, and selects the highest accuracy as the final metric. See calc_accuracy for the implementation of ties-included accuracy; a sketch of both metrics follows this list.
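
As a concrete illustration, here is a minimal sketch of both metrics in Python, assuming one scalar reward per prompt-video pair and gold labels in {"A", "B", "ties"}. It mirrors the logic described above but is not the leaderboard's actual calc_accuracy code.

```python
def ties_excluded_accuracy(rewards_a, rewards_b, labels):
    """Drop "ties" pairs, then score the binary prediction sign(r_A - r_B)."""
    correct = total = 0
    for ra, rb, y in zip(rewards_a, rewards_b, labels):
        if y == "ties":
            continue
        pred = "A" if ra > rb else "B"
        correct += pred == y
        total += 1
    return correct / total

def ties_included_accuracy(rewards_a, rewards_b, labels):
    """In the spirit of Ties Matter's Algorithm 1: sweep tie thresholds
    over the observed reward gaps, score three-class accuracy at each
    threshold, and keep the best."""
    diffs = [ra - rb for ra, rb in zip(rewards_a, rewards_b)]
    # Candidate thresholds: every observed absolute reward gap, plus 0.
    candidates = sorted({0.0, *(abs(d) for d in diffs)})
    best = 0.0
    for tau in candidates:
        correct = 0
        for d, y in zip(diffs, labels):
            if abs(d) <= tau:
                pred = "ties"   # gap too small to call a winner
            else:
                pred = "A" if d > 0 else "B"
            correct += pred == y
        best = max(best, correct / len(labels))
    return best
```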

We include multiple types of reward models in this evaluation:

  1. Sequence Classifiers (Seq. Classifiers): Models that take in a prompt and a video and output a scalar score.
  2. Custom Classifiers: Research models with different architectures and training objectives.
  3. Random: Random choice baseline.
  4. Generative: Prompting fine-tuned models to choose directly between the two videos; a sketch of these interfaces follows this list.
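
The sketch below shows how these model types can be reduced to a common pairwise-choice interface for evaluation; score_fn and judge_fn are hypothetical callables standing in for real models.

```python
import random

def choice_from_seq_classifier(score_fn, prompt, video_a, video_b):
    # Sequence classifier: score each (prompt, video) pair independently
    # and prefer the higher-scoring video.
    return "A" if score_fn(prompt, video_a) > score_fn(prompt, video_b) else "B"

def choice_from_generative(judge_fn, prompt, video_a, video_b):
    # Generative judge: ask the model to pick a side directly;
    # judge_fn is assumed to return "A" or "B".
    return judge_fn(prompt, video_a, video_b)

def choice_random(prompt, video_a, video_b):
    # Random baseline: ignore the inputs entirely.
    return random.choice(["A", "B"])
```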

Note: Models marked with an asterisk (*) are independently submitted scores that have not been verified by the VideoGen-RewardBench team.

Acknowledgments

Our leaderboard is built on RewardBench. The prompt-video pairs are sourced from VideoGen-Eval. We sincerely thank all the contributors!