VideoGen-RewardBench: Evaluating Reward Models for Video Generation
Evaluating the capabilities of reward models for video generation.
Code | Project | Eval. Dataset | Paper | Total models: 5 | * Unverified models | ⚠️ Dataset Contamination | Last restart: 20:15 PST, 09 Feb 2025
Model | Model Type | Avg. | Avg. (w/o Ties) | Avg. (w/ Ties) | Overall (w/o Ties) | VQ (w/o Ties) | MQ (w/o Ties) | TA (w/o Ties) | Overall (w/ Ties) | VQ (w/ Ties) | MQ (w/ Ties) | TA (w/ Ties) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
[VideoReward](https://huggingface.co/KwaiVGI/VideoReward) | Seq. Classifiers | 67.12 | 74.04 | 60.19 | 73.59 | 75.66 | 74.70 | 72.20 | 61.26 | 59.68 | 66.03 | 53.80 |
[VisionReward](https://huggingface.co/THUDM/VisionReward-Video) | Seq. Classifiers | 57.32 | 62.19 | 52.45 | 67.59 | 59.03 | 60.98 | 61.15 | 56.77 | 47.43 | 59.03 | 46.56 |
2 | Generative | 50.38 | 55.89 | 44.86 | 57.26 | 55.97 | 54.91 | 55.43 | 39.08 | 47.53 | 59.04 | 33.79 |
3 | Others | 48.22 | 50.05 | 46.40 | 50.30 | 49.86 | 49.64 | 50.40 | 41.86 | 47.42 | 59.07 | 37.25 |
4 | Seq. Classifiers | 48.11 | 49.84 | 46.38 | 50.22 | 47.72 | 51.09 | 50.34 | 41.80 | 47.41 | 59.05 | 37.24 |
Overview
The prompt-video pairs are sourced from VideoGen-Eval, a dataset featuring a diverse range of prompts and videos generated by state-of-the-art video diffusion models (VDMs). Our benchmark comprises 26.5k video pairs, each annotated with a corresponding preference label.
We report two accuracy metrics: ties-included accuracy (w/ Ties) and ties-excluded accuracy (w/o Ties).
- For ties-excluded accuracy, we exclude all data labeled as "ties" and use only data labeled as "A wins" or "B wins". We compute the reward for each prompt-video pair, convert the relative reward relationship into a binary prediction, and calculate the classification accuracy.
- For ties-included accuracy, we adopt Algorithm 1 proposed by Ties Matter: it traverses all possible tie thresholds, computes the three-class accuracy at each threshold, and reports the highest accuracy as the final metric. See calc_accuracy for the implementation, and the sketch after this list for a rough illustration.
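The following is a minimal sketch of both metrics, assuming each benchmark example provides two scalar rewards (one per video) and a human label in "A", "B", or "tie". The function names are illustrative stand-ins, not the official calc_accuracy scripts.

```python
import numpy as np

def accuracy_without_ties(reward_a, reward_b, labels):
    """Ties-excluded accuracy: drop "tie" examples, then check whether the sign
    of the reward difference matches the human preference."""
    reward_a, reward_b = np.asarray(reward_a, float), np.asarray(reward_b, float)
    labels = np.asarray(labels)
    keep = labels != "tie"
    pred_a_wins = reward_a[keep] > reward_b[keep]
    return float(np.mean(pred_a_wins == (labels[keep] == "A")))

def accuracy_with_ties(reward_a, reward_b, labels):
    """Ties-included accuracy in the spirit of Ties Matter (Algorithm 1):
    sweep candidate tie thresholds, predict "tie" whenever the absolute reward
    gap is at most the threshold, and keep the best three-class accuracy."""
    diff = np.asarray(reward_a, float) - np.asarray(reward_b, float)
    labels = np.asarray(labels)
    best = 0.0
    for tau in np.concatenate(([0.0], np.abs(diff))):  # every observed gap is a candidate threshold
        pred = np.where(np.abs(diff) <= tau, "tie", np.where(diff > 0, "A", "B"))
        best = max(best, float(np.mean(pred == labels)))
    return best

# Toy usage:
# accuracy_without_ties([2.1, 0.3, 1.0], [1.5, 0.9, 1.0], ["A", "B", "tie"])  # -> 1.0
# accuracy_with_ties([2.1, 0.3, 1.0], [1.5, 0.9, 1.0], ["A", "B", "tie"])     # -> 1.0
```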
We include multiple types of reward models in this evaluation:
- Sequence Classifiers (Seq. Classifiers): Models that take a prompt and a video as input and output a scalar score.
- Custom Classifiers: Research models with different architectures and training objectives.
- Random: Random choice baseline.
- Generative: Prompting fine-tuned vision-language models to choose between two answers; a sketch contrasting this with sequence classifiers follows the list.
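As a rough, purely illustrative contrast between the two main model types, the sketch below uses hypothetical model objects and method names (reward_model, vlm.generate); it is not an actual VideoGen-RewardBench or model API.

```python
def score_with_seq_classifier(reward_model, prompt, video_frames):
    """Sequence classifier: one forward pass maps (prompt, video) to a scalar reward;
    two such scores are later compared to decide "A wins", "B wins", or "ties"."""
    return float(reward_model(prompt, video_frames))  # e.g. 2.37

def choose_with_generative_judge(vlm, prompt, frames_a, frames_b):
    """Generative: a fine-tuned vision-language model is prompted to pick between
    the two candidate videos and answers in free text."""
    question = (
        f"Text prompt: {prompt}\n"
        "Which video follows the prompt better, A or B? Answer with a single letter."
    )
    answer = vlm.generate(question, videos=[frames_a, frames_b])
    return "A" if answer.strip().upper().startswith("A") else "B"
```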
Note: Models marked with an asterisk (*) after the model name report independently submitted scores that have not been verified by the VideoGen-RewardBench team.
Acknowledgments
Our leaderboard is built on RewardBench. The prompt-video pairs are sourced from VideoGen-Eval. We sincerely thank all the contributors!
How to Submit Your Results on VideoGen-RewardBench
Please follow the steps below to submit your reward model's results:
Step 1: Create an Issue
Open an issue in the VideoAlign GitHub repository.
Step 2: Calculate Accuracy Metrics
Use our provided scripts to compute your model's accuracy:
- Ties-Included Accuracy (w/ Ties): Use calc_accuracy_with_ties
- Ties-Excluded Accuracy (w/o Ties): Use calc_accuracy_without_ties
Step 3: Provide Your Results in the Issue
Within the issue, include your reward model's results in JSON format. For example:
```json
{
"with_tie": {
"overall": 61.26,
"vq": 59.68,
"mq": 66.03,
"ta": 53.80
},
"without_tie": {
"overall": 73.59,
"vq": 75.66,
"mq": 74.70,
"ta": 72.20
},
"model": "VideoReward",
"model_link": "https://huggingface.co/KwaiVGI/VideoReward",
"model_type": "Seq. Classifiers"
}
```
Additionally, please include any relevant information about your model (e.g., a brief description or methodology).
Step 4: Review and Leaderboard Update
We will review your issue promptly and update the leaderboard accordingly.