VideoGen-RewardBench: Evaluating Reward Models for Video Generation

Evaluating the capabilities of reward models for video generation.

Code | Project | Eval. Dataset | Paper | Total models: 5 | * Unverified models | ⚠️ Dataset Contamination | Last restart: 20:15 PST, 09 Feb 2025

| Model | Model Type | Avg. | Avg. (w/o Ties) | Avg. (w/ Ties) | Overall (w/o Ties) | VQ (w/o Ties) | MQ (w/o Ties) | TA (w/o Ties) | Overall (w/ Ties) | VQ (w/ Ties) | MQ (w/ Ties) | TA (w/ Ties) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Seq. Classifiers | 67.12 | 74.04 | 60.19 | 73.59 | 75.66 | 60.98 | 61.15 | 61.26 | 59.68 | 66.03 | 46.56 |

VQ, MQ, and TA denote Visual Quality, Motion Quality, and Text Alignment, respectively.

Overview

The prompt-video pairs are sourced from VideoGen-Eval, a dataset featuring a diverse range of prompts and videos generated by state-of-the-art video diffusion models (VDMs). Our benchmark comprises 26.5k video pairs, each annotated with human preference labels.
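
For concreteness, a single annotated comparison could be represented as in the sketch below. The field names, file paths, and prompt here are hypothetical placeholders, not the dataset's actual schema.

```python
# One hypothetical annotated comparison; field names and paths are
# illustrative, not VideoGen-RewardBench's actual schema.
example_pair = {
    "prompt": "A corgi surfing a wave at sunset",  # generation prompt
    "video_A": "videos/model_x/0001.mp4",          # candidate from one VDM
    "video_B": "videos/model_y/0001.mp4",          # candidate from another VDM
    # Preference labels ("A wins", "B wins", or "ties"), one per dimension
    "Overall": "A wins",
    "VQ": "A wins",   # Visual Quality
    "MQ": "ties",     # Motion Quality
    "TA": "B wins",   # Text Alignment
}
```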

We report two accuracy metrics: ties-included accuracy (w/ Ties) and ties-excluded accuracy (w/o Ties).

  • For ties-excluded accuracy, we exclude all data labeled as "ties" and use only data labeled as "A wins" or "B wins". We compute the reward for each prompt-video pair, convert the relative reward relationship within each pair into a binary label, and calculate classification accuracy.
  • For ties-included accuracy, we adopt Algorithm 1 proposed by Ties Matter. This method traverses all possible tie thresholds, calculates three-class accuracy at each threshold, and selects the highest accuracy as the final metric. See calc_accuracy for the implementation of ties-included accuracy; a sketch of both metrics follows this list.
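
As a concrete illustration, here is a minimal sketch of both metrics in Python, assuming one scalar reward per prompt-video pair and gold labels in {"A", "B", "ties"}. It mirrors the logic described above but is not the leaderboard's actual calc_accuracy code.

```python
def ties_excluded_accuracy(rewards_a, rewards_b, labels):
    """Drop "ties" pairs, then score the binary prediction sign(r_A - r_B)."""
    correct = total = 0
    for ra, rb, y in zip(rewards_a, rewards_b, labels):
        if y == "ties":
            continue
        pred = "A" if ra > rb else "B"
        correct += pred == y
        total += 1
    return correct / total

def ties_included_accuracy(rewards_a, rewards_b, labels):
    """In the spirit of Ties Matter's Algorithm 1: sweep tie thresholds
    over the observed reward gaps, score three-class accuracy at each
    threshold, and keep the best."""
    diffs = [ra - rb for ra, rb in zip(rewards_a, rewards_b)]
    # Candidate thresholds: every observed absolute reward gap, plus 0.
    candidates = sorted({0.0, *(abs(d) for d in diffs)})
    best = 0.0
    for tau in candidates:
        correct = 0
        for d, y in zip(diffs, labels):
            if abs(d) <= tau:
                pred = "ties"   # gap too small to call a winner
            else:
                pred = "A" if d > 0 else "B"
            correct += pred == y
        best = max(best, correct / len(labels))
    return best
```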

We include multiple types of reward models in this evaluation:

  1. Sequence Classifiers (Seq. Classifiers): Models that take in a prompt and a video and output a scalar score.
  2. Custom Classifiers: Research models with different architectures and training objectives.
  3. Random: Random choice baseline.
  4. Generative: Prompting fine-tuned models to choose directly between the two videos; a sketch of these interfaces follows this list.
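
The sketch below shows how these model types can be reduced to a common pairwise-choice interface for evaluation; score_fn and judge_fn are hypothetical callables standing in for real models.

```python
import random

def choice_from_seq_classifier(score_fn, prompt, video_a, video_b):
    # Sequence classifier: score each (prompt, video) pair independently
    # and prefer the higher-scoring video.
    return "A" if score_fn(prompt, video_a) > score_fn(prompt, video_b) else "B"

def choice_from_generative(judge_fn, prompt, video_a, video_b):
    # Generative judge: ask the model to pick a side directly;
    # judge_fn is assumed to return "A" or "B".
    return judge_fn(prompt, video_a, video_b)

def choice_random(prompt, video_a, video_b):
    # Random baseline: ignore the inputs entirely.
    return random.choice(["A", "B"])
```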

Note: Models marked with an asterisk (*) are independently submitted scores that have not been verified by the VideoGen-RewardBench team.

Acknowledgments

Our leaderboard is built on RewardBench. The prompt-video pairs are sourced from VideoGen-Eval. We sincerely thank all the contributors!