This repository collects methods for evaluating visual generation.
It brings together works that aim to answer critical questions in the field, such as:
- Model Evaluation: How does one determine the quality of a specific image or video generation model?
- Sample/Content Evaluation: What methods can be used to evaluate the quality of a particular generated image or video?
- User Control Consistency Evaluation: How well do the generated images and videos align with the user controls and inputs?
This repository is updated periodically. If you have suggestions for additional resources, updates on methodologies, or fixes for broken links, please feel free to do any of the following:
- raise an Issue,
- nominate awesome related works via Pull Requests, or
- contact us via email (ZIQI002 at e dot ntu dot edu dot sg).
- 1. Evaluation Metrics of Generative Models
- 2. Evaluation Metrics of Condition Consistency
- 3. Evaluation Systems of Generative Models
- 4. Improving Visual Generation with Evaluation / Feedback / Reward
- 5. Quality Assessment for AIGC
- 6. Study and Rethinking
- 7. Other Useful Resources
Metric | Paper | Code |
---|---|---|
Inception Score (IS) | Improved Techniques for Training GANs (NeurIPS 2016) | |
Fréchet Inception Distance (FID) | GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017) | |
Kernel Inception Distance (KID) | Demystifying MMD GANs (ICLR 2018) | |
CLIP-FID | The Role of ImageNet Classes in Fréchet Inception Distance (ICLR 2023) | |
Precision-and-Recall | Assessing Generative Models via Precision and Recall (2018-05-31, NeurIPS 2018); Improved Precision and Recall Metric for Assessing Generative Models (NeurIPS 2019) | |
Renyi Kernel Entropy (RKE) | An Information-Theoretic Evaluation of Generative Models in Learning Multi-modal Distributions (NeurIPS 2023) | |
CLIP Maximum Mean Discrepancy (CMMD) | Rethinking FID: Towards a Better Evaluation Metric for Image Generation (CVPR 2024) | |
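Several metrics in this family (FID, CLIP-FID, and the video variants FID-vid and FVD) share the same closed form: fit a Gaussian to the real and generated feature sets and compute the Fréchet distance between the two Gaussians. Below is a minimal NumPy sketch operating on precomputed features; the synthetic `feats` array is a stand-in for Inception-v3 (or CLIP / I3D) embeddings, and production implementations such as `pytorch-fid` or `torchmetrics` compute a matrix square root directly, whereas this sketch uses the equivalent eigenvalue form of the trace term.

```python
import numpy as np

def fid(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two feature sets.

    FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * (C1 C2)^(1/2)),
    computed on (N, D) arrays of precomputed features.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    # Tr((C1 C2)^(1/2)) equals the sum of square roots of the eigenvalues
    # of C1 @ C2, which are real and non-negative for PSD C1, C2.
    eigvals = np.linalg.eigvals(c1 @ c2).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1) + np.trace(c2) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 8))   # stand-in for real Inception features
print(fid(feats, feats))             # ≈ 0 for identical feature sets
print(fid(feats, feats + 5.0))       # ≈ 200: mean shift of 5 in 8 dims
```

Note that FID is a biased estimator at small sample sizes, which is precisely what "Effectively Unbiased FID" and KID (listed here) address.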
- Global-Local Image Perceptual Score (GLIPS): Evaluating Photorealistic Quality of AI-Generated Images (2024-05-15)
- Unifying and extending Precision Recall metrics for assessing generative models (2024-05-02)
- Virtual Classifier Error (VCE) from Virtual Classifier: A Reversed Approach for Robust Image Evaluation (2024-03-04)
- An Interpretable Evaluation of Entropy-based Novelty of Generative Models (2024-02-27)
- Attribute Based Interpretable Evaluation Metrics for Generative Models (2023-10-26)
- LGSQE: Lightweight Generated Sample Quality Evaluation (2022-11-08)
- Rarity Score: A New Metric to Evaluate the Uncommonness of Synthesized Images (2022-06-17)
- TREND: Truncated Generalized Normal Density Estimation of Inception Embeddings for GAN Evaluation (2021-04-30, ECCV 2022)
- CFID from Conditional Frechet Inception Distance (2021-03-21)
- CIS from Evaluation Metrics for Conditional Image Generation (2020-04-26)
- Text-To-Image Synthesis Method Evaluation Based On Visual Patterns (2020-04-09)
- SceneFID from Object-Centric Image Generation from Layouts (2020-03-16)
- Reliable Fidelity and Diversity Metrics for Generative Models (2020-02-23, ICML 2020)
- Effectively Unbiased FID and Inception Score and where to find them (2019-11-16, CVPR 2020)
Metric | Paper | Code |
---|---|---|
FID-vid | GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017) | |
Fréchet Video Distance (FVD) | Towards Accurate Generative Models of Video: A New Metric & Challenges (arXiv 2018); FVD: A new Metric for Video Generation (2019-05-04, ICLR 2019 Workshop DeepGenStruct) | |
- Linear Separability & Perceptual Path Length (PPL) from A Style-Based Generator Architecture for Generative Adversarial Networks (2020-01-09)
Metric | Condition | Pipeline | Code | References |
---|---|---|---|---|
CLIP Score (a.k.a. CLIPSIM) | Text | cosine similarity between the CLIP image and text embeddings | PyTorch Lightning | CLIP Paper (ICML 2021). Metric first used in CLIPScore Paper (arXiv 2021); GODIVA Paper (arXiv 2021) applies it to video evaluation. |
Mask Accuracy | Segmentation Mask | predict the segmentation mask, and compute pixel-wise accuracy against the ground-truth segmentation mask | any segmentation method for your setting | |
DINO Similarity | Image of a Subject (human / object, etc.) | cosine similarity between the DINO embeddings of the generated image and the condition image | | DINO paper. Metric proposed in DreamBooth. |
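The CLIP Score pipeline in the table reduces to a scaled cosine similarity between two embedding vectors (DINO similarity is the same computation on DINO embeddings of two images). The sketch below assumes the embeddings have already been produced by CLIP's image and text encoders — `image_emb` and `text_emb` are placeholders for those outputs; the scale factor w = 2.5 and the clamping at zero follow the CLIPScore paper.

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style metric: w * max(cos(image_emb, text_emb), 0).

    image_emb and text_emb stand in for outputs of CLIP's image and
    text encoders; w = 2.5 is the scaling from the CLIPScore paper.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return w * max(float(image_emb @ text_emb), 0.0)

v = np.array([0.6, 0.8])
print(clip_score(v, v))                    # ≈ 2.5 for identical embeddings
print(clip_score(np.array([1.0, 0.0]),
                 np.array([0.0, 1.0])))    # 0.0 for orthogonal embeddings
```

In practice the two vectors come from a pretrained CLIP model's `encode_image` / `encode_text` outputs, averaged over frames for video (as in GODIVA's CLIPSIM usage).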
- Manipulation Direction (MD) from Manipulation Direction: Evaluating Text-Guided Image Manipulation Based on Similarity between Changes in Image and Text Modalities (2023-11-20)
- Semantic Similarity Distance: Towards better text-image consistency metric in text-to-image generation (2022-12-02)
- Visual-Semantic (VS) Similarity from Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network (2018-12-26)
- Semantically Invariant Text-to-Image Generation (2018-09-06)
  Note: They evaluate image-text similarity via image captioning
- Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis (2018-01-16)
  Note: An object detector based metric is proposed.
Metrics | Paper | Code |
---|---|---|
Learned Perceptual Image Patch Similarity (LPIPS) | The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (2018-01-11, CVPR 2018) | |
Structural Similarity Index (SSIM) | Image quality assessment: from error visibility to structural similarity (TIP 2004) | |
Peak Signal-to-Noise Ratio (PSNR) | - | |
Multi-Scale Structural Similarity Index (MS-SSIM) | Multiscale structural similarity for image quality assessment (SSC 2004) | PyTorch-Metrics |
Feature Similarity Index (FSIM) | FSIM: A Feature Similarity Index for Image Quality Assessment (TIP 2011) | |
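Of the metrics in the table, PSNR is the only one with a simple closed form: a log-scaled mean squared error between a reference and a test image. Below is a minimal NumPy sketch; for SSIM, MS-SSIM, and LPIPS you would use a tested implementation such as `torchmetrics` or `scikit-image` rather than rolling your own.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio: 10 * log10(MAX^2 / MSE).

    Returns infinity for identical images (MSE = 0).
    """
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

ref = np.zeros((4, 4), dtype=np.uint8)
noisy = ref + 10                 # every pixel off by 10 -> MSE = 100
print(psnr(ref, noisy))          # 10 * log10(255^2 / 100) ≈ 28.13
```

Note that PSNR and SSIM reward pixel-level fidelity to a specific reference, so they suit reconstruction-style tasks (editing, inpainting, frame interpolation) rather than open-ended generation.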
The community has also been using DINO or CLIP features to measure the semantic similarity of two images / frames.
There are also recent works on new methods to measure visual similarity (more will be added):
- Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability (2023-12-17, CVPR 2024)
- HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models (2019-04-01)
- Multidimensional Preference Score from Learning Multi-dimensional Human Preference for Text-to-Image Generation (2024-05-23)
- Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings (2024-04-25)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation (2024-04-23)
- TAVGBench: Benchmarking Text to Audible-Video Generation (2024-04-22)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control (2024-04-21)
- GenAI-Bench: A Holistic Benchmark for Compositional Text-to-Visual Generation (2024-04-09)
  Note: GenAI-Bench was introduced in a previous paper 'Evaluating Text-to-Visual Generation with Image-to-Text Generation'
- Evaluating Text-to-Visual Generation with Image-to-Text Generation (2024-04-01)
- FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models (2024-03-25)
- Exploring GPT-4 Vision for Text-to-Image Synthesis Evaluation (2024-03-20)
- Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis (2024-03-08)
- An Information-Theoretic Evaluation of Generative Models in Learning Multi-modal Distributions (2024-02-13)
- CAS: A Probability-Based Approach for Universal Condition Alignment Score (2024-01-16)
  Note: Condition alignment of text-to-image, {instruction, image}-to-image, edge-/scribble-to-image, and text-to-audio
- VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation (2023-12-22)
- Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods (2023-12-11)
- A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics (2023-12-04)
- SelfEval: Leveraging the discriminative nature of generative models for evaluation (2023-11-17)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks (2023-11-02)
- Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation (2023-10-27, ICLR 2024)
- DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design (2023-10-23)
- GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment (2023-10-17)
- Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy (2023-10-13)
- ImagenHub: Standardizing the evaluation of conditional image generation models (2023-10-02)
  GenAI-Arena
- Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation (2023-09-26, ICLR 2024)
- Concept Score from Text-to-Image Generation for Abstract Concepts (2023-09-26)
- OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation (2023-09-23)
  Note: evaluates the task of interleaved image and text generation
- LEICA from Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment (2023-08-16)
- Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation (2023-07-18)
- T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation (2023-07-12)
- TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation (2023-07-11, WACV 2024)
- Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback (2023-07-10, NeurIPS 2023)
- Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis (2023-06-15)
- ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models (2023-06-07, AAAI 2024)
- Visual Programming for Text-to-Image Generation and Evaluation (2023-05-24, NeurIPS 2023)
- LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation (2023-05-18, NeurIPS 2023)
- X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models (2023-05-18)
- What You See is What You Read? Improving Text-Image Alignment Evaluation (2023-05-17, NeurIPS 2023)
- Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation (2023-05-02)
- HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models (2023-04-11, ICCV 2023)
- Human Preference Score: Better Aligning Text-to-Image Models with Human Preference (2023-03-25, ICCV 2023)
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering (2023-03-21, ICCV 2023)
- Benchmarking Spatial Relationships in Text-to-Image Generation (2022-12-20)
- MMI and MOR from Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift (2022-12-15)
- Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark (2022-11-22)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator (2022-09-29)
- DrawBench from Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (2022-05-23)
- Benchmark for Compositional Text-to-Image Synthesis (2021-07-29)
- TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation (2021-12-02, ECCV 2022)
- Semantic Object Accuracy for Generative Text-to-Image Synthesis (2019-10-29)
- Diffusion Model-Based Image Editing: A Survey (2024-02-27)
- LEDITS++: Limitless Image Editing using Text-to-Image Models (2023-11-28)
- Emu Edit: Precise Image Editing via Recognition and Generation Tasks (2023-11-16)
- EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods (2023-10-03)
- MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing (2023-06-16)
  Note: dataset only
- Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting (2022-12-13, CVPR 2023)
- Imagic: Text-Based Real Image Editing with Diffusion Models (2022-10-17)
- Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model (2021-11-26)
- Exposing AI-generated Videos: A Benchmark Dataset and a Local-and-Global Temporal Defect Based Detection Method (2024-05-07)
- Sora Detector: A Unified Hallucination Detection for Large Text-to-Video Models (2024-05-07)
  Note: hallucination detection
- Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap (2024-04-21)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment (2024-03-18)
- Sora Generates Videos with Stunning Geometrical Consistency (2024-02-27)
- STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models (2024-01-30)
- Towards A Better Metric for Text-to-Video Generation (2024-01-15)
- VBench: Comprehensive Benchmark Suite for Video Generative Models (2023-11-29)
- FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation (2023-11-03)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models (2023-10-17)
- Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset (2023-09-14)
- StoryBench: A Multifaceted Benchmark for Continuous Story Visualization (2023-08-22, NeurIPS 2023)
- CelebV-Text: A Large-Scale Facial Text-Video Dataset (2023-03-26, CVPR 2023)
- I2V-Bench from ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation (2024-02-06)
- AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI (2024-01-03)
- VBench-I2V (2024-03) from VBench: Comprehensive Benchmark Suite for Video Generative Models (2023-11-29)
- A Benchmark for Controllable Text-Image-to-Video Generation (2023-06-12)
- OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance (2024-05-23)
- Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation (2024-05-07)
- THQA: A Perceptual Quality Assessment Database for Talking Heads (2024-04-13)
- A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos (2024-03-11)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert (2023-03-29, CVPR 2023)
- MoDiPO: text-to-motion alignment via AI-feedback-driven Direct Preference Optimization (2024-05-06)
- Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language (2023-05-25)
- FAIntbench: A Holistic and Precise Benchmark for Bias Evaluation in Text-to-Image Models (2024-05-28)
- Condition Likelihood Discrepancy from Membership Inference on Text-to-Image Diffusion Models via Conditional Likelihood Discrepancy (2024-05-23)
- Towards Geographic Inclusion in the Evaluation of Text-to-Image Models (2024-05-07)
- UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images (2024-05-06)
- Survey of Bias In Text-to-Image Generation: Definition, Evaluation, and Mitigation (2024-04-01)
- VBench-Trustworthiness (2024-03) from VBench: Comprehensive Benchmark Suite for Video Generative Models (2023-11-29)
- Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts (2024-03-17, NAACL 2024)
- Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis (2024-03-08)
- ViSAGe: A Global-Scale Analysis of Visual Stereotypes in Text-to-Image Generation (2024-01-02)
- Distribution Bias, Jaccard Hallucination, Generative Miss Rate from Quantifying Bias in Text-to-Image Generative Models (2023-12-20)
- Holistic Evaluation of Text-To-Image Models (2023-11-07)
- Sociotechnical Safety Evaluation of Generative AI Systems (2023-10-18)
- Navigating Cultural Chasms: Exploring and Unlocking the Cultural POV of Text-To-Image Models (2023-10-03)
- DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity (2023-08-11)
- On the Cultural Gap in Text-to-Image Generation (2023-07-06)
- Disparities in Text-to-Image Model Concept Possession Across Languages (2023-06-12)
- Multilingual Conceptual Coverage in Text-to-Image Models (2023-06-02, ACL 2023)
- T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation (2023-06-01)
- Inspecting the Geographical Representativeness of Images from Text-to-Image Models (2023-05-18)
- How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? (2022-10-27)
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models (2022-02-08, ICCV 2023)
The following works are not for visual generation, but are related evaluations of other models such as LLMs:
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (2024-02-06)
- FACET: Fairness in Computer Vision Evaluation Benchmark (2023-08-31)
- Gender Biases in Automatic Evaluation Metrics for Image Captioning (2023-05-24)
- Fairness Indicators for Systematic Assessments of Visual Feature Extractors (2022-02-15)
- Scene Graph (SG)-IoU, Relation-IoU, and Entity-IoU (using GPT-4v) from SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance (2024-05-24)
- Relation Accuracy & Entity Accuracy from ReVersion: Diffusion-Based Relation Inversion from Images (2023-03-23)
- T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback (2024-05-29)
- Class-Conditional self-reward mechanism for improved Text-to-Image models (2024-05-22)
- Understanding and Evaluating Human Preferences for AI Generated Images with Instruction Tuning (2024-05-12)
- Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models (2024-05-01)
- ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning (2024-04-23)
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback (2024-04-11)
- UniFL: Improve Stable Diffusion via Unified Feedback Learning (2024-04-08)
- ByteEdit: Boost, Comply and Accelerate Generative Image Editing (2024-04-07)
- Aligning Diffusion Models by Optimizing Human Utility (2024-04-06)
- CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching (2024-04-04)
- VersaT2I: Improving Text-to-Image Models with Versatile Reward (2024-03-27)
- Improving Text-to-Image Consistency via Automatic Prompt Optimization (2024-03-26)
- RL for Consistency Models: Faster Reward Guided Text-to-Image Generation (2024-03-25)
- AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation (2024-03-20)
- Reward Guided Latent Consistency Distillation (2024-03-16)
- A Dense Reward View on Aligning Text-to-Image Diffusion with Preference (2024-02-13, ICML 2024)
- InstructVideo: Instructing Video Diffusion Models with Human Feedback (2023-12-19)
- Rich Human Feedback for Text-to-Image Generation (2023-12-15, CVPR 2024)
- InstructBooth: Instruction-following Personalized Text-to-Image Generation (2023-12-04)
- DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback (2023-11-29)
- Diffusion Model Alignment Using Direct Preference Optimization (2023-11-21)
- Aligning Text-to-Image Diffusion Models with Reward Backpropagation (2023-10-05)
- Directly Fine-Tuning Diffusion Models on Differentiable Rewards (2023-09-29)
- Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback (2023-07-10, NeurIPS 2023)
- DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models (2023-05-25, NeurIPS 2023)
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation (2023-04-12)
- Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models (2023-04-02, ICLR 2024)
- Human Preference Score: Better Aligning Text-to-Image Models with Human Preference (2023-03-25)
- Aligning Text-to-Image Models using Human Feedback (2023-02-23)
- Descriptive Image Quality Assessment in the Wild (2024-05-29)
- Large Multi-modality Model Assisted AI-Generated Image Quality Assessment (2024-04-27)
- Adaptive Mixed-Scale Feature Fusion Network for Blind AI-Generated Image Quality Assessment (2024-04-23)
- PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition (2024-04-20)
- AIGIQA-20K: A Large Database for AI-Generated Image Quality Assessment (2024-04-04)
- AIGCOIQA2024: Perceptual Quality Assessment of AI Generated Omnidirectional Images (2024-04-01)
- Bringing Textual Prompt to AI-Generated Image Quality Assessment (2024-03-27, ICME 2024)
- TIER: Text-Image Encoder-based Regression for AIGC Image Quality Assessment (2024-01-08)
- Exploring the Naturalness of AI-Generated Images (2023-12-09)
- PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images (2023-11-27)
- AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment (2023-06-07)
- A Perceptual Quality Assessment Exploration for AIGC Images (2023-03-22)
- GIQA: Generated Image Quality Assessment (2020-03-19)
- Multi-modal Learnable Queries for Image Aesthetics Assessment (2024-05-02, ICME 2024)
- Aesthetic Scorer extension for SD Automatic WebUI (2023-01-15)
- Rethinking Image Aesthetics Assessment: Models, Datasets and Benchmarks (2022-07-01)
- LAION-Aesthetics_Predictor V2: CLIP+MLP Aesthetic Score Predictor (2022-06-26)
- Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) (2024-04-05)
Note: Refer to table 2 for evaluation metrics for long video generation
- On the Content Bias in Fréchet Video Distance (2024-04-18, CVPR 2024)
- On the Evaluation of Generative Models in Distributed Learning Tasks (2023-10-18)
- Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models (2023-06-07, NeurIPS 2023)
- Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation (2023-04-04, CVPR 2023)
- Revisiting the Evaluation of Image Synthesis with GANs (2023-04-04)
- A Study on the Evaluation of Generative Models (2022-06-22)
- On the Robustness of Quality Measures for GANs (2022-01-31, ECCV 2022)
- A Note on the Inception Score (2018-01)
- An empirical study on evaluation metrics of generative adversarial networks (2018-06-19)
- Stanford Course: CS236 "Deep Generative Models" - Lecture 15 "Evaluation of Generative Models" [slides]