
TTV: Towards advancing text-to-video generation with generative AI models and a comprehensive study of model fidelity, performance, and human perception

Text-to-video generation has rapidly evolved into a groundbreaking application of generative AI, with the potential to revolutionize both creative and industrial sectors. Despite these advances, the fidelity, performance, and real-world applicability of current models remain inadequately explored. This research addresses that gap by evaluating three cutting-edge text-to-video models: Runway Gen2, CogVideoX-2B, and CogVideoX-5B. The primary objectives of this study are to (1) conduct a comprehensive evaluation of these models using established quantitative metrics, namely Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), and CLIPScore, to measure video quality, realism, and alignment with the text input; (2) gather human perceptual data to assess perceived realism, quality, and accuracy; and (3) compare the models to identify strengths, weaknesses, and areas for improvement. By combining quantitative evaluations with human insights, this research provides a well-rounded perspective on the models' performance. The results reveal significant performance differences across the models, shedding light on both their technical capabilities and their alignment with user expectations. This research contributes a rigorous benchmark of current text-to-video models and underscores the importance of human perception in guiding future advances in text-to-video generation.
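
For reference, FID treats the Inception-embedding distributions of real and generated frames as Gaussians $\mathcal{N}(\mu_r, \Sigma_r)$ and $\mathcal{N}(\mu_g, \Sigma_g)$ and reports the Fréchet distance between them; a standard formulation is

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\bigl(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\bigr),$$

where lower values indicate generated samples closer to the real distribution. FVD applies the same construction to spatiotemporal features from a pretrained video network, and CLIPScore is commonly computed as $2.5 \cdot \max(\cos(e_{\text{text}}, e_{\text{frame}}), 0)$, averaged over sampled frames in the video setting.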

Tasnim Akter Onisha
Georgia Southern University
United States
to04386@georgiasouthern.edu


Hayden Wimmer
Georgia Southern University
United States
hwimmer@georgiasouthern.edu


Carl Rebman
University of San Diego
United States
carlr@sandiego.edu