We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos of up to 102 frames from both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
A model with 30B parameters generates up to 102 frames of video from text and images, setting a new benchmark in text-driven image-to-video generation.
- Year: 2025
- Venue: arXiv 2025
- Authors: 54
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2503.11251
- TL;DR: Semantic Scholar
Authors
Daxin Jiang, Liangyu Chen, Jing Li, Jie Wu, Chenfei Wu, Tianyu Wang, Zheng Ge, Ranchen Ming, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Haoyang Huang, Nan Duan, Bo wang, Shuchang Zhou, Heung-Yeung Shum, Wei Chen, Shengming Yin, Yu Zhou, Wei Ji, Wen Sun, Jiansheng Chen, Xinhao Zhang, Guoqing Ma, Xin Han, Brian Li, Changyi Wan, Deshan Sun, Guanzhe Huang, Kaijun Tan, Kang An, Xing Chen, Yuchu Luo, Aojie Li, Yuhe Yin, Jianchang Wu, Junjing Guo, Jiashuai Liu, Qiling Wu, Ran Sun, Shanshan Yuan, Sitong Liu, Yaqi Dai, Zhisheng Guan, Xiaoniu Song, Bizhu Huang, Huixin Xiong, Qiaohui Chen, Zhiying Lu, Jiaxin He, Jianlong Yuan, Yi Xiu