Jinbin Bai

Jinbin received his B.S. in Computer Science from Nanjing University and is currently a student in the Department of Computer Science at the National University of Singapore. His research focuses on masked generative modeling, unified multimodal generation, and interactive world models, with an emphasis on building visual-prior-driven systems for content creation and long-horizon generative intelligence (with memory). He works with Prof. Shuicheng Yan and Prof. Ming-Hsuan Yang.

I am trying to find ways to build interactive video world models and algorithms for content creation. I want to build worlds from visual priors, though I sadly agree that language priors dominate current unified models.

News

2026-04 Three papers accepted to ICML 2026, see you in Seoul, South Korea!
2026-02 Three papers accepted to CVPR 2026, see you in Denver, United States!
2026-01 One paper accepted to ICLR 2026, see you in Rio de Janeiro, Brazil!
2025-09 Two papers accepted to NeurIPS 2025.
2025-06 Two papers accepted to ICCV 2025.
2025-04 One paper accepted to CVPR 2025 AI for Content Creation Workshop.
2025-04 Invited Talk from Riot Video Games.
2025-03 Awarded Frontier Top Ten Young Scholars Award (1st) from Century Frontier Asset Management.
2025-03 Invited Talk from University of Illinois Urbana-Champaign (UIUC).
2025-02 One paper accepted to CVPR 2025.
2025-01 One paper accepted to ICLR 2025, see you in Singapore!
2024-11 Invited Talk from Safe SuperIntelligence (SSI) Club.
2024-04 One paper accepted to IJCAI 2024, see you in Jeju, South Korea!
2023-07 Two papers accepted to ICCV 2023.

Selected Publications


Threshold-Guided Optimization for Visual Generative Models
Jinbin Bai, Yu Lei, Qingyu Shi, Aosong Feng, Yi Xin, Zhuoran Zhao, Fei Shen, Kaidong Yu, Xiangtai Li
International Conference on Machine Learning (ICML) 2026
[Paper] [Media_Report_CN]


Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
Jinbin Bai, Yixuan Li, Yuchen Zhu, Yi Xin, Qingyu Shi, Aosong Feng, Xiaohong Liu, Molei Tao, Jianru Xue, Xiangtai Li, Ming-Hsuan Yang
International Conference on Machine Learning (ICML) 2026
[Paper] [GitHub] [Media_Report_CN]
An efficient test-time scaling method for masked diffusion models that prunes candidates with a self-verifier and branches via remasking, unlocking their full generative potential. Works well for both image generation and text generation!


From Masks to Worlds: A Hitchhiker’s Guide to World Models
Jinbin Bai, Yu Lei, Hecong Wu, Yuchen Zhu, Shufan Li, Yi Xin, Xiangtai Li, Molei Tao, Aditya Grover, Ming-Hsuan Yang
Technical Report 2025
[Paper] [GitHub] [Media_Report_CN] [YouTube_EN] [YouTube_KO]
A Hitchhiker’s guide for those who want to build worlds. We follow one clear road: from early masked models, to unified architectures that share a single paradigm, to interactive generative models, and finally to memory-augmented systems that sustain consistent worlds over time.


Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
Yi Xin, ..., Jinbin Bai, ... (Alpha VLLM Team)
Technical Report 2025
[Paper] [Model] [Code]
Lumina-DiMOO is a unified masked diffusion model that not only generates high-resolution images but also supports multimodal capabilities including text-to-image, image-to-image, and image understanding. It achieves SOTA performance and enables a novel application, Interactive Retouching!


Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Qingyu Shi*, Jinbin Bai*, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan (* denotes equal contribution)
International Conference on Learning Representations (ICLR) 2026
[Paper] [Model] [Code] [Media_Report_CN]
Muddit (official Meissonic II) is a unified masked diffusion model that not only generates high-resolution images but also supports multimodal capabilities including text-to-image, image-to-text, and VQA. We verified that one unified model can be trained from the visual prior learned by Meissonic!



Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Jinbin Bai, Tian Ye, Wei Chow, Enxin Song, Xiangtai Li, Zhen Dong, Lei Zhu, Shuicheng Yan
International Conference on Learning Representations (ICLR) 2025
[Paper] [Model] [Code] [Demo] [Discord_Discussion] [Tutorial_EN] [Tutorial_JA] [Media_Report_CN1] [Media_Report_CN2]
Meissonic is a text-to-image masked diffusion model that generates high-resolution images and is designed to run on consumer graphics cards. The figure on the left is generated by Meissonic.


Integrating View Conditions for Image Synthesis
Jinbin Bai, Zhen Dong, Aosong Feng, Xiao Zhang, Kaicheng Zhou
International Joint Conferences on Artificial Intelligence (IJCAI) 2024
[Paper]

Miscellaneous