
Research

Publications

CelebA
Deep Learning Face Attributes in the Wild

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each annotated with 40 attributes. The images cover large pose variations and background clutter. CelebA offers large diversity, large quantity, and rich annotations, including:

  • 10,177 identities,
  • 202,599 face images, and
  • 5 landmark locations and 40 binary attribute annotations per image.

The dataset can be used as training and test sets for the following computer vision tasks: face attribute recognition, face recognition, face detection, landmark (or facial part) localization, and face editing & synthesis.
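
As a quick illustration of the annotation structure, the sketch below loads CelebA through torchvision's built-in dataset class and reads one sample's attributes and landmarks. The root path is a placeholder, and the printed shapes assume the default aligned-and-cropped images.

    from torchvision import datasets, transforms

    celeba = datasets.CelebA(
        root="./data",                      # placeholder storage path
        split="train",                      # "train", "valid", "test", or "all"
        target_type=["attr", "landmarks"],  # 40 binary attributes + 5 landmarks
        transform=transforms.ToTensor(),
        download=True,
    )

    image, (attrs, landmarks) = celeba[0]
    print(image.shape)      # torch.Size([3, 218, 178]) for the aligned images
    print(attrs.shape)      # torch.Size([40]): one 0/1 label per attribute
    print(landmarks.shape)  # torch.Size([10]): five (x, y) landmark coordinates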

T2I-CompBench
Comprehensive Compositional Text-to-Image Generation

T2I-CompBench and T2I-CompBench++ are the first comprehensive benchmarks for compositional text-to-image generation. They uniquely incorporate both template-based and natural-language formats, and cover seen and unseen compositions as well as scenarios with multiple and mixed objects and attributes. The prompts in these benchmarks offer significant diversity, including:

  • 8,000 compositional prompts, featuring 2,470 nouns, 33 colors, 32 shapes, 23 textures, 10 spatial relationships, and 875 non-spatial relationships;
  • 4 categories and 8 sub-categories, addressing attribute binding, numeracy, object relationships, and complex compositions;
  • 4 types of evaluation metrics, specifically designed for accuracy assessment.

The benchmark can serve as both training and test sets for evaluating multiple text-to-image capabilities: compositional generation, prompt-following ability, and image generation from complex prompts.
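
A typical evaluation run generates one image per benchmark prompt and then scores the outputs with the benchmark's metrics. The sketch below is an assumption-laden illustration: the prompt-file path follows the repository's per-category layout, and the Stable Diffusion checkpoint is just an example model under test, not one prescribed by the benchmark.

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # example model under evaluation
        torch_dtype=torch.float16,
    ).to("cuda")

    # Assumed layout: one prompt per line in a per-category text file.
    with open("T2I-CompBench/examples/dataset/color_val.txt") as f:
        prompts = [line.strip() for line in f if line.strip()]

    for i, prompt in enumerate(prompts):
        image = pipe(prompt, num_inference_steps=30).images[0]
        image.save(f"outputs/{i:05d}.png")
    # The saved images are then scored with the benchmark's metrics,
    # e.g. BLIP-VQA for attribute binding or UniDet for spatial relations.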

UniAD
Planning-oriented Autonomous Driving

🚘 Planning-oriented philosophy: UniAD is a unified autonomous driving framework built around a planning-oriented philosophy. Rather than adopting standalone modular designs or plain multi-task learning, it casts a series of tasks, spanning perception, prediction, and planning, into a hierarchical pipeline in which every module ultimately serves the planning goal.

🏆 SOTA performance: All tasks within UniAD achieve SOTA performance, especially prediction and planning (motion forecasting: 0.71 m minADE; occupancy prediction: 63.4% IoU; planning: 0.31% avg. collision rate).
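
The hierarchy can be pictured as a single differentiable stack in which each stage consumes the previous stage's outputs and only the final stage emits the ego trajectory. The toy module below is a hand-written illustration of that flow, not UniAD's actual code: the linear layers merely stand in for its query-based transformer modules.

    import torch
    import torch.nn as nn

    class PlanningOrientedStack(nn.Module):
        """Toy stand-in for the perception -> prediction -> planning hierarchy."""

        def __init__(self, dim: int = 256):
            super().__init__()
            self.perception = nn.Linear(dim, dim)  # stands in for tracking/mapping
            self.prediction = nn.Linear(dim, dim)  # stands in for motion/occupancy
            self.planner = nn.Linear(dim, 2)       # emits one 2-D waypoint per step

        def forward(self, bev_features: torch.Tensor) -> torch.Tensor:
            agents = self.perception(bev_features)  # detect and track agents
            futures = self.prediction(agents)       # forecast their motion
            return self.planner(futures)            # plan the ego trajectory last

    stack = PlanningOrientedStack()
    waypoints = stack(torch.randn(1, 6, 256))  # dummy BEV features, 6 plan steps
    print(waypoints.shape)  # torch.Size([1, 6, 2])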

AnimateDiff
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

AnimateDiff is an AI model that extends text-to-image diffusion models to animated video generation. It does so by integrating a motion module, trained on large amounts of video data, that learns realistic movement, allowing users to generate dynamic animation sequences directly from text prompts. Key features include:

  • Transforming static images into dynamic video.
  • Diverse animation styles, from anime to photorealistic.
  • Precise camera motion control (pan, zoom, rotate) using LoRA.
  • Compatibility with existing models like Stable Diffusion and ControlNet.

AnimateDiff significantly lowers the barrier to animation creation, enabling more creators to bring their textual ideas to life.
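
In practice, the motion module is loaded as an adapter on top of an existing text-to-image checkpoint. The sketch below uses the AnimateDiffPipeline from Hugging Face diffusers; the adapter and base-model IDs are public example checkpoints and are assumptions here, not the project's prescribed setup.

    import torch
    from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
    from diffusers.utils import export_to_gif

    # Example checkpoint IDs; any compatible SD 1.5 base model can be used.
    adapter = MotionAdapter.from_pretrained(
        "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
    )
    pipe = AnimateDiffPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        motion_adapter=adapter,
        torch_dtype=torch.float16,
    ).to("cuda")
    pipe.scheduler = DDIMScheduler.from_config(
        pipe.scheduler.config, beta_schedule="linear", clip_sample=False
    )

    frames = pipe(
        prompt="a corgi running on the beach, golden hour, photorealistic",
        num_frames=16,
        num_inference_steps=25,
        guidance_scale=7.5,
    ).frames[0]
    export_to_gif(frames, "corgi.gif")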

Explore All Works

TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
Parallelized Autoregressive Visual Generation
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
A Survey of Interactive Generative Video
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
Centaur: Robust End-to-End Autonomous Driving with Test-Time Training
Decoupled Diffusion Sparks Adaptive Scene Generation
DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation
GameFactory: Creating New Games with Generative Interactive Videos

Open Source

Discover All Projects and Datasets


Event

Nashville

CVPR 2025

2025.06