MMLab

Projects & Datasets

Project

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
AgiBot World Colosseo is a full-stack and open-source embodied intelligence ecosystem. Based on our hardware platform AgiBot G1, we construct AgiBot World—an open-source robot manipulation dataset collected by more than 100 homogeneous robots, providing high-quality data for challenging tasks spanning a wide spectrum of real-life scenarios.
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
UniVLA is a unified vision-language-action framework that enables policy learning across different environments. By deriving task-centric latent actions in an unsupervised manner, UniVLA can leverage data from arbitrary embodiments and perspectives without action labels. After large-scale pretraining on videos, UniVLA yields a cross-embodiment generalist policy that can be readily deployed to various robots by learning an action decoder at minimal cost.
Tags: Cross-Embodiment

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Vanilla autoregressive models without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaled properly.
Tags: Autoregressive Model, Image Generation

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Janus is a novel autoregressive framework that unifies multimodal understanding and generation.
Tags: Autoregressive Model

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
PixArt-α is a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators.
Tags: Diffusion Transformer

UniAD: Planning-oriented Autonomous Driving
UniAD is a unified autonomous driving algorithm framework that follows a planning-oriented philosophy. Instead of standalone modular designs or naive multi-task learning, it unifies a series of tasks, including perception, prediction, and planning, hierarchically within a single framework.
Tags: End-to-End, Autonomous Driving

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
A paradigm for autonomous driving that applies both transformer and temporal structures to generate bird's-eye-view (BEV) features from multi-camera inputs.
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
Scaffold-GS organizes sparse 3D Gaussians around structured anchors in space, dynamically predicting rendering attributes based on viewpoint and distance. It improves rendering quality and efficiency through structured anchor growth and pruning strategies.
Tags: 3D Gaussian Rendering, View-Adaptivity
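
To make the anchor idea concrete, here is a minimal Python sketch (illustrative only, not Scaffold-GS's actual code): each anchor carries a feature vector, and a learned function maps that feature plus the viewing direction and camera distance to rendering attributes such as opacity. The function name and the single linear unit below are toy stand-ins for the small MLPs the method trains.

```python
import math

def predict_gaussian_opacity(anchor_feature, anchor_pos, camera_pos, weights):
    """Toy stand-in for a view-adaptive attribute predictor.

    Scaffold-GS trains small neural networks; a single linear unit with a
    sigmoid is enough to illustrate the interface: the predicted attribute
    depends on the anchor's feature, the viewing direction, and the
    camera-to-anchor distance.
    """
    # Viewing direction (unit vector) and distance from camera to anchor.
    delta = [a - c for a, c in zip(anchor_pos, camera_pos)]
    dist = math.sqrt(sum(d * d for d in delta))
    direction = [d / dist for d in delta]
    # Concatenate inputs as a small MLP would see them.
    inputs = list(anchor_feature) + direction + [dist]
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))  # opacity in (0, 1)

# Example: the same anchor yields different opacity from different cameras,
# which is the "view-adaptive" part of the design.
feature = [0.5, -0.2]          # illustrative anchor feature
anchor_pos = (0.0, 0.0, 0.0)
w = [1.0, 1.0, 0.8, 0.5, 0.2, -0.1]  # feature(2) + direction(3) + distance(1)
o1 = predict_gaussian_opacity(feature, anchor_pos, (0.0, 0.0, 5.0), w)
o2 = predict_gaussian_opacity(feature, anchor_pos, (5.0, 0.0, 0.0), w)
```

Because attributes are predicted on the fly rather than stored per Gaussian, anchors can be grown or pruned without re-optimizing millions of explicit parameters.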

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
AnimateDiff enables animation generation from personalized text-to-image diffusion models by training plug-and-play motion modules. These modules learn transferable motion priors and use MotionLoRA for efficient adaptation without tuning the base models.
Tags: Text-to-Video, Diffusion Animation
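
A minimal sketch of the plug-and-play idea (illustrative only, not AnimateDiff's implementation): the frozen spatial layers of the image model process each frame independently, while the inserted motion module is the only component that mixes information across the time axis. `apply_spatial_layer` and the neighbor-averaging `motion_module` are toy stand-ins for the real frozen layers and temporal attention.

```python
def apply_spatial_layer(frame, scale=2.0):
    """Stand-in for a frozen image-model layer: acts on one frame at a time."""
    return [scale * x for x in frame]

def motion_module(frames):
    """Toy motion module: mixes each frame with its temporal neighbors.

    AnimateDiff's real module uses temporal attention; simple averaging is
    enough to show that it operates across frames, not within one.
    """
    mixed = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        mixed.append([(p + c + n) / 3.0
                      for p, c, n in zip(prev, frames[t], nxt)])
    return mixed

def animate(latents):
    # Per-frame spatial pass (frozen weights), then the inserted temporal pass.
    per_frame = [apply_spatial_layer(f) for f in latents]
    return motion_module(per_frame)

# Three 2-dimensional frame latents (illustrative values).
video = animate([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

Because only the motion module (and optionally a MotionLoRA adapter) is trained, the same module can be plugged into different personalized image models without retuning them.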

BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-Scale Scene Rendering
BungeeNeRF introduces a progressive training scheme that incrementally refines NeRF representations to support extreme multi-scale scene rendering, from city-scale context to high-detail objects.
Tags: NeRF, Multi-Scale Rendering

Dataset

AgiBot-World: The Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World ensures high quality and diversity across its data distribution.
DriveLM: Driving with Graph Visual Question Answering
Facilitating perception, prediction, planning, behavior, and motion tasks, with human-written reasoning logic connecting them.
OpenDV
The largest driving video dataset to date, containing more than 1700 hours of real-world driving videos.
Tags: Autonomous Driving, World Model