
Projects & Datasets

Projects

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
AgiBot World Colosseo is a full-stack and open-source embodied intelligence ecosystem. Based on our hardware platform AgiBot G1, we construct AgiBot World—an open-source robot manipulation dataset collected by more than 100 homogeneous robots, providing high-quality data for challenging tasks spanning a wide spectrum of real-life scenarios.
HoloPart: Generative 3D Part Amodal Segmentation
HoloPart is a novel diffusion-based model that completes partial 3D part segments into full, semantically meaningful parts, even when occluded. It combines local attention for fine-grained geometry and global context attention for shape consistency.
Tags: Generative Model, 3D Vision
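
As an illustration of that dual-attention design, here is a minimal PyTorch sketch; the block structure, dimensions, and wiring are assumptions for exposition, not the released HoloPart architecture:

import torch
import torch.nn as nn

class PartContextBlock(nn.Module):
    """Illustrative block: local self-attention over part tokens plus
    cross-attention to whole-shape context tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, part_tokens, shape_tokens):
        # Local attention: fine-grained geometry within the partial part.
        q = self.norm_local(part_tokens)
        x = part_tokens + self.local_attn(q, q, q)[0]
        # Global context attention: keep the completed part consistent with the whole object.
        x = x + self.global_attn(self.norm_global(x), shape_tokens, shape_tokens)[0]
        return x + self.mlp(self.norm_mlp(x))

part = torch.randn(2, 512, 256)    # tokens of the partial part
shape = torch.randn(2, 1024, 256)  # tokens of the full object
print(PartContextBlock()(part, shape).shape)  # torch.Size([2, 512, 256])
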
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
UniVLA is a unified vision-language-action framework that enables policy learning across different environments. By deriving task-centric latent actions in an unsupervised manner, UniVLA can leverage data from arbitrary embodiments and perspectives without action labels. After large-scale pretraining on videos, UniVLA yields a cross-embodiment generalist policy that can be deployed to various robots by learning a lightweight action decoder at minimal cost.
Tags: Cross-Embodiment
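
A hedged sketch of the latent-action idea: the policy predicts embodiment-agnostic latent actions, and only a small per-robot decoder maps them to real commands. The vocabulary size, network shapes, and decoder head are all assumed for illustration, not taken from the UniVLA release:

import torch
import torch.nn as nn

LATENT_ACTIONS = 16  # size of the discrete latent-action vocabulary (assumed)

class LatentPolicy(nn.Module):
    """Shared policy: observation -> logits over latent actions."""
    def __init__(self, obs_dim=512, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.latent_head = nn.Linear(hidden, LATENT_ACTIONS)  # embodiment-agnostic

    def forward(self, obs):
        return self.latent_head(self.backbone(obs))

class ActionDecoder(nn.Module):
    """Cheap per-robot head: latent action id -> continuous command."""
    def __init__(self, action_dim):
        super().__init__()
        self.embed = nn.Embedding(LATENT_ACTIONS, 64)
        self.out = nn.Linear(64, action_dim)

    def forward(self, latent_ids):
        return self.out(self.embed(latent_ids))

policy = LatentPolicy()
decoder_7dof = ActionDecoder(action_dim=7)       # e.g. a 7-DoF arm
latent = policy(torch.randn(1, 512)).argmax(-1)  # pick a latent action
print(decoder_7dof(latent).shape)                # torch.Size([1, 7])
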
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
AnimateDiff enables animation generation from personalized text-to-image diffusion models by training plug-and-play motion modules. These modules learn transferable motion priors and use MotionLoRA for efficient adaptation without tuning the base models.
Tags: Text-to-Video, Diffusion Animation
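
The core trick, a temporal attention module inserted alongside frozen spatial layers, can be sketched as follows; the tensor layout and module placement are assumptions, not the AnimateDiff source:

import torch
import torch.nn as nn

class MotionModule(nn.Module):
    """Plug-in module that attends only across the time axis of
    (batch, time, channels, h, w) features; spatial T2I layers stay frozen."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        # Fold space into the batch so attention runs along time only.
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        q = self.norm(seq)
        seq = seq + self.temporal_attn(q, q, q)[0]
        return seq.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

feats = torch.randn(1, 16, 64, 32, 32)    # 16 frames of U-Net features
print(MotionModule(64)(feats).shape)      # torch.Size([1, 16, 64, 32, 32])
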
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Vanilla autoregressive models without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaled properly.
Tags: Autoregressive Model, Image Generation
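
A minimal sketch of that recipe: quantize an image to discrete tokens, then train a plain decoder-only transformer with next-token prediction. The codebook size, sequence length, and model width below are illustrative assumptions, not the LlamaGen configuration:

import torch
import torch.nn as nn

VOCAB, SEQ, DIM = 16384, 256, 512  # codebook size, tokens per image, width (assumed)

embed = nn.Embedding(VOCAB, DIM)
pos = nn.Parameter(torch.zeros(SEQ, DIM))
block = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(block, num_layers=2)  # no 2D inductive bias
head = nn.Linear(DIM, VOCAB)

tokens = torch.randint(0, VOCAB, (1, SEQ))  # stand-in for VQ image-tokenizer output
causal = nn.Transformer.generate_square_subsequent_mask(SEQ)
logits = head(backbone(embed(tokens) + pos, mask=causal))
loss = nn.functional.cross_entropy(          # plain next-token prediction
    logits[:, :-1].reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
print(loss.item())
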
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Janus is a novel autoregressive framework that unifies multimodal understanding and generation by decoupling visual encoding into separate pathways while still relying on a single, unified transformer architecture.
Tags: Autoregressive Model
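
A toy sketch of the decoupling: two visual encoders (one for semantic features, one for generation tokens) feeding a single shared autoregressive core. All shapes and module choices are assumptions, not the Janus implementation:

import torch
import torch.nn as nn

class DecoupledEncoderModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.und_encoder = nn.Linear(768, dim)       # e.g. features from a semantic ViT
        self.gen_encoder = nn.Embedding(16384, dim)  # e.g. VQ tokens for generation
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=2)  # shared transformer

    def forward(self, visual_input, mode):
        # Separate pathways, so the two tasks no longer compete
        # for one visual representation.
        x = self.und_encoder(visual_input) if mode == "understand" \
            else self.gen_encoder(visual_input)
        return self.core(x)

m = DecoupledEncoderModel()
print(m(torch.randn(1, 196, 768), "understand").shape)        # torch.Size([1, 196, 256])
print(m(torch.randint(0, 16384, (1, 64)), "generate").shape)  # torch.Size([1, 64, 256])
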
SAMPart3D: Segment Any Part in 3D Objects
A scalable zero-shot 3D part segmentation framework that segments 3D objects into semantic parts at multiple granularities without requiring text prompts, enabling applications in robotics, 3D generation, and editing.
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
Scaffold-GS organizes sparse 3D Gaussians around structured anchors in space, dynamically predicting rendering attributes based on viewpoint and distance. It improves rendering quality and efficiency through structured anchor growth and pruning strategies.
Tags: 3D Gaussian Rendering, View-Adaptivity
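
A hedged sketch of the anchor mechanism: each anchor carries a learned feature, and small MLPs predict per-Gaussian attributes on the fly from the viewing direction and distance. Sizes and heads are illustrative, not the released Scaffold-GS code:

import torch
import torch.nn as nn

class Anchor(nn.Module):
    def __init__(self, feat_dim=32, gaussians_per_anchor=10):
        super().__init__()
        self.feat = nn.Parameter(torch.randn(feat_dim))  # per-anchor feature
        k = gaussians_per_anchor
        self.opacity = nn.Sequential(nn.Linear(feat_dim + 4, 32), nn.ReLU(), nn.Linear(32, k))
        self.color = nn.Sequential(nn.Linear(feat_dim + 4, 32), nn.ReLU(), nn.Linear(32, 3 * k))

    def forward(self, view_dir, dist):
        # Condition on the camera: unit view direction (3) + distance (1),
        # so attributes adapt to the viewpoint.
        x = torch.cat([self.feat, view_dir, dist])
        return torch.sigmoid(self.opacity(x)), torch.sigmoid(self.color(x)).view(-1, 3)

a = Anchor()
op, rgb = a(torch.tensor([0.0, 0.0, 1.0]), torch.tensor([2.5]))
print(op.shape, rgb.shape)  # torch.Size([10]) torch.Size([10, 3])
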
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
PixArt-α is a Transformer-based text-to-image (T2I) diffusion model whose generation quality is competitive with state-of-the-art image generators while requiring substantially lower training cost.
Tags: Diffusion Transformer

SAM3D: Segment Anything in 3D Scenes
SAM3D is a zero-shot framework that leverages SAM to segment 3D point clouds by projecting and merging 2D masks from posed RGB images, achieving fine-grained 3D segmentation without training.
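
The projection step can be made concrete with a little camera geometry. This sketch assumes a pinhole intrinsics matrix K, a camera-to-world pose, and per-pixel depth; it is one plausible setup, not the exact SAM3D pipeline:

import numpy as np

def lift_mask_to_3d(mask, depth, K, T_cam2world):
    """mask: (H,W) bool, depth: (H,W) metres, K: (3,3), T: (4,4).
    Lifts pixels inside a 2D SAM mask to world-frame 3D points,
    which can then be merged across posed frames."""
    v, u = np.nonzero(mask)                # pixel coordinates inside the mask
    z = depth[v, u]
    pix = np.stack([u * z, v * z, z])      # homogeneous pixel coords scaled by depth
    cam = np.linalg.inv(K) @ pix           # back-project into the camera frame
    cam_h = np.vstack([cam, np.ones(cam.shape[1])])
    return (T_cam2world @ cam_h)[:3].T     # (N, 3) world-frame points

H = W = 4
mask = np.zeros((H, W), bool); mask[1:3, 1:3] = True
depth = np.full((H, W), 2.0)
K = np.array([[2.0, 0, W / 2], [0, 2.0, H / 2], [0, 0, 1]])
print(lift_mask_to_3d(mask, depth, K, np.eye(4)).shape)  # (4, 3)
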
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
T2I-CompBench is a comprehensive benchmark with 6,000 compositional prompts across attribute binding, object relationships, and complex compositions. It includes novel evaluation metrics and GORS, a fine-tuning approach to enhance compositional generation capabilities.
Tags: Generative Model, Text-to-Image Generation

UniAD: Planning-oriented Autonomous Driving
UniAD is a unified autonomous driving framework built around a planning-oriented philosophy. Instead of standalone modular design or plain multi-task learning, we cast perception, prediction, and planning as a hierarchy of tasks that all contribute to planning.
Tags: End-to-End, Autonomous Driving

BEVFormer: Learning Bird's-Eye-View Representation From LiDAR-Camera via Spatiotemporal Transformers
A paradigm for autonomous driving that combines spatial Transformers with temporal modeling to generate bird's-eye-view (BEV) features.
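
A minimal sketch of the query mechanism: a grid of BEV queries attends to the previous frame's BEV (temporal) and to flattened multi-camera features (spatial). The shapes and single-layer structure are assumptions, not the BEVFormer code:

import torch
import torch.nn as nn

class BEVLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, bev_queries, prev_bev, img_feats):
        # Temporal self-attention against the aligned previous BEV.
        q = bev_queries + self.temporal(bev_queries, prev_bev, prev_bev)[0]
        # Spatial cross-attention against image features.
        return q + self.spatial(q, img_feats, img_feats)[0]

bev = torch.randn(1, 50 * 50, 256)        # 50x50 grid of BEV queries
prev = torch.randn(1, 50 * 50, 256)       # BEV from the previous frame
imgs = torch.randn(1, 6 * 900, 256)       # flattened multi-camera features
print(BEVLayer()(bev, prev, imgs).shape)  # torch.Size([1, 2500, 256])
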
BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-Scale Scene Rendering
BungeeNeRF introduces a progressive training scheme that incrementally refines NeRF representations to support extreme multi-scale scene rendering, from city-scale context to high-detail objects.
Tags: NeRF, Multi-Scale Rendering
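
A toy sketch of progressive growth: a shallow base network serves the coarsest scale, and residual blocks with their own output heads are appended as finer scales join training. The structure and sizes are assumed for illustration, not the BungeeNeRF release:

import torch
import torch.nn as nn

class ProgressiveNeRF(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.stages = nn.ModuleList()                    # one residual block per scale
        self.heads = nn.ModuleList([nn.Linear(dim, 4)])  # rgb + density per scale

    def grow(self, dim=64):
        """Call when training advances to a finer scale."""
        self.stages.append(nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)))
        self.heads.append(nn.Linear(dim, 4))

    def forward(self, xyz, scale):
        h = self.base(xyz)
        for s in range(scale):
            h = h + self.stages[s](h)   # residual refinement toward finer detail
        return self.heads[scale](h)

model = ProgressiveNeRF()
print(model(torch.randn(8, 3), scale=0).shape)  # coarse stage: torch.Size([8, 4])
model.grow()
print(model(torch.randn(8, 3), scale=1).shape)  # finer stage added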

Datasets

AgiBot-World: The Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World ensures a high-quality and diverse data distribution.
DriveLM: Driving with Graph Visual Question Answering
DriveLM facilitates perception, prediction, planning, behavior, and motion tasks, with human-written reasoning logic as the connection between them.
OpenDV
The largest driving video dataset to date, containing more than 1700 hours of real-world driving videos.
Tags: Autonomous Driving, World Model