Bimanual Robotic Manipulation with Vision-Language-Action Systems

Tagline: Training diffusion models for dual-arm manipulation
Category: Current Research
GitHub: spacefly

Context

Most VLA models are designed for single-arm manipulation. Bimanual tasks need dataset conversion and action representation that preserve dual-gripper coordination.

Role

Research Engineer owning dataset conversion, action heatmap generation, ControlNet+GenIMA training, and Hugging Face dataset compatibility fixes.

Work completed

Converted PerAct2 bimanual dataset (700+ demonstrations) to LLaVa prompt format
Generated dual-gripper action heatmaps:
- left gripper: blue/cyan
- right gripper: orange/red
- explicit open/closed states
Trained ControlNet with GenIMA on Stable Diffusion 1.5 using image and action conditioning
Fixed iterable dataset compatibility issues for large trajectory training
Improved validation with configurable guidance scales and trajectory-level logging

Impact and learning

Successfully trained with dual-gripper action conditioning at scale
Learned to bridge robotics action-vector datasets with modern prompt-driven VLA pipelines

Bhargav Limbasia

Explorer

Bimanual Robotic Manipulation with Vision-Language-Action Systems

Context

Role

Work completed

Impact and learning

Graph View

Table of Contents

Backlinks