Planned post on dataset conversion, dual-gripper heatmaps, and iterable dataset fixes.
Training a ControlNet Model for Bimanual Robotic Manipulation Heatmap Prediction
Overview
This project explores the use of diffusion models, specifically ControlNet, for predicting structured action representations in a bimanual robotic manipulation setting.
The objective is to learn a mapping from:
- Multi-view visual observations (four camera views)
- Task instruction (text prompt)
to:
- Spatial heatmaps encoding gripper positions and orientations
This problem is relevant in robotics, where translating perception into structured intermediate representations is a key step toward actionable control.
The implementation includes:
- Synthetic dataset generation
- Attempted integration with a real dataset (Hugging Face)
- A ControlNet-based training and inference pipeline
- Dataset validation and debugging tools
Problem Statement
The task is to predict bimanual manipulation targets in image space.
Inputs
- A 2×2 tiled image combining four camera views (front, top, left, right)
- A natural language instruction (e.g., “Lift the ball”)
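The 2×2 tiling of the four camera views can be sketched as follows. This is a minimal NumPy illustration, not the project's actual code: the function name `tile_views` and the layout (front top-left, top top-right, left bottom-left, right bottom-right) are assumptions.

```python
import numpy as np

def tile_views(front, top, left, right):
    """Tile four equally sized HxWxC views into one 2Hx2WxC image."""
    views = [front, top, left, right]
    if len({v.shape for v in views}) != 1:
        raise ValueError("all four views must share the same shape")
    top_row = np.concatenate([front, top], axis=1)      # front | top
    bottom_row = np.concatenate([left, right], axis=1)  # left  | right
    return np.concatenate([top_row, bottom_row], axis=0)

# Four dummy 256x256 RGB views, each filled with a distinct value
views = [np.full((256, 256, 3), i, dtype=np.uint8) for i in range(4)]
tiled = tile_views(*views)
print(tiled.shape)  # (512, 512, 3)
```

Tiling 256×256 views yields a 512×512 image, which matches the input resolution used in preprocessing.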
Outputs
- A heatmap image encoding:
  - Left gripper position (Gaussian blob)
  - Right gripper position
  - Gripper orientations (colored directional lines)
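To make the target format concrete, here is a rough sketch of how such a heatmap could be rendered. Everything specific in it is an assumption for illustration: the helper names, the blob width `sigma`, the line length, and the color choice (red line for the left gripper, blue for the right). Both blobs are drawn as shared grayscale intensity, with no per-gripper channel separation.

```python
import numpy as np

def gaussian_blob(h, w, cx, cy, sigma=8.0):
    """Return an HxW map with a unit-peak Gaussian centred at pixel (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def draw_line(img, cx, cy, angle, length, color):
    """Paint a line of `length` pixels starting at (cx, cy) along `angle` (radians)."""
    h, w, _ = img.shape
    for t in np.linspace(0.0, length, int(length) * 2 + 1):
        x = int(round(cx + t * np.cos(angle)))
        y = int(round(cy + t * np.sin(angle)))
        if 0 <= x < w and 0 <= y < h:
            img[y, x] = color

def render_heatmap(left, right, size=128):
    """left and right are (x, y, angle) tuples in pixel coordinates."""
    blobs = np.maximum(gaussian_blob(size, size, left[0], left[1]),
                       gaussian_blob(size, size, right[0], right[1]))
    img = np.repeat(blobs[..., None], 3, axis=2).astype(np.float32)
    draw_line(img, *left, length=20, color=(1.0, 0.0, 0.0))   # assumed: red = left
    draw_line(img, *right, length=20, color=(0.0, 0.0, 1.0))  # assumed: blue = right
    return img

heatmap = render_heatmap(left=(40, 64, 0.0), right=(90, 64, np.pi / 2))
print(heatmap.shape)  # (128, 128, 3)
```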
Constraints
- Limited dataset size (synthetic dataset ~100 samples)
- Inconsistent quality in real dataset
- No explicit supervision separating spatial components
- Use of pretrained diffusion backbone (Stable Diffusion + ControlNet)
Dataset
Synthetic Dataset
A custom dataset is generated programmatically:
- Scene includes workspace, object, and two grippers
- Four camera views are rendered and tiled
- Heatmaps encode position (Gaussian) and orientation (lines)
Each sample contains:
- Conditioning image
- Target heatmap
- Task instruction
- Gripper pose metadata
Real Dataset (Hugging Face)
An external dataset (asmitamohanty/bimanual) was explored:
- Multiple tasks available
- Data stored as ZIP archives
Observed issues:
- Corrupted image files
- Inconsistent directory structure
- Missing or unclear target formats
Only partial subsets were usable after validation.
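A first-pass filter for corrupted files inside a ZIP archive might look like the sketch below. It uses only the standard library and checks two things: that each member decompresses without a CRC error, and that it starts with the expected PNG or JPEG signature. It does not decode the image, so it catches only some kinds of corruption. The function name `scan_archive` and the member paths in the demo are hypothetical.

```python
import io
import zipfile
import zlib

# Leading bytes that identify PNG and JPEG files
MAGIC = {".png": b"\x89PNG\r\n\x1a\n", ".jpg": b"\xff\xd8\xff", ".jpeg": b"\xff\xd8\xff"}

def scan_archive(zip_source):
    """Split the image members of a ZIP into (usable, rejected) name lists."""
    usable, rejected = [], []
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            ext = "." + info.filename.rsplit(".", 1)[-1].lower()
            if info.is_dir() or ext not in MAGIC:
                continue  # not an image type this sketch knows how to check
            try:
                data = zf.read(info)  # raises on CRC or decompression errors
            except (zipfile.BadZipFile, zlib.error, OSError):
                rejected.append(info.filename)
                continue
            (usable if data.startswith(MAGIC[ext]) else rejected).append(info.filename)
    return usable, rejected

# Demo on an in-memory archive with one plausible and one broken image
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("task_a/front.png", MAGIC[".png"] + b"payload")
    zf.writestr("task_a/top.png", b"not an image")
    zf.writestr("task_a/readme.txt", b"notes")
buf.seek(0)
usable, rejected = scan_archive(buf)
print(usable, rejected)  # ['task_a/front.png'] ['task_a/top.png']
```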
[VISUALIZATION_PLACEHOLDER: dataset_overview]
Description: Example tiled conditioning images and corresponding heatmaps
Methodology
Data Preprocessing
Two preprocessing approaches were implemented.
Initial Pipeline
- Resize images to 512×512
- Convert to tensor
- Normalize to [-1, 1]
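The normalization step, and its inverse for visualization, can be illustrated with NumPy. This sketch leaves out the 512×512 resize to stay dependency-light, and the function names are assumptions rather than the project's own.

```python
import numpy as np

def to_model_range(image_u8):
    """Map uint8 HxWxC pixels in [0, 255] to float32 CxHxW values in [-1, 1]."""
    x = image_u8.astype(np.float32) / 127.5 - 1.0
    return np.transpose(x, (2, 0, 1))  # channels first, as a tensor layout

def to_uint8(x):
    """Invert to_model_range so a model input or output can be displayed."""
    img = np.transpose(x, (1, 2, 0))
    return np.clip(np.rint((img + 1.0) * 127.5), 0, 255).astype(np.uint8)

pixel = np.array([[[0, 128, 255]]], dtype=np.uint8)  # a single RGB pixel
normalized = to_model_range(pixel)
print(normalized.min(), normalized.max())  # -1.0 1.0
```

Applying the same mapping to conditioning images and heatmaps keeps both in the range the pretrained VAE expects.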
Improved Pipeline
- Separate transforms for:
  - Conditioning images
  - Heatmaps
  - Visualization
- Optional Canny edge detection for conditioning
- Dataset validation including:
  - Image integrity checks
  - Directory cleanup
  - Corruption detection
Not implemented or unclear:
- Data augmentation
- Spatial alignment validation
- Explicit heatmap channel decomposition
[VISUALIZATION_PLACEHOLDER: preprocessing_pipeline]
Description: Flow from raw data to model input
Model / Approach
The model is based on Stable Diffusion with ControlNet.
Components:
- VAE for latent encoding
- UNet for noise prediction
- ControlNet for conditioning on images
- CLIP text encoder for prompts
Key design decisions:
- Freeze base model components initially
- Train ControlNet
- Later experiments allow UNet fine-tuning
- Use FP32 initially, later introduce mixed precision
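The freeze-then-train decision can be expressed in a few lines of PyTorch. The sketch below uses tiny `nn.Linear` layers as stand-ins for the real VAE, text encoder, UNet, and ControlNet, and the helper `set_trainable` is hypothetical. Only the pattern is the point.

```python
import torch
from torch import nn

# Tiny stand-ins for the real modules (hypothetical, for illustration only)
vae, text_encoder, unet, controlnet = (nn.Linear(4, 4) for _ in range(4))

def set_trainable(module, trainable):
    """Enable or disable gradients for every parameter of a module."""
    module.requires_grad_(trainable)
    module.train(trainable)  # train mode if trainable, eval mode otherwise

for frozen in (vae, text_encoder, unet):
    set_trainable(frozen, False)
set_trainable(controlnet, True)

# Only parameters that still require gradients are handed to the optimizer
params = [p for m in (unet, controlnet) for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=1e-5)
print(len(params))  # 2 (weight and bias of the ControlNet stand-in)
```

For the later UNet fine-tuning experiments, calling `set_trainable(unet, True)` before building `params` would include the UNet weights as well.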
Training objective:
- Predict noise in latent space using MSE loss
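Written out, this is the standard noise-prediction objective for latent diffusion. The notation is an assumption introduced here: $z_0$ is the VAE latent of the target heatmap, $z_t$ its noised version at timestep $t$, $c_{\text{text}}$ the prompt embedding, and $c_{\text{img}}$ the tiled conditioning image fed to ControlNet.

```latex
\mathcal{L} = \mathbb{E}_{z_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}
\left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t, t, c_{\text{text}}, c_{\text{img}}\right) \right\rVert_2^2 \right],
\qquad
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```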
[VISUALIZATION_PLACEHOLDER: model_architecture]
Description: ControlNet diffusion pipeline with conditioning inputs
Training Procedure
Training loop:
- Encode heatmaps to latent space
- Add noise via scheduler
- Condition on image and text
- Predict noise
- Compute MSE loss
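The "add noise via scheduler" step is the forward diffusion process. The NumPy sketch below shows what a scheduler's `add_noise` computes, assuming a plain linear beta schedule with illustrative endpoints. The schedule actually configured for a given Stable Diffusion checkpoint may differ.

```python
import numpy as np

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(latents, noise, timesteps, alphas_cumprod):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[timesteps].reshape(-1, 1, 1, 1)  # broadcast over C, H, W
    return np.sqrt(a_bar) * latents + np.sqrt(1.0 - a_bar) * noise

rng = np.random.default_rng(0)
latents = rng.standard_normal((2, 4, 64, 64))  # stand-in for encoded heatmaps
noise = rng.standard_normal(latents.shape)
timesteps = rng.integers(0, 1000, size=2)      # one timestep per sample
noisy = add_noise(latents, noise, timesteps, make_alphas_cumprod())
print(noisy.shape)  # (2, 4, 64, 64)
```

The model is then asked to recover `noise` from `noisy`, the timestep, and the two conditioning signals.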
Configuration (varies across runs):
- Learning rate: ~1e-5
- Batch size: 1–2
- Steps: up to ~5000
- Gradient clipping enabled
- Cosine scheduler
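For reference, the cosine schedule and gradient clipping from this configuration reduce to the arithmetic below. This is a plain-Python sketch, not the training code: the absence of warmup, the `min_lr` of 0, and the clipping threshold of 1.0 are assumptions, since the write-up does not state them.

```python
import math

def cosine_lr(step, total_steps=5000, base_lr=1e-5, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]

for step in (0, 2500, 5000):
    print(step, cosine_lr(step))
print(clip_by_global_norm([3.0, 4.0]))  # norm 5 is scaled down to about 1
```

In PyTorch, the equivalent building blocks are `torch.optim.lr_scheduler.CosineAnnealingLR` and `torch.nn.utils.clip_grad_norm_`.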
Validation:
- Limited batches
- Early implementation contained incorrect validation loss
- Later version partially corrected
Checkpointing:
- Periodic saves
- Best model selected using training loss (not validation)
Experiments & Iterations
Several experiments were conducted:
Synthetic Data Training
- Verified end-to-end pipeline functionality
NaN Loss Fix
- Switched to FP32
- Added checks for invalid values
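A check for invalid values can be as simple as refusing to update on a non-finite loss. The sketch below works on scalar loss values, which in a PyTorch loop would come from `loss.item()`. The function names and the skip-the-step policy are assumptions about how such a guard might be wired in.

```python
import math

def is_valid_loss(loss_value):
    """True when the loss is a finite number (not NaN or +/-inf)."""
    return math.isfinite(loss_value)

def run_steps(losses):
    """Return (steps_applied, steps_skipped) for a sequence of scalar losses."""
    applied, skipped = 0, 0
    for loss_value in losses:
        if not is_valid_loss(loss_value):
            skipped += 1  # skip the backward pass and optimizer update
            continue
        applied += 1      # placeholder for loss.backward(); optimizer.step()
    return applied, skipped

print(run_steps([0.12, float("nan"), 0.10, float("inf")]))  # (2, 2)
```

Counting skipped steps is useful in itself: a rising count signals a data or precision problem rather than an isolated bad batch.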
Dataset Debugging
- Multiple extraction strategies
- File validation and corruption analysis
Canny Edge Conditioning
- Tested edge maps as conditioning input
UNet Fine-Tuning
- Allowed training of UNet alongside ControlNet
Improved Training Pipeline
- Added mixed precision
- Resume capability
- Visualization during validation
[VISUALIZATION_PLACEHOLDER: experiment_comparisons]
Description: Qualitative or loss comparisons across configurations
Results
Implemented metrics:
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- PSNR
- SSIM
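Three of these metrics are short enough to sketch directly in NumPy. The version below assumes predictions and targets are float arrays scaled to [0, 1] (hence `data_range=1.0`). SSIM is omitted because it needs windowed statistics and is usually taken from an image-processing library.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all pixels."""
    return float(np.mean((pred - target) ** 2))

def mae(pred, target):
    """Mean absolute error over all pixels."""
    return float(np.mean(np.abs(pred - target)))

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(pred, target)
    if err == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))

pred = np.zeros((4, 4))
target = np.full((4, 4), 0.1)
print(mse(pred, target), mae(pred, target), psnr(pred, target))
```

A uniform error of 0.1 gives an MSE of about 0.01 and a PSNR of about 20 dB, which is a handy sanity check when wiring the metrics into validation.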
However:
- No consolidated results table is available
- No systematic comparison across experiments
- Metrics are computed but not consistently analyzed
Observations:
- Visual outputs show partial alignment between prediction and ground truth
- Reliability is unclear due to dataset and validation issues
[VISUALIZATION_PLACEHOLDER: results_metrics]
Description: Metric summaries or prediction comparisons
Key Insights
- ControlNet can learn structured outputs from image inputs
- Dataset quality strongly impacts training stability
- Text conditioning is ineffective due to lack of prompt diversity
- Synthetic data is useful for debugging but limited for generalization
Limitations
- Training and validation datasets are not properly separated
- Real dataset contains corrupted samples
- Validation procedure was incorrect in early implementation
- No data augmentation
- Weak supervision for structured outputs
- Limited reproducibility controls
- Multiple redundant pipeline implementations
Future Work
- Standardize dataset and remove corrupted samples
- Implement proper train/validation/test splits
- Add augmentation strategies
- Improve supervision (separate channels, structured losses)
- Perform controlled experiments
- Improve evaluation and logging
- Extend pipeline to action generation
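As a starting point for the split item above, a seeded index split keeps the three subsets disjoint and reproducible across runs. The fractions, the seed, and the function name `split_indices` are illustrative assumptions.

```python
import random

def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle range(n) with a fixed seed and cut it into train/val/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # local RNG, leaves global state untouched
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    return idx[n_val + n_test:], idx[:n_val], idx[n_val:n_val + n_test]

train, val, test = split_indices(100)
print(len(train), len(val), len(test))  # 80 10 10
```

Saving the resulting index lists alongside each checkpoint would make it possible to confirm later that validation samples never entered training.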
Conclusion
This project establishes a working pipeline for using ControlNet in a bimanual manipulation setting, focusing on predicting structured heatmaps from visual input.
While core components are implemented, the system remains a prototype. Issues in dataset quality, validation design, and experimental rigor limit its reliability.
The work provides a foundation for further development toward a more robust and research-grade system.
Appendix
Example Training Step
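import torch
import torch.nn.functional as F

# Assumes diffusers-style vae, unet, controlnet, and scheduler objects.
# Encode target heatmaps and apply the VAE latent scaling factor
latents = vae.encode(target_images).latent_dist.sample() * vae.config.scaling_factor
noise = torch.randn_like(latents)
# Draw one random timestep per sample; T is the scheduler's training step count
timesteps = torch.randint(0, T, (latents.shape[0],), device=latents.device)
noisy_latents = scheduler.add_noise(latents, noise, timesteps)
# ControlNet maps the conditioning image to residuals that are fed into the UNet
down_res, mid_res = controlnet(
    noisy_latents, timesteps, encoder_hidden_states=text_embeddings,
    controlnet_cond=conditioning_images, return_dict=False,
)
model_pred = unet(
    noisy_latents, timesteps, encoder_hidden_states=text_embeddings,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample
loss = F.mse_loss(model_pred, noise)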