Simulator-in-the-loop optimization offers a promising inference-time mechanism for robot manipulation. It uses a physical simulator as a backend rollout engine to evaluate candidate trajectories in parallel and refine nominal actions online, a paradigm proven effective in rigid-body manipulation where state and contact are relatively tractable. We bring this paradigm to real-world cloth manipulation from a single RGB input through three pillars.
(i) We design a scalable synthetic-data generation and inference-time rollout pipeline built on FLASH, a deformable-object simulator that provides a practical balance among physical fidelity, numerical stability, and rollout efficiency. (ii) We develop a real-to-sim module, trained purely on synthetic data, that maps a single RGB observation to simulation-compatible cloth state by fusing pretrained visual features with learnable canonical tokens. (iii) We perform online planning by coupling a sparse-mesh rollout backend with prior-guided MPPI, anchored at an offline-distilled policy trajectory, preserving manipulation-relevant deformation and contact while enabling sufficient parallel rollout batches. Real-robot experiments show higher success rates and stronger robustness than baseline methods.
Overview of the proposed framework. Offline, we use FLASH to generate synthetic data for real-to-sim training. Online, RGB observations initialize physics rollouts that refine the prior policy through MPPI for closed-loop hardware execution.
Self-collision is disabled during MPPI rollouts for sampling efficiency.
We build a unified simulator-in-the-loop framework for real-world cloth manipulation. Offline, FLASH generates scalable synthetic cloth-deformation data to train an RGB-native real-to-sim module. Online, a single RGB observation is lifted into a sparse simulation-compatible cloth state that synchronizes a sparse-mesh rollout backend; a prior-guided MPPI controller then refines the nominal policy trajectory through parallel physics rollouts and applies the optimized action in a receding-horizon loop on hardware.
FLASH: a deformable-object simulator that balances physical fidelity, numerical stability, and rollout efficiency. It represents cloth as a triangular mesh, resolves frictional contact through non-smooth Newton iterations, and reduces computation to GPU-friendly sparse operations, enabling stable parallel rollouts for online control.
Real-to-sim module: a cloth state estimator trained purely on synthetic data that maps a single RGB image to a simulation-compatible cloth state. It fuses frozen DINOv2 visual features with learnable canonical tokens and uses an MLP decoder to predict the deformed 3D vertices of the garment mesh.
Prior-guided MPPI: an online planner that couples a sparse-mesh rollout backend with MPPI, anchored at an offline-distilled policy trajectory. It preserves manipulation-relevant deformation and contact while keeping enough parallel rollout batches within a real-time control budget.
FLASH is a deformable-object simulator: it models cloth as a triangular mesh and resolves frictional contact with non-smooth Newton iterations, reducing the heavy computation to GPU-friendly sparse operations for stable, repeatable rollouts. To compare it against Newton and Isaac Sim, we plug each simulator in as the rollout engine of the same vanilla MPPI controller. We hold the cost, initial state, seed, and cloth asset fixed and tune every simulator's physical parameters as carefully as we could, so that the observed differences are attributable to the simulator backend under the shared MPPI setup. Under this matched comparison, FLASH gives the best accuracy–fidelity–speed trade-off across all tested K, where K is the number of parallel rollout environments (candidate action sequences evaluated per MPPI step): 95–100% success, with keypoint error roughly 3 to 7 times lower and MSE roughly 6 to 24 times lower than the other backends at the same K. The videos below show these vanilla-MPPI diagonal-fold rollouts under the same controller, cost, initial state, seed, K, and cloth asset; only the simulator backend differs.
| Simulator | K | Step [ms] ↓ | MPPI [s] ↓ | KP err. [cm] ↓ | MSE [cm²] ↓ | CD [cm] ↓ | EMD [cm] ↓ | SR ↑ |
|---|---|---|---|---|---|---|---|---|
| Newton | 64 | 41.0 | 0.85 | 8.40 ± 8.72 | 56.8 ± 73.1 | 0.89 ± 0.71 | 1.77 ± 1.52 | 40% |
| | 128 | 55.1 | 1.14 | 10.28 ± 11.08 | 68.9 ± 93.7 | 0.73 ± 0.40 | 1.52 ± 1.22 | 40% |
| | 256 | 104.3 | 2.15 | 8.85 ± 10.71 | 58.1 ± 89.6 | 0.77 ± 0.50 | 1.56 ± 1.36 | 60% |
| | 512 | 232.8 | 4.80 | 13.02 ± 13.06 | 92.3 ± 105.3 | 0.91 ± 0.62 | 1.97 ± 1.60 | 35% |
| Isaac Sim | 64 | 11.7 | 0.26 | 9.78 ± 10.74 | 60.5 ± 104.4 | 0.63 ± 0.36 | 1.47 ± 1.07 | 55% |
| | 128 | 22.2 | 0.50 | 10.88 ± 11.13 | 67.4 ± 108.0 | 0.66 ± 0.33 | 1.56 ± 1.03 | 45% |
| | 256 | 41.1 | 0.92 | 7.09 ± 5.60 | 32.4 ± 47.6 | 0.55 ± 0.15 | 1.16 ± 0.55 | 55% |
| | 512 | 80.7 | 1.79 | 5.84 ± 4.73 | 24.6 ± 32.8 | 0.54 ± 0.11 | 1.18 ± 0.41 | 60% |
| FLASH | 64 | 19.4 | 0.40 | 1.61 ± 1.18 | 3.7 ± 4.3 | 0.53 ± 0.42 | 0.92 ± 1.22 | 100% |
| | 128 | 36.0 | 0.75 | 1.54 ± 0.88 | 3.1 ± 1.9 | 0.44 ± 0.08 | 0.66 ± 0.27 | 100% |
| | 256 | 68.9 | 1.42 | 1.95 ± 1.82 | 5.5 ± 11.0 | 0.56 ± 0.34 | 1.02 ± 1.15 | 95% |
| | 512 | 137.3 | 2.84 | 2.08 ± 1.93 | 3.9 ± 3.2 | 0.58 ± 0.59 | 1.02 ± 1.52 | 95% |
Simulator backend evaluation under the same cloth-manipulation task and MPPI controller. K = number of parallel rollout environments (the candidate action sequences MPPI samples and evaluates in parallel at each step). Step / MPPI: wall-clock time per dynamics step / per rollout batch; KP err.: keypoint alignment error; MSE: mean squared error; CD: Chamfer Distance; EMD: Earth Mover's Distance; SR: success rate. Bold marks the best per column.
Each clip is a live vanilla-MPPI diagonal fold with the simulator used as the rollout engine inside the MPPI loop. The controller, cost, initial state, seed, K, and cloth asset are identical across the three columns — only the simulator backend changes, so the observed behaviors reflect backend-specific rollout and control dynamics. KP err. = settled pick-to-target corner distance (lower is better).
FLASH — KP err. 1.27 cm
Isaac Sim — KP err. 5.58 cm
Newton — KP err. 18.61 cm
FLASH — KP err. 1.49 cm
Isaac Sim — KP err. 15.56 cm
Newton — KP err. 30.85 cm
FLASH — KP err. 0.80 cm
Isaac Sim — KP err. 19.56 cm
Newton — KP err. 31.75 cm
Clarification. Simulator parameters are selected using a predefined folding trajectory, while the clips show closed-loop MPPI rollouts with those parameters. Newton's fly-away occurs after MPPI control terminates, suggesting post-control simulation instability rather than MPPI intentionally selecting a fly-away action. Isaac Sim mainly fails on pin tracking: under the real-time per-step iteration budget its constraint solver does not converge enough for the controlled vertices to track their kinematic targets, so MPPI reaches its step limit before the cloth reaches the goal.
The real-to-sim module recovers a simulation-compatible cloth state from a single RGB image. A frozen DINOv2 encoder extracts dense patch features; N learnable canonical tokens (one per mesh vertex) aggregate visual evidence and exchange structural context through self-attention, and a shared decoder maps each token to its deformed 3D vertex position. Trained purely on synthetic RGB data with image- and latent-level augmentation (including random grasping-point masking), it bridges the sim-to-real visual gap without any real-world annotation.
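The token-based design described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that the canonical tokens gather visual evidence through single-head cross-attention over the patch features, it omits learned query/key/value projections and normalization layers, and it uses random arrays in place of DINOv2 features and trained weights. The names `attend`, `predict_vertices`, `w_dec1`, and `w_dec2` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def predict_vertices(patch_feats, tokens, w_dec1, w_dec2):
    """Map image patch features to one 3D position per mesh vertex.

    patch_feats: (P, C) features from a frozen encoder (random stand-in here)
    tokens:      (N, C) learnable canonical tokens, one per mesh vertex
    w_dec1/2:    weights of a two-layer MLP decoder shared by all tokens
    """
    x = tokens + attend(tokens, patch_feats, patch_feats)  # assumed cross-attention to patches
    x = x + attend(x, x, x)                                # self-attention among tokens
    hidden = np.maximum(x @ w_dec1, 0.0)                   # shared decoder, ReLU hidden layer
    return hidden @ w_dec2                                 # (N, 3) vertex positions

rng = np.random.default_rng(0)
P, C, N = 256, 64, 100  # illustrative sizes, not taken from the paper
verts = predict_vertices(
    rng.normal(size=(P, C)),
    rng.normal(size=(N, C)),
    rng.normal(size=(C, 128)) * 0.1,
    rng.normal(size=(128, 3)) * 0.1,
)
print(verts.shape)  # (100, 3)
```

Because the decoder is shared and each token is tied to one mesh vertex, the output keeps a fixed vertex ordering, which is what lets the prediction be loaded directly as a simulation state.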
Against a point-cloud diffusion estimator (DPM) and a depth-based variant (DeFM), the RGB-native module attains the lowest reconstruction error on every metric and the fastest inference: about 7–12 ms per frame, versus 13–18 ms for DeFM and roughly 320–510 ms for the diffusion baseline.
| Asset | Real-to-Sim | MSE [mm²] ↓ | CD [mm] ↓ | EMD [mm] ↓ | Latency [ms] ↓ |
|---|---|---|---|---|---|
| Towel | Ours (RGB) | 2.33 ± 1.21 | 1.94 ± 0.41 | 1.98 ± 0.45 | 7.4 ± 0.6 |
| | DPM | 7.17 ± 2.22 | 3.89 ± 0.67 | 4.01 ± 0.72 | 318.5 ± 3.7 |
| | DeFM | 43.36 ± 34.66 | 5.51 ± 0.99 | 6.53 ± 1.52 | 12.6 ± 0.5 |
| Long-sleeve Shirt | Ours (RGB) | 24.23 ± 11.49 | 3.65 ± 0.78 | 4.67 ± 1.24 | 12.4 ± 0.4 |
| | DPM | 45.74 ± 14.19 | 5.93 ± 0.27 | 8.21 ± 0.65 | 512.9 ± 6.0 |
| | DeFM | 43.52 ± 22.02 | 4.05 ± 0.38 | 5.93 ± 0.91 | 18.0 ± 0.6 |
State-estimation comparison: per-vertex MSE, Chamfer Distance, Earth Mover's Distance, and single-frame inference latency on the validation split. Best per column in bold.
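For readers unfamiliar with the metrics in this caption, here is a small sketch of two of them. The exact conventions used in the evaluation are not stated on this page, so the definitions below are assumptions: per-vertex MSE is taken as the mean squared Euclidean distance between corresponding vertices, and Chamfer Distance as the symmetric mean nearest-neighbor distance. Earth Mover's Distance is omitted because it needs an assignment solver.

```python
import numpy as np

def per_vertex_mse(pred, target):
    """Mean over vertices of the squared distance between corresponding vertices."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # (Na, Nb) pairwise distances
    return float(0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean()))

# Toy check: a flat 4 x 4 grid with 10 mm spacing, predicted with a 1 mm offset in x.
xs, ys = np.meshgrid(np.arange(4) * 10.0, np.arange(4) * 10.0)
target = np.stack([xs.ravel(), ys.ravel(), np.zeros(16)], axis=1)
pred = target + np.array([1.0, 0.0, 0.0])

mse = per_vertex_mse(pred, target)   # 1.0 (mm^2)
cd = chamfer_distance(pred, target)  # 1.0 (mm)
print(mse, cd)
```

MSE relies on vertex correspondence, while Chamfer Distance compares the two point sets without it, so the two can disagree when a prediction has the right shape but the wrong vertex assignment.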
Real-world real-to-sim reconstruction on (a) towel and (b) long-sleeve shirt: from top to bottom, the real observation, predicted vertices overlaid on the image, and the reconstructed cloth.
A base policy provides a nominal action sequence as a task-level prior; at each step we sample K candidates around it, roll them out in FLASH, and let MPPI reweight them into a refined sequence, executing only the first action in a receding-horizon loop. On real hardware the full pipeline succeeds in 9/10 single-arm diagonal folds and 8/10 dual-arm symmetric folds, far above base-policy execution (3/10 and 1/10). Replacing any single component degrades the whole loop: swapping in DPM for state estimation or Isaac Sim as the rollout backend lowers success to between 3/10 and 6/10.
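The sampling-and-reweighting step described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' controller: `toy_rollout` is a hypothetical stand-in for a batched FLASH rollout (actions simply act as displacements), the cost is a terminal squared distance to a goal, and the hyperparameters `sigma` and `lam` are arbitrary. The sketch also warm-starts each step from the shifted previous plan, whereas the real system may re-query the base policy.

```python
import numpy as np

def toy_rollout(state, actions):
    """Hypothetical stand-in for a batched physics rollout.

    state: (D,), actions: (K, H, D); returns the (K, D) final state per candidate.
    """
    return state + actions.sum(axis=1)

def mppi_step(state, nominal, goal, rng, K=256, sigma=0.02, lam=0.01):
    """One MPPI update around a nominal (H, D) action sequence."""
    H, D = nominal.shape
    noise = rng.normal(scale=sigma, size=(K, H, D))
    candidates = nominal[None] + noise              # K sequences sampled around the prior
    finals = toy_rollout(state, candidates)         # evaluate all candidates in one batch
    costs = np.sum((finals - goal) ** 2, axis=1)    # terminal cost per candidate
    weights = np.exp(-(costs - costs.min()) / lam)  # exponentiated negative cost
    weights /= weights.sum()
    return nominal + np.einsum("k,khd->hd", weights, noise)

def receding_horizon(state, nominal, goal, steps, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        refined = mppi_step(state, nominal, goal, rng)
        state = state + refined[0]  # execute only the first action (toy plant)
        # Shift the plan forward by one step and pad with a zero action.
        nominal = np.vstack([refined[1:], np.zeros((1, nominal.shape[1]))])
    return state

goal = np.array([0.3, -0.2])
prior = np.tile(0.1 * goal, (5, 1))  # a prior that deliberately undershoots the goal
final = receding_horizon(np.zeros(2), prior, goal, steps=10)
print(np.linalg.norm(final - goal))
```

Sampling around the prior rather than around zero keeps most of the K candidates near a plausible fold, so the rollout budget is spent on refinement instead of exploration.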
| Task | Real-to-Sim | Backend | Controller | Real SR ↑ |
|---|---|---|---|---|
| Single-arm diagonal | Ours (RGB) | FLASH | MPPI | 9/10 |
| | DPM | FLASH | MPPI | 6/10 |
| | Ours (RGB) | Isaac Sim | MPPI | 5/10 |
| | Ours (RGB) | FLASH | Base-policy | 3/10 |
| Dual-arm symmetric | Ours (RGB) | FLASH | MPPI | 8/10 |
| | DPM | FLASH | MPPI | 3/10 |
| | Ours (RGB) | Isaac Sim | MPPI | 4/10 |
| | Ours (RGB) | FLASH | Base-policy | 1/10 |
Real-world pipeline-variant evaluation: for each task, the first row is the full pipeline and each following row replaces one of its components. Success = all target corner pairs aligned within 2 cm.
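The success criterion in this caption is simple enough to write down directly. The sketch below assumes corner positions are given in metres and that the corner pairs for each fold are known; the function names, the corner indices, and the example coordinates are hypothetical.

```python
import numpy as np

def keypoint_errors(corners, pairs):
    """Distance between each (pick, target) corner pair; corners is (M, 3) in metres."""
    return np.array([np.linalg.norm(corners[i] - corners[j]) for i, j in pairs])

def is_success(corners, pairs, tol=0.02):
    """True when every target corner pair is aligned within tol (2 cm by default)."""
    return bool(np.all(keypoint_errors(corners, pairs) <= tol))

# Toy settled state after a diagonal fold: corner 0 was placed near corner 3.
corners = np.array([
    [0.00, 0.00, 0.01],
    [0.30, 0.00, 0.00],
    [0.00, 0.30, 0.00],
    [0.01, 0.00, 0.00],
])
print(is_success(corners, [(0, 3)]))  # True: the pair is about 1.4 cm apart
print(is_success(corners, [(0, 1)]))  # False: the pair is about 30 cm apart
```

Requiring every pair to pass means a dual-arm fold with one well-aligned corner and one missed corner still counts as a failure.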
Real-world cloth folding across diverse garments and initial configurations, executed by the closed-loop simulator-in-the-loop controller.
@misc{liu2026silr,
  title  = {Enabling Robust Cloth Manipulation via Inference-Time Simulator-in-the-Loop Refinement},
  author = {Liu, Xin and Li, Yulin and Li, Ziming and Jing, Pengyu and Huang, Zhenhao and Zhou, Bingyang and Zeng, Ziqiu and Luo, Siyuan and Qi, Chenkun and Shi, Fan},
  year   = {2026},
}