NXP shares hands-on best practices for deploying Vision-Language-Action (VLA) models on embedded robotic platforms, specifically the i.MX95. The guide covers three areas: dataset recording (camera setup, lighting, diversity, recovery episodes), fine-tuning ACT and SmolVLA policies, and hardware optimization (model decomposition, quantization strategies, and asynchronous inference scheduling). Two findings stand out: quantizing the action expert's denoising flow degrades accuracy significantly, whereas quantizing the vision encoder and the LLM prefill has limited impact. On the i.MX95, the optimized ACT policy runs inference in 0.32 s, down from 2.86 s at FP32, with 89% global accuracy. Next steps include NPU optimization, simulation environments, RL-based policy refinement, and sim-to-real transfer.
Table of contents
🎥 Dataset Recording: What Actually Matters
🎛️ Fine‑Tuning VLAs
⚡ Optimizing for NXP i.MX95
📊 What We Achieve on i.MX95
⏩ Next Steps
✅ Checklists You Can Reuse
📚 Resources & Inspiration
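
To make the latency figures in the summary concrete, here is a minimal back-of-the-envelope sketch in Python. It computes the speedup implied by the two reported latencies (2.86 s at FP32, 0.32 s optimized) and estimates how many control steps elapse while one inference call is running, which is roughly the number of already-predicted actions an asynchronous scheduler would need to have queued. The 30 Hz control rate and both function names are illustrative assumptions, not values or APIs from the guide.

```python
import math


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_s / optimized_s


def steps_during_inference(latency_s: float, control_hz: float) -> int:
    """Control steps that elapse while one inference call runs.

    Rounds up, since a partially elapsed step still needs an action.
    The inner round() guards against floating-point noise before ceil().
    """
    return math.ceil(round(latency_s * control_hz, 9))


if __name__ == "__main__":
    FP32_LATENCY_S = 2.86       # reported in the summary
    OPTIMIZED_LATENCY_S = 0.32  # reported in the summary
    CONTROL_HZ = 30.0           # ASSUMPTION: hypothetical control rate

    print(f"speedup: {speedup(FP32_LATENCY_S, OPTIMIZED_LATENCY_S):.2f}x")
    print("steps to cover (FP32):",
          steps_during_inference(FP32_LATENCY_S, CONTROL_HZ))
    print("steps to cover (optimized):",
          steps_during_inference(OPTIMIZED_LATENCY_S, CONTROL_HZ))
```

Under the assumed 30 Hz loop, the optimized model needs about 10 queued actions to bridge one inference call, against 86 for the FP32 baseline; a different control rate scales both numbers proportionally.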