Step-by-step guide to running a Voice-Language-Action (VLA) demo using Google's Gemma 4 multimodal model on an NVIDIA Jetson Orin Nano Super (8 GB). The pipeline chains Parakeet STT for speech recognition, Gemma 4 via llama.cpp for reasoning and optional webcam vision, and Kokoro TTS for audio output. The model autonomously decides when to capture a webcam frame based on the user's question, with no hardcoded keyword triggers. The tutorial covers building llama.cpp natively with CUDA for Jetson, downloading quantized GGUF model files, configuring audio/webcam devices, memory management tips for the constrained 8 GB board, and running the demo script.
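To make the pipeline concrete before diving into the steps, here is a minimal sketch of the "decide whether to look, then answer" loop described above. It assumes a llama.cpp server exposing the OpenAI-compatible chat API on `localhost:8080` (started with multimodal support), a webcam at index 0 readable through OpenCV, and leaves out the Parakeet/Kokoro audio ends; the endpoint, port, and prompts are illustrative assumptions, not the exact demo code.

```python
# Sketch only: the model itself decides whether it needs a camera frame,
# mirroring the "no hardcoded keyword triggers" behaviour described above.
import base64
import cv2          # pip install opencv-python
import requests

LLAMA_URL = "http://localhost:8080/v1/chat/completions"  # assumed llama.cpp server address

def capture_frame_b64(camera_index: int = 0) -> str:
    """Grab one webcam frame and return it as a base64-encoded JPEG."""
    cap = cv2.VideoCapture(camera_index)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("could not read from webcam")
    _, jpeg = cv2.imencode(".jpg", frame)
    return base64.b64encode(jpeg.tobytes()).decode()

def chat(messages: list, max_tokens: int = 256) -> str:
    """Send a chat request to the local llama.cpp server and return the reply text."""
    resp = requests.post(LLAMA_URL, json={"messages": messages, "max_tokens": max_tokens}, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def answer(question: str) -> str:
    # Step 1: let the model decide whether it needs to see through the webcam.
    decision = chat([{
        "role": "user",
        "content": f"Question: {question}\nDo you need a webcam image to answer? Reply YES or NO only.",
    }], max_tokens=4)

    content = [{"type": "text", "text": question}]
    if "YES" in decision.upper():
        # Step 2 (optional): attach a freshly captured frame as an image_url part.
        img = capture_frame_b64()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img}"}})

    # Step 3: get the final answer; in the full demo this text is spoken by Kokoro TTS.
    return chat([{"role": "user", "content": content}])

if __name__ == "__main__":
    print(answer("What color is the mug on my desk?"))
```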
Table of contents
- Get the code
- Hardware
- Step 1: System packages
- Step 2: Python environment
- Step 3: Free up RAM (optional but recommended)
- Step 4: Serve Gemma 4
- Step 5: Find your mic, speaker, and webcam
- Step 6: Run the demo
- How it works
- Troubleshooting
- Environment variables
- Bonus: just want to try Gemma 4 in text mode?