Step 3.7 Flash, a 198B-parameter Mixture-of-Experts vision-language model from StepFun, is now available on NVIDIA-accelerated infrastructure. With ~11B active parameters per forward pass, native image/video input, three reasoning levels, and a 256k context window, it targets enterprise use cases like financial analysis and concurrent coding agents. Developers can deploy it via SGLang, TensorRT-LLM, or vLLM, or use NVIDIA NIM containerized microservices for production. An NVFP4-quantized checkpoint is available on Hugging Face. Fine-tuning is supported through NVIDIA NeMo Automodel (SFT and LoRA at 600 tokens/sec on Hopper GPUs) and NeMo Megatron-Bridge for large-scale training. NVIDIA DGX Station is highlighted for local development with 748 GB coherent memory.
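
The parameter and memory figures above can be put in rough perspective with a little arithmetic. The sketch below is a back-of-the-envelope estimate only. It assumes every weight is stored at a uniform bit width, and it ignores quantization scale factors, any layers kept at higher precision, the KV cache, and activations. The helper name `weight_memory_gb` is hypothetical and not part of any NVIDIA or StepFun tooling.

```python
# Rough sizing from the figures quoted in the summary (all values approximate).
TOTAL_PARAMS = 198e9          # total parameters in the MoE model
ACTIVE_PARAMS = 11e9          # ~parameters active per forward pass
DGX_STATION_MEMORY_GB = 748   # coherent memory quoted for DGX Station


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Estimate weight storage in GB, assuming a uniform bit width (assumption)."""
    return num_params * bits_per_param / 8 / 1e9


# Share of the model that participates in a single forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Weight footprint at 16-bit precision versus an idealized 4-bit format.
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
fp4_gb = weight_memory_gb(TOTAL_PARAMS, 4)

print(f"active fraction: {active_fraction:.1%}")
print(f"16-bit weights:  ~{bf16_gb:.0f} GB")
print(f"4-bit weights:   ~{fp4_gb:.0f} GB")
print(f"16-bit weights fit in {DGX_STATION_MEMORY_GB} GB: {bf16_gb < DGX_STATION_MEMORY_GB}")
```

Under these simplifying assumptions, only a small share of the weights is exercised per token, and a 4-bit checkpoint needs roughly a quarter of the weight storage of a 16-bit one. Real footprints will be higher once scale factors and runtime buffers are included.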

4m read time · From developer.nvidia.com
Table of contents
Build with NVIDIA endpoints
Production-ready deployment with NVIDIA NIM
Day 0 fine-tuning with NVIDIA NeMo Framework
