Processing real-time voice data is an engineering minefield of latency, accents, and interruptions. This session explores the architecture of a Real-Time Voice Intelligence Pipeline deployed in a high-volume contact center.
We will move beyond simple transcription to discuss Structured Intent Extraction. I will show you how to design:

1. Voice Capture Pipeline: The entry point for clean, multi-channel data acquisition.
2. Speech-to-Text (STT) Engine: Converting audio into accurate transcripts.
3. Generative AI Core Structure: Using rigorous system prompts to force the LLM to separate "Customer Intent" from "Operator Chit-Chat" and output valid JSON, even from garbled transcripts.
4. Customer Data Sync: Translating AI insights into enterprise system actions.
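
Stage 3 only pays off if the model's reply can be trusted downstream. The sketch below is a hypothetical guard, not the speaker's implementation: it assumes an invented schema (`customer_intent`, `operator_chit_chat`, `evidence`), extracts the JSON object from a possibly fenced reply, checks field types, and applies a crude hallucination check that every quoted piece of evidence appears verbatim in the transcript. Raising `ValueError` lets the caller re-prompt.

```python
import json

# Hypothetical output contract: the talk does not publish its actual JSON schema.
REQUIRED_FIELDS = {"customer_intent": str, "operator_chit_chat": list, "evidence": list}

def parse_intent_json(raw: str, transcript: str) -> dict:
    """Parse an LLM reply into a dict, raising ValueError so the caller can re-prompt."""
    # Models sometimes wrap JSON in markdown fences or prose; keep only the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    # Crude hallucination check: every evidence quote must appear verbatim in the transcript.
    haystack = transcript.lower()
    for quote in data["evidence"]:
        if not isinstance(quote, str) or quote.lower() not in haystack:
            raise ValueError(f"evidence not grounded in transcript: {quote!r}")
    return data
```

Requiring verbatim evidence is deliberately strict; with garbled transcripts a fuzzy match may be more practical, at the cost of letting some fabricated quotes through.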

We reduced post-call work by 50% by shifting compute from "batch" to "stream."
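
Using the deployed after-call work figures reported for this talk (6.3 minutes per call before, 3.1 minutes after), the 50% figure is a rounded relative reduction:

```latex
\text{reduction} = \frac{6.3 - 3.1}{6.3} = \frac{3.2}{6.3} \approx 0.508 \approx 50.8\%
```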

Speaker: Dippu Kumar Singh - Leader of Emerging Technologies (Apps), Fujitsu North America Inc.

Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. As Leader for Emerging Data & Analytics at Fujitsu North America, he specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million-dollar sales pipelines and delivered remarkable savings through AI-driven optimization in transportation, manufacturing, utilities, and supply-chain logistics.

Socials:
https://www.linkedin.com/in/dippukumarsingh/

Slides:
https://docs.google.com/presentation/d/1f2y1s64irhdDNTRgK6bWrBtOgMWlhQYM/edit?usp=sharing&ouid=107532212133041789455&rtpof=true&sd=true

AI Engineer

A Fujitsu North America AI architect presents a four-stage pipeline for extracting structured business intelligence from contact center audio streams in real time. The pipeline covers voice capture with stereo channel separation and PII masking, speech-to-text transcription requiring 90%+ accuracy with domain-specific dictionaries, a generative AI core using few-shot prompting and hallucination checks to produce structured JSON summaries, and a CRM sync layer with human verification. Deployed results show after-call work time cut from 6.3 to 3.1 minutes (~50% reduction) across 500-seat contact centers. Current challenges include STT accuracy for heavy accents, LLM token costs on long transcripts, and PII compliance overhead. The roadmap includes explainable AI for operator coaching, predictive staffing from intent data, and real-time abuse detection to protect agent mental health.
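
The capture-stage ideas above, stereo channel separation and PII masking, can be sketched with the standard library alone. This is an illustrative sketch, not the pipeline described in the talk: it assumes little-endian 16-bit interleaved stereo PCM (as in standard WAV files), and which channel carries the agent versus the customer depends on the recording platform. The regex patterns are hypothetical and text-level only; audio-level redaction and production-grade detection (Luhn checks, NER models) are out of scope.

```python
import re
import sys
from array import array

def split_stereo_pcm16(frames: bytes) -> tuple[array, array]:
    """Split interleaved little-endian 16-bit stereo PCM into (left, right) mono arrays."""
    if len(frames) % 4:
        raise ValueError("stereo PCM16 data must be a multiple of 4 bytes")
    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian
    # Even indices hold the left channel, odd indices the right channel.
    return samples[0::2], samples[1::2]

# Illustrative patterns only; order matters because card numbers are masked before phone numbers.
PII_PATTERNS = [
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),          # 13-16 digit card numbers
    (re.compile(r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b"), "[PHONE]"),  # North American phone numbers
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),      # email addresses
]

def mask_pii(text: str) -> str:
    """Replace matched PII spans in a transcript with placeholder tokens."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

For example, `mask_pii("Card 4111 1111 1111 1111, call me at 555-123-4567.")` returns `"Card [CARD], call me at [PHONE]."`. Keeping each speaker on its own channel means the STT and LLM stages never have to guess who said what, which is what makes separating customer intent from operator chit-chat tractable.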

VoiceOps-fying Low-Latency Intelligence Extraction from Messy Audio Streams — Dippu Kumar Singh