
As more AI systems move into production, ensuring models are responsible, ethical, and sustainable becomes critical. This talk explores how observability can help teams build AI systems they can actually trust.

For the last few years, I’ve been driven by the question: what responsibility do we have for the software we put into the world? That question originally led me to focus on reducing the environmental impact of software. Now, as teams deploy inference-heavy LLM workloads into production, it’s become even more pressing.

Using LLMs at scale raises serious questions around energy usage, cost, and ethics. Gartner predicts that more than 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls. How much of that comes down to a lack of visibility into how these systems behave once they’re live?

So how do you build trust in AI systems? How do you detect drift toward bias or toxicity? How do you understand the real resource cost and carbon footprint of inference at scale? And once you do have visibility, how do you turn it into meaningful guardrails?

In this talk, we’ll look at:
- Why traditional monitoring falls short for AI
- Which signals actually matter for responsible systems
- How to instrument AI workloads with OpenTelemetry
- How to use telemetry data to implement guardrails that improve trust, sustainability, and reliability in production AI
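
As a rough illustration of the last point, here is a minimal sketch of a guardrail driven by recent telemetry. It is not taken from the talk: the class name `TelemetryGuardrail`, the window size, and both thresholds are hypothetical, and the per-request token counts and "flagged" verdicts are assumed to come from whatever instrumentation and evaluators you already run.

```python
from collections import deque
from typing import Tuple


class TelemetryGuardrail:
    """Blocks new requests when recent telemetry breaches a threshold.

    Hypothetical sketch: keeps a sliding window of (tokens, flagged) events.
    """

    def __init__(self, window: int = 100, max_flagged_rate: float = 0.05,
                 max_tokens: int = 50_000) -> None:
        self.events = deque(maxlen=window)  # oldest events fall out automatically
        self.max_flagged_rate = max_flagged_rate
        self.max_tokens = max_tokens

    def record(self, tokens: int, flagged: bool) -> None:
        """Store one request's token count and evaluator verdict."""
        self.events.append((tokens, flagged))

    def allow(self) -> Tuple[bool, str]:
        """Decide whether the next request may proceed, with a reason."""
        if not self.events:
            return True, "no data"
        total_tokens = sum(tokens for tokens, _ in self.events)
        flagged_rate = sum(1 for _, flagged in self.events if flagged) / len(self.events)
        if flagged_rate > self.max_flagged_rate:
            return False, "flagged rate too high"
        if total_tokens > self.max_tokens:
            return False, "token budget exceeded"
        return True, "ok"


guard = TelemetryGuardrail(window=10, max_flagged_rate=0.2, max_tokens=5000)
for _ in range(8):
    guard.record(tokens=100, flagged=False)
guard.record(tokens=100, flagged=True)
print(guard.allow())
```

Returning a reason alongside the decision means the guardrail's own verdicts can be recorded as telemetry too, so a blocked request is as traceable as a served one.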


Trust in AI systems doesn't come from understanding the models themselves, but from building robust observability around them. Using OpenTelemetry, developers can instrument AI workloads across four layers: development tooling (tracking AI coding assistants like Claude Code), operational metrics (token usage, cost, finish reasons), decision tracing (end-to-end traces of the agentic loop, with custom attributes explaining why each decision was made), and quality monitoring (using small language models as evaluators for hallucinations, toxicity, and policy violations). Real-time guardrails can then block harmful inputs and outputs before they reach users. The key insight is that traditional monitoring signals are insufficient for AI applications: you need to trace decision paths, not just inputs and outputs, and instrument specifically for AI workloads using the OpenTelemetry semantic conventions.
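
To make the operational-metrics and decision-tracing layers a little more concrete, here is a minimal sketch of the attributes one LLM call might carry on its span. It is an illustration under assumptions, not the speaker's code. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions; the `app.*` keys, the `LlmCall` type, the model name, and the prices are hypothetical. To stay dependency-free it builds a plain dict; with the OpenTelemetry Python API the same mapping could be passed to `span.set_attributes(...)`.

```python
from dataclasses import dataclass

# Hypothetical USD prices per 1,000 tokens; real prices vary by provider and model.
PRICE_PER_1K = {"example-model": {"input": 0.5, "output": 1.5}}


@dataclass
class LlmCall:
    model: str
    input_tokens: int
    output_tokens: int
    finish_reason: str
    decision_reason: str  # why the agent took this step (app-defined)


def span_attributes(call: LlmCall) -> dict:
    """Build span attributes for one LLM call in an agentic loop."""
    price = PRICE_PER_1K[call.model]
    cost = (call.input_tokens * price["input"]
            + call.output_tokens * price["output"]) / 1000
    return {
        # Operational metrics (GenAI semantic convention names).
        "gen_ai.request.model": call.model,
        "gen_ai.usage.input_tokens": call.input_tokens,
        "gen_ai.usage.output_tokens": call.output_tokens,
        "gen_ai.response.finish_reasons": [call.finish_reason],
        # Custom attributes (hypothetical names) for cost and decision tracing.
        "app.llm.cost_usd": round(cost, 6),
        "app.agent.decision_reason": call.decision_reason,
    }


attrs = span_attributes(LlmCall(
    model="example-model",
    input_tokens=1200,
    output_tokens=300,
    finish_reason="stop",
    decision_reason="retrieved documents were insufficient; calling search tool",
))
for key, value in attrs.items():
    print(f"{key} = {value}")
```

Recording the decision reason as an attribute on the same span as the token counts keeps the "why" next to the "how much", which is what lets a trace explain a decision path rather than just its inputs and outputs.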

Un-observable AI is un-trustworthy AI by Annie Freeman