Waterline Development, a water desalination startup, lost four months and $200,000 after LLMs like Grok and ChatGPT gave confidently wrong answers during materials science research. In response, they built Rozum, a multi-model orchestration system that runs an ensemble of commercial, open-weight, and domain-specialized AI models in parallel, then passes results through a deterministic verification layer to detect hallucinations, errant claims, and phony citations. Rozum outscored GPT-4, Grok 4, and Gemini 3.1 Pro on the Humanity's Last Exam benchmark, and its verification layer flagged unsupported claims in 76.2% of frontier model responses. The system is slower and more expensive than single-model solutions but targets high-stakes research and decision-making where accuracy outweighs cost. Rozum has now been spun out as a standalone AI startup and is available via waitlist.