Would you trust a medical system whose only metric was “which doctor wins the Internet?” No, you'd call that malpractice. Yet that's LMArena.

Hacker News

LMArena, a popular AI model leaderboard, is fundamentally flawed because it relies on casual internet users who prioritize superficial qualities like formatting, length, and emojis over factual accuracy. Analysis shows 52% of votes were questionable, with users consistently choosing confident-looking but incorrect answers over accurate ones. The system rewards models that game human attention spans rather than those that provide truthful responses, creating perverse incentives that push the entire AI industry toward optimizing for appearance over substance. This structural problem stems from using unpaid, unvetted volunteers with no quality control, making the leaderboard's influence on model development actively harmful to building reliable AI systems.

LMArena is a cancer on AI

Why It's Broken (And Why It Stays Broken)

If users like a response more and approve it for that reason, then I think that is an accurate metric.
AI isn’t trustworthy anyway, and I think the argument that this makes developers focus on appearance is wrong: if devs know one thing, it is that reliability (hallucinations) is one of the major limitations of LLMs that they need to solve.

Is it possible this post was generated entirely by AI?