Data scientists are not obsolete in the LLM era — their core skills are more critical than ever. The post argues that most AI engineering teams are failing at evaluation fundamentals that data scientists have long mastered. Five recurring eval pitfalls are covered: using generic off-the-shelf metrics instead of application-specific ones, trusting unverified LLM judges without treating them as classifiers, poor experimental design (synthetic test sets not grounded in real data, Likert scales instead of binary pass/fail), bad data and label practices, and over-automating work that requires human judgment. Each pitfall maps directly to a data science fundamental: EDA, model evaluation, experimental design, data collection, and production ML monitoring. The core message is that calling an LLM API doesn't eliminate the need for rigorous data examination, hypothesis-driven metrics, and skeptical validation — it just changes the surface where that work happens.
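One way to read "treating an LLM judge as a classifier" is to score the judge's binary pass/fail verdicts against human labels on the same examples, and to report true positive and true negative rates separately rather than a single agreement number. The sketch below is a minimal illustration under that assumption, not the post's own code: `validate_judge`, `JudgeReport`, and the toy labels are hypothetical names and made-up values chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class JudgeReport:
    true_positive_rate: float  # share of human "pass" items the judge also passed
    true_negative_rate: float  # share of human "fail" items the judge also failed
    n_pass: int                # number of human-labeled passes
    n_fail: int                # number of human-labeled failures


def validate_judge(human_labels, judge_labels):
    """Compare binary judge verdicts (True = pass) against human labels."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("human_labels and judge_labels must have equal length")

    tp = fn = tn = fp = 0
    for human, judge in zip(human_labels, judge_labels):
        if human and judge:
            tp += 1
        elif human and not judge:
            fn += 1
        elif not human and not judge:
            tn += 1
        else:
            fp += 1

    n_pass, n_fail = tp + fn, tn + fp
    if n_pass == 0 or n_fail == 0:
        # Both rates are undefined unless the labeled set has passes and failures.
        raise ValueError("labeled set needs at least one pass and one fail")

    return JudgeReport(tp / n_pass, tn / n_fail, n_pass, n_fail)


# Toy labels for illustration only (not real results).
human = [True, True, True, True, False, False, False, False]
judge = [True, True, True, False, False, False, True, True]
report = validate_judge(human, judge)
print(report)
```

On these toy labels the judge agrees with humans on 5 of 8 items, yet it catches only half of the human-labeled failures. That gap is what a single agreement figure tends to hide, especially when failures are rare in the labeled set.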
Table of contents
The Harness Is Data Science
Generic Metrics
Unverified Judges
Bad Experimental Design
Bad Data and Labels
Automating Too Much
Other Pitfalls
The Mapping
Video & Slides
Footnotes