A comprehensive exploration of using text embeddings to predict IMDb movie ratings, comparing traditional statistical models, neural networks, and training LLMs from scratch. The author processes IMDb datasets using Polars, generates embeddings with ModernBERT, and evaluates multiple modeling approaches including Support Vector Machines, MLPs, and custom transformer models. Results show that training a small LLM from scratch on raw JSON movie data achieved the best performance with an MSE of 1.026, outperforming both traditional models and pretrained embedding approaches.

24m read timeFrom minimaxir.com
Post cover image
Table of contents
About IMDb Data #The Initial Assignment and “Feature Engineering” #Creating And Visualizing the Movie Embeddings #Predicting Average IMDb Movie Scores #Conclusion #

Sort: