Classical ML models like tree-based ensembles typically require the entire dataset in memory, making them impractical for large-scale data. The Random Patches technique addresses this by training each tree on a random subset of both rows and columns (a "patch") and combining the trees into an ensemble. This extends the core idea of Bagging: building maximally diverse trees reduces the ensemble's variance. Empirical results across 13 datasets show it often outperforms traditional random forests. The approach only works in an ensemble setting, but it enables training on datasets that don't fit in memory without resorting to big-data frameworks like Spark MLlib.
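
As a concrete illustration, scikit-learn's `BaggingClassifier` supports this row-and-column sampling directly through its `max_samples` and `max_features` parameters. The snippet below is a minimal sketch of the idea; the synthetic dataset and the specific fractions are illustrative choices, not values from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset (stand-in for a large real one).
X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree (the default base estimator) sees a random "patch":
# a subset of the rows and a subset of the columns.
patches = BaggingClassifier(
    n_estimators=100,
    max_samples=0.1,           # each tree trains on 10% of the rows
    max_features=0.5,          # ...and sees 50% of the columns
    bootstrap=False,           # sample rows without replacement
    bootstrap_features=False,  # sample columns without replacement
    random_state=0,
)
patches.fit(X_train, y_train)
print(f"Held-out accuracy: {patches.score(X_test, y_test):.3f}")
```

Note that this sketch still holds the full array in memory. For truly out-of-core training, each patch would be loaded from disk (for example, a row/column slice of a Parquet file) before fitting its tree, so that only one small patch and the fitted trees are ever resident at once.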
