Kimi researchers introduced Attention Residuals, a new approach to residual connections in Transformers that addresses the PreNorm dilution problem. Instead of summing the outputs of all previous layers with a fixed weight of 1, each layer uses softmax attention to selectively weight the contributions of prior layers. A practical variant
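The post only describes the mechanism at a high level, so the snippet below is a hedged sketch rather than the authors' code: a small PyTorch module (the name `AttentionOverLayers` and all shapes and projections are assumptions) that forms the residual stream as a per-token softmax-weighted mix of all prior layer outputs, which a standard block then consumes in place of the usual equal-weight sum.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOverLayers(nn.Module):
    """Sketch: softmax-weighted mix of prior layer outputs (not the paper's exact design)."""
    def __init__(self, d_model: int, d_key: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_key)  # query from the most recent hidden state
        self.k_proj = nn.Linear(d_model, d_key)  # key from each prior layer's output
        self.scale = d_key ** -0.5

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history: one (batch, seq, d_model) tensor per prior layer, including the embeddings
        stack = torch.stack(history, dim=2)          # (batch, seq, n_layers, d_model)
        q = self.q_proj(history[-1]).unsqueeze(2)    # (batch, seq, 1, d_key)
        k = self.k_proj(stack)                       # (batch, seq, n_layers, d_key)
        scores = (q * k).sum(-1) * self.scale        # (batch, seq, n_layers)
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)
        return (weights * stack).sum(dim=2)          # weighted residual stream

class Block(nn.Module):
    """A toy block that uses the weighted mix instead of an equal-weight residual sum."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.residual_mix = AttentionOverLayers(d_model)
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, history: list[torch.Tensor]) -> list[torch.Tensor]:
        residual = self.residual_mix(history)        # replaces sum(history)
        x = self.norm(residual)
        out, _ = self.attn(x, x, x, need_weights=False)
        history.append(residual + out)
        return history
```

In a plain PreNorm stack every earlier output enters the residual stream with the same weight, so individual contributions get diluted as depth grows; letting each layer reweight its own history is the change this sketch tries to illustrate.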
Table of contents
- MaxClaw now supports Multi-Agent teams
- A new way to handle residual connections in Transformers
- P.S. For those wanting to develop “Industry ML” expertise