SpidR is a self-supervised speech representation model that learns linguistic units from unlabeled audio using masked prediction, self-distillation, and online clustering. The model can be pretrained in 15-24 hours on 16 GPUs and outperforms previous methods on language modeling tasks. The repository provides pretrained models.
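To make the two training ingredients named above concrete, here is a minimal, hypothetical sketch of (a) a teacher kept as an exponential moving average of the student (a common way to realize self-distillation; not necessarily SpidR's exact recipe) and (b) an online clustering step that turns teacher features into discrete pseudo-labels, which the student would then predict at masked positions. All names and parameters are illustrative.

```python
import random

def ema_update(teacher, student, decay=0.999):
    # Self-distillation: the teacher's parameters track an exponential
    # moving average (EMA) of the student's parameters.
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]

def nearest_centroid(frame, centroids):
    # The pseudo-label for a frame is the index of its nearest centroid.
    dists = [sum((f - c) ** 2 for f, c in zip(frame, cent)) for cent in centroids]
    return dists.index(min(dists))

def online_cluster_step(frames, centroids, lr=0.1):
    # Online clustering: assign each teacher frame a discrete unit, then
    # nudge the chosen centroid toward the frame (a tiny online k-means step).
    labels = []
    for frame in frames:
        k = nearest_centroid(frame, centroids)
        labels.append(k)
        centroids[k] = [c + lr * (f - c) for c, f in zip(centroids[k], frame)]
    return labels, centroids

# Toy run: 8 frames of 4-dim "teacher features", clustered into 3 units.
random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
centroids = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
labels, centroids = online_cluster_step(frames, centroids)
```

In a masked-prediction setup, `labels` would serve as targets only at masked time steps, and `ema_update` would run after every student optimizer step.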
