SpidR is a self-supervised speech representation model that learns linguistic units from unlabeled audio using masked prediction, self-distillation, and online clustering. The model can be pretrained in 15-24 hours on 16 GPUs and outperforms previous methods on language modeling tasks. The repository provides pretrained checkpoints (SpidR and DinoSR on LibriSpeech 960h), training code, and utilities for SLURM cluster deployment. Models require audio standardization and are available via PyPI or torch.hub.
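Since the models require standardized input audio, a minimal sketch of per-utterance standardization (zero mean, unit variance) is shown below. The function name `standardize` and the exact epsilon are illustrative assumptions, not the repository's API; consult the repository's documentation for the precise preprocessing it expects.

```python
import numpy as np

def standardize(waveform: np.ndarray) -> np.ndarray:
    """Standardize a mono waveform to zero mean and unit variance.

    Hypothetical helper: SpidR expects standardized audio, but the
    repository's own preprocessing may differ in detail.
    """
    waveform = waveform.astype(np.float32)
    # Small epsilon guards against division by zero on silent audio.
    return (waveform - waveform.mean()) / (waveform.std() + 1e-8)

# Example on a synthetic 1-second signal at 16 kHz (LibriSpeech's rate).
audio = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
out = standardize(audio)
```

The same transform applies whether the checkpoint is loaded from PyPI or via `torch.hub`.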
