A team built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. For codon-level language modeling they compared several transformer architectures; CodonRoBERTa-large-v2 performed best, with a perplexity of 4.10 and a Spearman correlation of 0.40 with the Codon Adaptation Index (CAI), outperforming ModernBERT. The project then scaled to 25 species, training 4 production models in 55 GPU-hours for $165 total (about $3 per GPU-hour), and produced a species-conditioned system that the team claims is unique among open-source projects.
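The headline numbers can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not the team's evaluation code: it assumes perplexity is defined in the usual way, as the exponential of the mean per-token negative log-likelihood in nats, and it assumes a 64-codon vocabulary for the uniform-guess baseline. The function names are hypothetical.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity as exp(mean negative log-likelihood), assuming natural-log loss."""
    return math.exp(mean_nll_nats)


def implied_hourly_rate(total_cost_usd: float, gpu_hours: float) -> float:
    """Average cost per GPU-hour implied by a total bill and total GPU-hours."""
    return total_cost_usd / gpu_hours


# Figures reported in the summary.
reported_ppl = 4.10
total_cost_usd = 165.0
gpu_hours = 55.0

# Mean loss (in nats) that corresponds to the reported perplexity.
mean_nll = math.log(reported_ppl)

# Baseline: a model guessing uniformly over 64 codons (assumed vocabulary size)
# has perplexity equal to the vocabulary size.
uniform_ppl = perplexity(math.log(64))

rate = implied_hourly_rate(total_cost_usd, gpu_hours)

print(f"mean NLL for perplexity {reported_ppl}: {mean_nll:.3f} nats")
print(f"uniform 64-codon baseline perplexity: {uniform_ppl:.1f}")
print(f"implied cost per GPU-hour: ${rate:.2f}")
```

Under these assumptions, a perplexity of 4.10 means the model is, on average, about as uncertain as if it were choosing among roughly four equally likely codons per position, compared with 64 for a uniform guess.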