Training mRNA Language Models Across 25 Species for $165

OpenMed built an end-to-end protein AI pipeline covering structure prediction (ESMFold), sequence design (ProteinMPNN), and codon optimization using transformer-based language models. After comparing multiple architectures, CodonRoBERTa-large-v2 (312M params, RoBERTa-based) reached roughly 6x lower perplexity than ModernBERT (4.10 vs 26.24) and achieved a CAI Spearman correlation of 0.404. Key finding: pre-trained NLP weights actively hurt biological modeling, and hyperparameter tuning alone (halving the learning rate, doubling warmup) improved biological alignment 16x without changing the architecture. The team then scaled to 25 species across bacteria, yeast, and mammals, training a universal multispecies base model plus three species-specific specialists (human, E. coli, CHO) in 55 GPU-hours for ~$165. The human specialist achieved the best perplexity (24.3) and is targeted at mRNA therapeutics and vaccines. All models are released on Hugging Face under Apache 2.0.
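As a quick sanity check on the headline numbers, the sketch below recomputes the perplexity ratio and the GPU cost implied by the summary. The per-GPU-hour rate is derived from the quoted totals rather than stated in the source. The `perplexity` helper and the 64-codon uniform baseline are illustrative assumptions; the models' actual vocabulary may include special tokens.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Figures quoted in the summary.
ppl_codonroberta = 4.10
ppl_modernbert = 26.24
gpu_hours = 55
total_cost_usd = 165

# How many times lower CodonRoBERTa's perplexity is than ModernBERT's.
ppl_ratio = ppl_modernbert / ppl_codonroberta

# Implied average price per GPU-hour (derived, not stated in the source).
cost_per_gpu_hour = total_cost_usd / gpu_hours

# Assumed reference point: a model that spreads probability uniformly over
# the 64 codons assigns each token an NLL of ln(64).
uniform_ppl = perplexity([math.log(64)] * 10)

print(f"perplexity ratio: {ppl_ratio:.1f}x")
print(f"implied cost: ${cost_per_gpu_hour:.2f} per GPU-hour")
print(f"uniform 64-codon baseline perplexity: {uniform_ppl:.1f}")
```

Under these assumptions the ratio works out to about 6.4, which the summary rounds to 6x, and the budget corresponds to about $3 per GPU-hour.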

#machine-learning

#nlp

#biotech

Apr 20 • 34m read time • From huggingface.co

Table of contents

Part II: Building the Pipeline, From Structure Prediction to Codon Optimization
1. What We Built
2. The Architecture Exploration
   The Contenders
   The Training Setup
   The Results
   What We Learned
3. The Pipeline
   3.1 Protein Folding with ESMFold
       Our Results: 30 Protein Chains
       Running ESMFold
   3.2 Sequence Design with ProteinMPNN
       Our Results: Scaffold 7K00
   3.3 mRNA Optimization
       CodonRoBERTa: Our Best Model
       Evaluation: Three Metrics That Matter
       Running the Evaluations
       The Final Leaderboard
       Using the Model
4. Scaling to Multi-Species
   The Data Engineering Challenge
   The Tokenization Innovation
   Training the Universal Base Model
   Species-Specific Fine-Tuning
   The Complete Model Suite
   Production Deployment Strategy
   Infrastructure & Reproducibility
   What This Enables
   By the Numbers
5. The End-to-End Workflow
6. Where This Stands and What's Next
   The Landscape
   In Progress: CodonJEPA
   Roadmap
   Setup and Requirements
7. References
   Key Papers
Models and Data: Coming Soon
