OpenMed built an end-to-end protein AI pipeline covering structure prediction (ESMFold), sequence design (ProteinMPNN), and codon optimization using transformer-based language models. After comparing multiple architectures, CodonRoBERTa-large-v2 (312M parameters, RoBERTa-based) reached roughly 6x lower perplexity than ModernBERT (4.10 vs 26.24) and a Spearman correlation of 0.404 with the Codon Adaptation Index (CAI). Key findings: pre-trained NLP weights actively hurt biological modeling, and hyperparameter tuning alone (halving the learning rate, doubling the warmup) improved biological alignment 16x without changing the architecture. The team then scaled to 25 species across bacteria, yeast, and mammals, training a universal multispecies base model plus three species-specific specialists (human, E. coli, CHO) in 55 GPU-hours for ~$165. The human specialist achieved the best perplexity of the suite (24.3) and is targeted at mRNA therapeutics and vaccines. All models are released on Hugging Face under Apache 2.0.
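
The summary reports a Spearman correlation against CAI without saying how CAI was computed. As a rough illustration only, the sketch below follows the standard textbook definition: each codon gets a weight equal to its count in a reference set divided by the count of the most frequent synonymous codon, and a sequence's CAI is the geometric mean of those weights. The tiny codon table, the reference sequence, and the helper names (`split_codons`, `relative_adaptiveness`, `cai`) are all hypothetical and are not taken from the OpenMed code. Codons unseen in the reference are skipped here, a simplification that real tools often replace with a pseudocount.

```python
import math
from collections import Counter, defaultdict

# Illustrative subset of the standard genetic code (assumption: the toy
# sequences below only use these three amino acids).
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "GAA": "E", "GAG": "E",
}


def split_codons(seq):
    """Split a coding sequence into consecutive 3-letter codons."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def relative_adaptiveness(reference_seqs):
    """Weight each codon by count / max count among its synonymous codons."""
    counts = Counter()
    for seq in reference_seqs:
        counts.update(split_codons(seq))

    codons_by_aa = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        codons_by_aa[aa].append(codon)

    weights = {}
    for codons in codons_by_aa.values():
        best = max(counts[c] for c in codons)
        for c in codons:
            # Simplification: codons never seen in the reference get no weight.
            if best > 0 and counts[c] > 0:
                weights[c] = counts[c] / best
    return weights


def cai(seq, weights):
    """Geometric mean of codon weights, ignoring codons without a weight."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference set standing in for highly expressed genes.
reference = ["GCCGCCGCTAAGAAGAAAGAGGAA"]
weights = relative_adaptiveness(reference)

preferred = cai("GCCAAGGAG", weights)  # uses the most frequent codon each time
mixed = cai("GCTAAAGAA", weights)      # uses two less frequent codons
```

A sequence built only from the most frequent reference codons scores the maximum CAI, while one using rarer synonymous codons scores lower. Ranking a model's per-sequence scores against values like these is presumably what a CAI Spearman correlation measures.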

34m read time. From huggingface.co
Table of contents
Part II: Building the Pipeline, From Structure Prediction to Codon Optimization
1. What We Built
2. The Architecture Exploration
    The Contenders
    The Training Setup
    The Results
    What We Learned
3. The Pipeline
    3.1 Protein Folding with ESMFold
        Our Results: 30 Protein Chains
        Running ESMFold
    3.2 Sequence Design with ProteinMPNN
        Our Results: Scaffold 7K00
    3.3 mRNA Optimization
        CodonRoBERTa: Our Best Model
        Evaluation: Three Metrics That Matter
        Running the Evaluations
        The Final Leaderboard
        Using the Model
4. Scaling to Multi-Species
    The Data Engineering Challenge
    The Tokenization Innovation
    Training the Universal Base Model
    Species-Specific Fine-Tuning
    The Complete Model Suite
    Production Deployment Strategy
    Infrastructure & Reproducibility
    What This Enables
    By the Numbers
5. The End-to-End Workflow
6. Where This Stands and What's Next
    The Landscape
    In Progress: CodonJEPA
    Roadmap
    Setup and Requirements
7. References
    Key Papers
Models and Data: Coming Soon
