AIニュース最前線

Hugging Face Blog · March 31, 2026, 17:23 · about a 5-minute read

Training mRNA Language Models for 25 Species on $165

#mRNALanguageModels #ProteinAI #Bioinformatics #Transformers #OpenSource #MedicalAI
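TL;DR

OpenMed announced that it trained mRNA language models covering 25 species at low cost and built an end-to-end AI pipeline that runs from protein structure prediction through codon optimization.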

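AI in-depth analysis (April 8, 2026, 19:42): importance 4 out of 5. Depth (weight 40%): 5. Relevance (weight 30%): 5. Practicality (weight 20%): 4. Novelty (weight 10%): 4.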

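Key points

  1. Large-scale training at low cost: mRNA language models covering 25 species were trained in 55 GPU-hours for $165, and a species-conditioned system was built on top of them.
  2. Identifying the best architecture: after comparing several transformer architectures, the team found that CodonRoBERTa-large-v2 (perplexity 4.10, Spearman CAI correlation 0.40) significantly outperforms ModernBERT.
  3. An end-to-end pipeline: a complete three-stage pipeline (protein structure prediction, sequence design, mRNA optimization) takes a therapeutic protein from concept to a synthesis-ready DNA sequence in a short time.
  4. Open-source transparency: the team published not only the successes but also the trial and error, runnable code, and full results, aiming to contribute to the research community.
  5. Architecture search for mRNA optimization: the search for the best transformer architecture for codon-level language modeling started from a CodonBERT baseline and expanded to the ModernBERT and RoBERTa families.
  6. Why codon optimization matters, and why it was built in-house: codon optimization is important for therapeutic mRNA and vaccine production, so the team developed its own models, training infrastructure, and evaluation metrics rather than relying on existing tools.
  7. RoBERTa far outperforms ModernBERT: the RoBERTa architecture reached roughly 6x better perplexity than ModernBERT (4.01 vs 26.24), a clear advantage for modeling codon sequences.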


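Impact analysis

This work shows that AI could make drug discovery and biotechnology workflows substantially more efficient. Low-cost training of models that cover many species, in particular, widens access for research institutes and small companies. Publishing the work transparently as open source is an important contribution that can speed up progress across the community.

Editorial comment

Training multi-species mRNA models at this cost is the standout result. It is an important step toward lowering the barrier to practical use at the intersection of AI and biotechnology.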


Training mRNA Language Models Across 25 Species for $165

Part II: Building the Pipeline, From Structure Prediction to Codon Optimization

By OpenMed, Open-Source Agentic AI for Healthcare & Life Sciences

TL;DR: We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.

The Architecture Exploration

The Pipeline

3.1 Protein Folding

3.2 Sequence Design

3.3 mRNA Optimization

Scaling to Multi-Species

The End-to-End Workflow

Where This Stands and What's Next

Imagine going from a therapeutic protein concept to a synthesis-ready, codon-optimized DNA sequence in an afternoon. That is the pipeline OpenMed set out to build, and this post documents the process from start to finish.

In Part I, we mapped the landscape of protein AI: the architectures powering structure prediction, the open-source tools available for protein design, and the ecosystem of models from AlphaFold to ESMFold. That was a survey. This is the build.

At OpenMed, we set out to build a complete pipeline that takes a protein idea from concept to expression-ready DNA. That means three stages: predict the 3D structure of a protein, design amino acid sequences that fold into that structure, and optimize the underlying DNA codons so the protein actually expresses in the target organism. Along the way, we ran extensive experiments comparing transformer architectures for codon optimization, scaled our best model to 25 species, and built tooling that ties it all together.

This is not a polished success story. It is a transparent account of what worked, what surprised us, and what we would do differently, with runnable code and full results at every step.

  1. What We Built

The pipeline has three components, each addressing a different stage of the protein engineering workflow described in Part I. Structure prediction determines what shape a protein takes. Sequence design determines which amino acids will produce that shape. Codon optimization determines which DNA will produce those amino acids efficiently in a living cell.

Protein Folding | ESMFold v1 predictions on 30 protein chains | Avg PTM: 0.79, working batch pipeline

Sequence Design | ProteinMPNN on scaffold 7K00 | 42% sequence recovery

mRNA Optimization | Trained multiple transformer variants on 250k CDS, then scaled to 381k sequences across 25 species | CodonRoBERTa-large-v2: perplexity 4.10, CAI 0.40; Multi-species suite: 4 models spanning 25 organisms (55 GPU-hours)

The mRNA optimization work is where we invested the most effort, and where we have the most to share. The folding and design components use established tools (ESMFold from Meta, ProteinMPNN from the Baker Lab, both covered in depth in Part I). The codon optimization component is entirely ours: new models, new training infrastructure, new evaluation metrics.

  2. The Architecture Exploration

In Part I, we surveyed the protein AI landscape and noted that most biological language models are adaptations of NLP architectures. The open question was which architecture. BERT variants dominate protein modeling (ESM-2, ProtTrans), but codon sequences have different statistical properties than both natural language and amino acid sequences. Codons are triplets drawn from a small 64-token alphabet, with strong positional dependencies and species-specific usage biases. We needed to find out what works from first principles.

The core question: which transformer architecture works best for codon-level language modeling?

This matters because codon optimization is crucial for therapeutic mRNA, vaccines, and recombinant protein production. The genetic code is degenerate: the same protein can be encoded by astronomically many different DNA sequences, but some codon arrangements express 100x better than others. The Pfizer-BioNTech COVID vaccine, for example, was codon-optimized for human expression. We wanted to build a model that could learn these preferences directly from natural coding sequences, rather than relying on hand-crafted frequency tables.

We started with a small CodonBERT baseline (6M params, following Sanofi's published architecture) and scaled up through two families: ModernBERT, which represented the latest efficiency innovations from the NLP community, and RoBERTa, the proven workhorse behind Meta's ESM protein language models.

CodonBERT (baseline) | BERT-tiny (6 layers) | Minimal baseline to establish floor performance

ModernBERT-base | ModernBERT (22 layers, RoPE) | Modern innovations: long context, efficient attention

CodonRoBERTa-base | RoBERTa (12 layers) | Proven MLM architecture, same family as ESM-2

CodonRoBERTa-large | RoBERTa (24 layers) | Test whether more parameters improve codon modeling

CodonRoBERTa-large-v2 | RoBERTa (24 layers, refined) | Same architecture, better hyperparameters

The choice of RoBERTa was deliberate. As we discussed in Part I, Meta's ESM-2 (which powers ESMFold) is itself a RoBERTa variant trained on protein sequences. We hypothesized that the same architecture family that learned amino acid patterns might also learn codon patterns. ModernBERT was the counterpoint: a 2024 architecture with RoPE embeddings, Flash Attention, and alternating local/global attention layers, representing everything the NLP community had learned since RoBERTa's 2019 release.

The Training Setup

To ensure a fair comparison, every model was trained on identical data with the same evaluation protocol. We used 250,000 coding sequences (CDS) from E. coli RefSeq, covering chromosome and complete assembly accessions. This is a clean, well-annotated dataset where codon usage patterns are well-characterized in the literature, giving us ground truth to validate against.

Our tokenizer maps each codon to a single token: 64 codons plus 5 special tokens (PAD, UNK, CLS, SEP, MASK) for a 69-token vocabulary. This is intentionally minimal. Unlike BPE tokenizers used in NLP, where subword boundaries are statistically learned, codon boundaries are biologically defined. Every three nucleotides encode one amino acid. Our tokenizer respects this.
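The vocabulary described above is small enough to write out in full. The following is a minimal sketch, not the project's actual CodonTokenizer: the token spellings (such as "[PAD]"), the id ordering, and the function name encode_cds are assumptions made for illustration.

```python
from itertools import product

# Assumed spellings and ordering; the source only names PAD, UNK, CLS, SEP, MASK.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 triplets
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + CODONS)}


def encode_cds(dna: str) -> list[int]:
    """Split a coding sequence into codons and map each codon to one token id."""
    dna = dna.upper().replace("U", "T")  # accept RNA spelling as well
    if len(dna) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    ids = [VOCAB.get(c, VOCAB["[UNK]"]) for c in codons]  # e.g. 'NNN' -> UNK
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


print(len(VOCAB))
print(encode_cds("ATGGCTAAAGGT"))
```

Because codon boundaries are fixed at every third nucleotide, the encoder needs no learned merges, which is the contrast with BPE drawn above.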

Training ran on 4 A100 GPUs (80GB) with FSDP sharding, using 15,000 to 25,000 steps depending on model size. All models used masked language modeling (MLM) with 15% masking rate, the same objective used by ESM-2 for protein sequences.
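As a rough illustration of the 15% masking objective, the sketch below masks a random subset of codon positions and builds the matching labels. It is a simplified assumption rather than the project's data collator: it always substitutes the mask token (the 80/10/10 replacement scheme from BERT is omitted), and the function name mask_codons and its parameters are hypothetical.

```python
import random


def mask_codons(token_ids, mask_id, special_ids, rate=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    labels carry the original id at masked positions and -100 elsewhere,
    the value PyTorch's cross-entropy loss ignores by default.
    """
    rng = random.Random(seed)
    # Never mask special tokens such as CLS or SEP.
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    n_mask = max(1, round(rate * len(candidates))) if candidates else 0
    chosen = set(rng.sample(candidates, n_mask))
    inputs = [mask_id if i in chosen else t for i, t in enumerate(token_ids)]
    labels = [t if i in chosen else -100 for i, t in enumerate(token_ids)]
    return inputs, labels
```

The loss is then computed only at the masked positions, so the model is scored on how well it reconstructs codons from the surrounding context.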

[Table: architecture comparison, including a Synonymous Recovery column. Rows: CodonRoBERTa-large-v2; CodonRoBERTa-base (Best Efficiency); CodonRoBERTa-large (Good MLM, weak bio); ModernBERT-base; CodonBERT (baseline).]

The result was unambiguous: RoBERTa outperformed ModernBERT by 6x on perplexity (4.01 vs 26.24). This was not a marginal difference. ModernBERT, despite its modern attention patterns and efficient architecture, fundamentally underperformed the classic RoBERTa design on codon sequences.

What We Learned

  1. Pre-trained NLP weights do not transfer to biology

We initialized ModernBERT from its published English-language checkpoint, expecting the learned attention patterns to provide a useful starting point. They did not. Our best explanation: ModernBERT's pre-training on English text instilled inductive biases (subword frequency distributions, positional attention patterns) that actively interfere with learning codon statistics. RoBERTa, initialized randomly and trained purely on biological data, had no such baggage. This aligns with what the field has seen more broadly: ESM-2 and ProtTrans both train from scratch on biological data rather than fine-tuning from NLP checkpoints.

  2. Hyperparameter tuning unlocked biological alignment

This was the most surprising and practically important finding of the exploration. Compare CodonRoBERTa-large v1 and v2:

v2 (lr=5e-5, longer warmup)

Same architecture. Same data. Same number of parameters. The only differences: half the learning rate and a longer warmup (2,000 steps vs 1,000). Yet v2's predicted codon likelihoods are 16x better correlated with real biological codon preferences, as measured by Codon Adaptation Index.
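To make the two schedules concrete, here is a small sketch of a learning-rate curve with linear warmup. The v2 values (5e-5 peak, 2,000 warmup steps, 25,000 total steps) come from the text and config; the v1 peak of 1e-4 is inferred from "half the learning rate", and the linear decay after warmup is purely an assumption, since the post does not state the decay shape.

```python
def learning_rate(step, peak_lr=5e-5, warmup_steps=2000, max_steps=25000):
    """Linear warmup to peak_lr, then linear decay to zero (decay shape assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / (max_steps - warmup_steps)


# Compare the early steps of the two runs (v1 values are inferred, see above).
for step in (250, 500, 1000, 2000):
    v1 = learning_rate(step, peak_lr=1e-4, warmup_steps=1000)
    v2 = learning_rate(step)
    print(f"step {step:>5}: v1 lr={v1:.2e}  v2 lr={v2:.2e}")
```

Under these assumptions v2 takes much smaller steps throughout the first 2,000 updates, which is the "slower training schedule" the comparison refers to.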

The perplexity actually got slightly worse (4.10 vs 4.01), which means v2 is marginally less accurate at predicting the exact masked codon. But it is dramatically better at predicting codons that biology actually uses. The slower training schedule let the model settle into representations that capture genuine biological signal rather than overfitting to surface statistics.

This is a crucial insight for anyone training biological language models: MLM loss alone does not measure biological relevance. Domain-specific metrics are essential. In our case, CAI correlation turned out to be the metric that separates a useful model from a technically impressive but biologically meaningless one.

  3. The base model is remarkably efficient

CodonRoBERTa-base (92M params) achieved nearly identical perplexity to the large model (4.01 vs 4.10) with 3.4x fewer parameters and proportionally less training time. Its CAI correlation (0.219) is lower than v2's (0.404), but still well above the baseline and ModernBERT. For teams without access to multi-GPU clusters, the base model is a practical choice that captures most of the codon modeling performance at a fraction of the cost.

  3. The Pipeline

In Part I, we described the three-stage workflow that most computational protein engineering projects follow: predict structure, design sequences, optimize codons. Here we run each stage with real data and report what we actually got.

Fold: Predict the 3D structure (ESMFold)

Design: Generate sequences that fold into that structure (ProteinMPNN)

Optimize: Choose the best codons for expression (CodonRoBERTa)

3.1 Protein Folding with ESMFold

ESMFold architecture. The model parses a single amino acid sequence through the ESM-2 encoder, then predicts 3D coordinates via a folding trunk and structure module. Figure from Bertoline et al., Biomolecules 2024, CC-BY 4.0.

As covered in Part I, ESMFold is Meta's single-sequence structure predictor. Its backbone is ESM-2, a protein language model family that scales up to 15 billion parameters and was trained on roughly 65 million UniRef sequences (the released ESMFold v1 checkpoint uses the 3-billion-parameter ESM-2). The key advantage over AlphaFold 2 is speed: ESMFold skips the computationally expensive multiple sequence alignment (MSA) step and predicts structures directly from a single amino acid sequence. That brings prediction time down to seconds per protein instead of hours.

The tradeoff is accuracy. ESMFold achieves ~0.87 TM-score on CASP14 targets vs. AlphaFold's ~0.92. For rapid prototyping and candidate screening, that gap is acceptable. When a pipeline generates 100 designed sequences and needs to refold all of them to check viability, speed matters more than the last few percentage points of accuracy.

Our Results: 30 Protein Chains

We ran ESMFold on 30 protein chains sourced from the Protein Data Bank. These are real experimental structures with known ground truth, spanning sequence lengths from 211 to 519 residues. The set deliberately includes both easy targets (single-domain proteins) and challenging ones (chains from a multi-chain ribosomal complex, PDB 7K00) to stress-test the model.

import json

# Load our actual results
with open('outputs/esmfold_metrics.json') as f:
    metrics = json.load(f)

# Summary statistics
n_chains = len(metrics)  # 30
avg_plddt = sum(m['mean_plddt'] for m in metrics) / n_chains  # 33.8
avg_ptm = sum(m['ptm'] for m in metrics) / n_chains  # 0.79

print(f"Chains: {n_chains}")
print(f"Average pLDDT: {avg_plddt:.1f}")
print(f"Average PTM: {avg_ptm:.2f}")

Results breakdown:

Chains predicted | 30

Average pLDDT | 33.8 | Per-residue confidence (lower than expected)

Average PTM | 0.79 | Topology confidence (good)

Sequence lengths | 211 to 519 residues | Typical protein sizes

The PTM scores are solid: anything above 0.5 suggests the model has the overall topology correct, and our average of 0.79 indicates high confidence in the predicted folds. The pLDDT scores are lower than published ESMFold benchmarks, which initially concerned us. The explanation turned out to be our test set composition: the ribosomal chains from 7K00 are part of a large multi-chain complex, and ESMFold (which predicts single chains in isolation) cannot model the inter-chain contacts that stabilize these structures. For single-domain proteins in our set, pLDDT scores were consistently above 70.

Running ESMFold

# Activate environment
source .env_esmfold/bin/activate

# Batch prediction
python scripts/esmfold_batch.py \
  --seq_dir data/pdb/sequences \
  --out_dir data/esmfold/out \
  --metrics outputs/esmfold_metrics.json \
  --device cuda:0

Each prediction takes ~10-30 seconds on an A100. The output includes:

PDB structure files

pLDDT scores (per-residue confidence, 0-100)

PTM scores (topology confidence, 0-1)

Predicted Aligned Error (PAE) matrices

3.2 Sequence Design with ProteinMPNN

ProteinMPNN architecture. (A) The encoder processes backbone atom distances; the decoder autoregressively generates amino acid sequences. (B) Random decoding order improves diversity. (C) Tied positions enable symmetric and multi-state design. Figure from Dauparas et al., Science 2022, CC-BY 4.0.

As we described in Part I, protein design is the inverse of protein folding. Folding goes sequence to structure: given amino acids, predict the 3D shape. Inverse folding goes the other way: given a target 3D shape, find amino acid sequences that will fold into it.

ProteinMPNN, from David Baker's lab at the University of Washington, is the current gold standard for this task. It was published in Science in 2022 and has since been validated experimentally: designed sequences fold into their target structures at rates far exceeding random or earlier computational methods. The architecture treats the protein backbone as a graph, where nodes are amino acid positions and edges connect spatially proximate residues (K-nearest neighbors in 3D). A message-passing neural network propagates information through this graph, then autoregressively generates a sequence one residue at a time.
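The graph construction described above can be sketched in a few lines. This is an illustrative toy, not ProteinMPNN's implementation: it takes one 3D point per residue, uses brute-force distances, and treats k as a free parameter; the function name knn_edges is hypothetical.

```python
import math


def knn_edges(coords, k):
    """For each residue (node), return the indices of its k nearest residues in 3D."""
    edges = {}
    for i, ci in enumerate(coords):
        # Distance from residue i to every other residue, nearest first.
        dists = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i
        )
        edges[i] = [j for _, j in dists[:k]]
    return edges


# Four residues on a line; the last one is far from the rest.
toy_backbone = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(knn_edges(toy_backbone, k=2))
```

Each node's neighbor list defines the edges along which a message-passing network would exchange information, so residues that are close in space influence each other even when they are far apart in sequence.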

Our Results: Scaffold 7K00

We ran ProteinMPNN on PDB structure 7K00 (a large multi-chain ribosomal complex):

python proteinmpnn/protein_mpnn_run.py \
  --pdb_path data/pdb/raw/7K00.cif \
  --out_folder outputs/proteinmpnn_smoke \
  --num_seq_per_target 3 \
  --sampling_temp 0.1

Sequences generated

Sequence recovery

Here's what the output looks like:

>7K00, score=1.7100, global_score=1.7100
GIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKI/MKRTFQPSVLK...
>T=0.1, sample=1, score=0.8857, seq_recovery=0.4203
SKKVVIKLVCSCGCGFEYCDFRDIEKNPEKIERVLYCPICQKYVLFTEAPL/PPGPFRPDREV...

The first line is the native (natural) sequence extracted from the crystal structure. Subsequent lines are ProteinMPNN's designed variants. At temperature 0.1 (low randomness), the model recovers ~42% of the original amino acids, purely from 3D geometry. This is a strong result: it means the model independently rediscovered nearly half the residues that evolution selected, using only the backbone coordinates as input.

Several practical notes from running ProteinMPNN. Scores are negative log-likelihoods, so lower is better. The 42% recovery rate is typical for well-resolved structures and consistent with the original paper's benchmarks. Higher sampling temperatures produce more diverse but riskier sequences. For real design work, the most powerful feature is partial design: catalytic residues, binding site amino acids, or any positions with known functional importance can be fixed in place, while ProteinMPNN redesigns only the scaffold around them. This is the standard approach for engineering a more stable version of an enzyme without disrupting its active site.
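Sequence recovery is simply the fraction of positions where the designed residue matches the native one. The sketch below is an assumed re-implementation for illustration (the function name sequence_recovery is hypothetical); it skips the "/" chain separators seen in the output above.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of aligned residue positions where designed == native."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    # Ignore chain separators so they do not count as matches.
    pairs = [(a, b) for a, b in zip(native, designed) if a != "/"]
    matches = sum(a == b for a, b in pairs)
    return matches / len(pairs)


# First-chain fragments copied from the output shown earlier (the rest is elided there).
native_fragment = "GIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKI"
designed_fragment = "SKKVVIKLVCSCGCGFEYCDFRDIEKNPEKIERVLYCPICQKYVLFTEAPL"
print(f"{sequence_recovery(native_fragment, designed_fragment):.2f}")
```

The value for this fragment will not match the reported seq_recovery=0.4203, which is computed over the full designed sequence rather than the visible excerpt.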

3.3 mRNA Optimization

This is where the pipeline transitions from existing tools to our own models. ESMFold and ProteinMPNN are established, well-validated software that we integrated. Codon optimization is where we built something new.

Why Codon Choice Matters

Codon usage frequencies vary dramatically between organisms. These heatmaps compare codon preferences across E. coli, yeast, and CHO cells, the three expression hosts covered by our multi-species models. Figure from Kim et al., J. Microbiol. Biotechnol. 2025, CC-BY 4.0.

The genetic code is degenerate: most amino acids are encoded by multiple codons. Leucine, for example, has six: TTA, TTG, CTT, CTC, CTA, and CTG. All six produce the same amino acid in the final protein. Methionine and tryptophan are the exceptions, with only one codon each.

This redundancy means that for any given protein, there are astronomically many DNA sequences that encode it. A typical 300-amino-acid protein has roughly 10^150 possible codon combinations. They all produce the same amino acid chain, but they do not all produce the same amount of protein. Codon choice affects translation speed (because tRNA molecules are not equally abundant for all codons), mRNA stability (because the nucleotide sequence affects how quickly the transcript degrades), co-translational folding (because translation pauses at rare codons give the protein time to fold), and immune recognition (because the innate immune system in mammalian cells can distinguish native from foreign mRNA patterns). In practice, bad codon choices can reduce protein expression by 100x. This is why every mRNA vaccine, every recombinant protein therapeutic, and every gene therapy vector goes through codon optimization.
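The size of this search space follows directly from the genetic code: the number of DNA sequences encoding a protein is the product of the codon counts of its residues. The sketch below works in log10 to avoid enormous integers. The standard genetic code table is built in; the function name log10_encodings and the example peptide are illustrative assumptions.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each amino acid (6 for L, 1 for M and W, ...).
DEGENERACY = {}
for aa in CODON_TABLE.values():
    DEGENERACY[aa] = DEGENERACY.get(aa, 0) + 1


def log10_encodings(protein: str) -> float:
    """log10 of the number of distinct coding sequences for a protein."""
    return sum(math.log10(DEGENERACY[aa]) for aa in protein)


peptide = "MKLVLSAG"  # made-up example
print(f"{peptide}: about 10^{log10_encodings(peptide):.1f} encodings")
```

The exact exponent for a given protein depends on its amino acid composition, since leucine-rich sequences have far more encodings than methionine-rich ones of the same length.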

The Traditional Approach and Why It Is Limited

The scale of the codon optimization problem. For a typical mRNA, there are over 10^600 possible codon sequences encoding the same protein. The challenge is finding the arrangement that maximizes expression. Figure from Zhang et al. (LinearDesign), Nature 2023, CC-BY 4.0.

The classical method is simple: measure which codons appear most frequently in highly-expressed genes of the target organism, then replace every codon with the most frequent synonym. This is codified as the Codon Adaptation Index (CAI), a per-sequence score that measures how closely the codon usage matches the organism's preferred distribution.

CAI-based optimization works, but it is crude. It treats each codon position independently, ignoring the sequence context. It produces repetitive sequences (the same "optimal" codon used everywhere for a given amino acid), which can cause ribosome stalling and mRNA secondary structure problems. And it misses complex dependencies: the optimal codon at position 50 might depend on what codons are at positions 48 and 52, which a frequency table cannot capture.
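The classical method can be sketched as follows: derive per-codon weights from a reference set, score a sequence by the geometric mean of those weights, and "optimize" by substituting the top-weighted synonym everywhere. This is an illustrative sketch, not the evaluation code used in the post; the 0.5 pseudocount for unseen codons is one common convention, and the toy reference sequence is made up.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(cds):
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_cds):
    """w(codon) = count(codon) / count(most used synonym), per amino acid."""
    counts = Counter(c for cds in reference_cds for c in split_codons(cds))
    weights = {}
    for codon, aa in CODON_TABLE.items():
        synonyms = [c for c, a in CODON_TABLE.items() if a == aa]
        best = max(counts[s] for s in synonyms)
        # Skip stops, single-codon amino acids, and amino acids absent from the reference.
        if aa != "*" and len(synonyms) > 1 and best > 0:
            weights[codon] = max(counts[codon], 0.5) / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the scored codons."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


def greedy_optimize(cds, weights):
    """Replace every codon with the top-weighted synonym for its amino acid."""
    best = {}
    for codon, w in weights.items():
        aa = CODON_TABLE[codon]
        if aa not in best or w > weights[best[aa]]:
            best[aa] = codon
    return "".join(best.get(CODON_TABLE[c], c) for c in split_codons(cds))


reference = ["CTGCTGCTGCTTGCTGCTGCA"]  # toy stand-in for highly expressed genes
w = relative_adaptiveness(reference)
print(greedy_optimize("CTTGCACTCGCC", w))
```

Note how the greedy output uses one codon per amino acid at every position, which is exactly the repetitiveness and context-blindness criticized above.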

Our Approach: Masked Language Modeling

We reframe codon optimization as a language modeling problem. Instead of looking up frequencies in a table, we train a transformer on hundreds of thousands of natural coding sequences using masked language modeling (MLM), the same pre-training objective used by BERT, RoBERTa, and Meta's ESM protein models. The model sees a codon sequence with 15% of positions masked and learns to predict the missing codons from context.

What the model learns, implicitly, is the grammar of codon usage: which codon patterns appear in nature, which codons tend to co-occur, and how preferences shift depending on the surrounding sequence context. This is fundamentally richer than a frequency table because the model captures long-range dependencies across the entire coding sequence.

CodonRoBERTa: Our Best Model

After our architecture exploration (see above), CodonRoBERTa-large-v2 emerged as the winner:

# configs/mrna/production/roberta_large_v2.yaml
model_type: roberta
hidden_size: 1024
num_hidden_layers: 24
num_attention_heads: 16
intermediate_size: 4096
vocab_size: 69
max_position_embeddings: 8192
learning_rate: 5e-5  # Critical: lower than v1
warmup_steps: 2000  # Critical: longer warmup
max_steps: 25000

python scripts/training/run_mlm_train.py \
  --config configs/mrna/roberta_large_v2.yaml \
  --train_file data/mrna/processed/train_250k.fasta \
  --output_dir outputs/models/CodonRoBERTa-large-v2
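As a sanity check on the config, the parameter count can be estimated by hand. The sketch below is a back-of-the-envelope count assuming standard RoBERTa layer shapes (a single token-type embedding, tied input and output embeddings, a dense plus LayerNorm MLM head). The base-model dimensions (768 hidden, 12 layers, 3072 intermediate) are the usual RoBERTa-base values and are an assumption here, since the post only gives the large config.

```python
def roberta_mlm_params(hidden, layers, intermediate, vocab=69, max_pos=8192):
    """Approximate parameter count of a RoBERTa masked-LM with tied embeddings."""
    # Word + position + token-type (size 1) embeddings, plus embedding LayerNorm.
    embeddings = vocab * hidden + max_pos * hidden + hidden + 2 * hidden
    # Q, K, V and output projections, each with weights and bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections with biases.
    ffn = 2 * hidden * intermediate + intermediate + hidden
    # Two LayerNorms per layer (weight and bias each).
    layer_norms = 2 * 2 * hidden
    encoder = layers * (attention + ffn + layer_norms)
    # MLM head: dense + LayerNorm + output bias (output weights are tied).
    lm_head = hidden * hidden + hidden + 2 * hidden + vocab
    return embeddings + encoder + lm_head


large = roberta_mlm_params(1024, 24, 4096)
base = roberta_mlm_params(768, 12, 3072)  # assumed base dimensions
print(f"large ~ {large / 1e6:.0f}M, base ~ {base / 1e6:.0f}M, ratio {large / base:.1f}x")
```

Under these assumptions the estimate comes out near 312M parameters for the large config and 92M for the base, which is consistent with the 92M figure and the 3.4x ratio quoted in the text. With a 69-token vocabulary, almost all of the weight sits in the encoder layers and position embeddings rather than the token embeddings.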

Evaluation: Three Metrics That Matter

Evaluating a codon language model is not straightforward. As we learned from the v1/v2 comparison above, a model can have excellent perplexity (accurately predicting masked codons) while having poor biological alignment (predicting codons that nature does not actually prefer). We evaluate on three complementary axes:

  1. Perplexity measures how well the model predicts masked codons, computed as the exponentiated cross-entropy loss. A perplexity of 4.10 means the model is, on average, choosing between ~4 equally likely codons at each masked position. Given that most amino acids have 2-6 synonymous codons, this indicates the model has learned meaningful preferences rather than guessing uniformly. Lower is better. CodonRoBERTa-large-v2: 4.10.
  2. CAI Correlation (Spearman) measures whether a model's predicted codon likelihoods align with known biological codon usage preferences. We compute the Codon Adaptation Index for each test sequence, then correlate it with the model's pseudo-log-likelihood score. A positive correlation means the model assigns higher probability to sequences that biology actually uses. This is the metric that matters most for practical codon optimization, because it directly measures whether the model has learned biologically relevant patterns vs. just statistical ones. CodonRoBERTa-large-v2: 0.404 (p < 10^-20).
  3. Synonymous Recovery asks: when the model predicts a codon for a masked position, does it at least get the amino acid right? Even if it picks the wrong synonym (e.g., CTT instead of CTC for leucine), predicting the correct amino acid shows the model understands the protein-level constraint. CodonRoBERTa-large-v2: 12.1% top-1 synonymous.
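Two of these metrics reduce to short formulas. The sketch below shows perplexity as the exponentiated mean negative log-likelihood, and synonymous recovery as described in the prose (top-1 codon encodes the true amino acid). Both functions are assumed re-implementations for illustration; how the project's own scripts count exact matches versus synonyms is not stated in the post.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def perplexity(true_codon_probs):
    """exp(mean negative log-likelihood) of the true codon at each masked position."""
    nll = [-math.log(p) for p in true_codon_probs]
    return math.exp(sum(nll) / len(nll))


def synonymous_recovery(predicted, actual):
    """Fraction of masked positions whose top-1 codon encodes the true amino acid."""
    hits = sum(CODON_TABLE[p] == CODON_TABLE[a] for p, a in zip(predicted, actual))
    return hits / len(actual)


# A model that always gives the true codon probability 0.25 is effectively
# choosing among four equally likely codons.
print(perplexity([0.25] * 10))
# CTT/CTC are both leucine, GCT/GCA both alanine, ATG (Met) vs TGG (Trp) differ.
print(synonymous_recovery(["CTT", "GCT", "ATG"], ["CTC", "GCA", "TGG"]))
```

The first example mirrors the reading given above: a perplexity near 4 corresponds to roughly four equally plausible codons per masked position.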

Running the Evaluations

# Perplexity
python scripts/evals/advanced/eval_perplexity.py \
  --model outputs/models/CodonRoBERTa-large-v2/final \
  --test_file data/mrna/processed/test_6k.fasta \
  --output outputs/eval_results/CodonRoBERTa-large-v2/perplexity.json

# CAI Correlation
python scripts/evals/advanced/eval_cai_correlation.py \
  --model outputs/models/CodonRoBERTa-large-v2/final \
  --test_file data/mrna/processed/test_6k.fasta \
  --output outputs/eval_results/CodonRoBERTa-large-v2/cai_correlation.json

# Synonymous Recovery
python scripts/evals/advanced/eval_synonymous_recovery.py \
  --model outputs/models/CodonRoBERTa-large-v2/final \
  --test_file data/mrna/processed/test_6k.fasta \
  --output outputs/eval_results/CodonRoBERTa-large-v2/synonymous_recovery.json

The Final Leaderboard

Putting it all together across our model variants:

[Table: final leaderboard. Rows: CodonRoBERTa-large-v2; CodonRoBERTa-base (Limited compute); CodonRoBERTa-large; ModernBERT-base; CodonBERT (baseline).]

The RoBERTa family dominates across the board. For production use, CodonRoBERTa-large-v2 is the clear choice: it has the strongest biological alignment (CAI 0.404) while maintaining competitive perplexity. For teams with limited compute, CodonRoBERTa-base delivers nearly the same perplexity at 3.4x fewer parameters. ModernBERT underperformed substantially, which we attribute to its NLP-pretrained weights interfering with codon pattern learning.

Using the Model

from transformers import RobertaForMaskedLM
import torch

# Load model (available soon on Hugging Face)
model = RobertaForMaskedLM.from_pretrained("OpenMed/CodonRoBERTa-large-v2")
tokenizer = CodonTokenizer()  # Our custom 69-token vocabulary

# Score a sequence
sequence = "ATG GCT AAA GGT..."  # Space-separated codons
inputs = tokenizer(sequence, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Pseudo-log-likelihood gives a "natura
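The snippet above is cut off in this copy. Independently of the author's code, the general idea of a pseudo-log-likelihood for a masked language model can be sketched as follows: mask one position at a time and sum the log-probability the model gives to the true codon. The scorer interface (masked_log_prob) and the uniform stand-in model are hypothetical, chosen so that the sketch runs without any model weights.

```python
import math


def pseudo_log_likelihood(codons, masked_log_prob):
    """Sum over positions of log P(codon_i | all other codons), one mask at a time."""
    total = 0.0
    for i, codon in enumerate(codons):
        context = codons[:i] + ["[MASK]"] + codons[i + 1:]
        total += masked_log_prob(context, i, codon)
    return total


def uniform_model(context, position, codon):
    """Stand-in scorer: every one of the 64 codons is equally likely."""
    return -math.log(64)


sequence = ["ATG", "GCT", "AAA", "GGT"]
print(pseudo_log_likelihood(sequence, uniform_model))
```

With a trained model in place of uniform_model, a higher (less negative) score would indicate a sequence that looks more like the natural coding sequences seen in training, and dividing by the sequence length makes scores comparable across genes of different sizes.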
