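How to Accelerate Protein Structure Prediction at Proteome Scale
The NVIDIA Developer Blog explains how to speed up protein structure prediction at proteome scale, starting from the premise that proteins function as complexes rather than in isolation.
Key points
The importance of protein complex prediction
The article points out that many biological processes are governed not by single proteins but by complexes of proteins interacting with one another, and argues that predicting the structures of these complexes is important.
An approach to acceleration at proteome scale
The core of the article is the technical approach to accelerating structure prediction for these complexes, not just for individual proteins but at the scale of the proteome (the full set of proteins an organism has).
Application of the NVIDIA technology platform
Given that the source is the NVIDIA Developer Blog, the speedup is expected to rely on GPUs and related AI and high-performance computing technologies, against the backdrop of providing a practical compute platform.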
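Impact analysis
This article suggests that AI-driven structural biology is moving from the success of groundbreaking single-protein prediction (such as AlphaFold2) to the next stage: understanding and predicting more realistic biological systems (proteome-scale complexes). The sharp rise in compute demand creates significant business and research opportunities where high-performance computing and AI converge.
Editorial comment
A solid technical explainer that shows, concretely and from the compute-infrastructure angle, where structural-biology AI is heading after AlphaFold2. It carries a strong message that the next breakthrough lies in scalability.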
Proteins rarely function in isolation as individual monomers. Most biological processes are governed by proteins interacting with other proteins, forming protein complexes whose structures are described in the hierarchy of protein structure as the quaternary representation.
This is one level of complexity above the tertiary representation, the 3D structure of a monomer, which has become widely familiar since the creation of the Protein Data Bank and the emergence of AlphaFold2.
Structural information for the vast majority of complexes remains unavailable. While the AlphaFold Protein Structure Database (AFDB), jointly developed by Google DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI), transformed access to monomeric protein structures, interaction-aware structural biology at the proteome scale has remained a bottleneck with unique challenges:
Massive combinatorial interaction space
High computational cost for multiple sequence alignment (MSA) generation and protein folding
Inference scaling across millions of complexes
Confidence calibration and benchmarking
Dataset consistency and biological interpretability
In recent work, we extended the AFDB with large-scale predictions of homomeric protein complexes generated by a high-throughput pipeline based on AlphaFold-Multimer—made possible by NVIDIA accelerated computing. Additionally, we predicted heteromeric complexes to compare the accuracy of different complex prediction modalities.
In particular, for the predictions of these datasets, we leveraged kernel-level accelerations from MMseqs2-GPU for MSA generation, and NVIDIA TensorRT and NVIDIA cuEquivariance for deep-learning-based protein folding. We then mapped the workload to HPC-scale inference by maximizing the utilization of all available GPUs, including scale-out to multiple clusters.
This blog describes the major principles we adopted to increase protein folding throughput, from adopting libraries and SDKs to optimizations to reduce the computational complexity of the workload. These principles can help you set up a similar pipeline yourself by borrowing from the techniques we used to create this new dataset.
So, if you are a:
Computational biologist scaling structure prediction pipelines
AI researcher training generative protein models
HPC engineer optimizing GPU workloads
Bioinformatics team building structural resources
You will learn how to:
Design a proteome-scale complex prediction strategy
Separate MSA generation from structure inference for efficiency
Scale AlphaFold-Multimer workflows across GPU clusters
Prerequisites
Technical knowledge
Python and shell scripting
SLURM as HPC workload scheduler
Basic structural biology
Familiarity with AlphaFold/ColabFold/OpenFold or similar pipelines
Infrastructure
We describe scaling on a multi-GPU and multi-node NVIDIA DGX H100 Superpod cluster
This cluster includes high-speed storage to store MSAs and intermediate outputs
Software
Access to MMseqs2-GPU
Familiarity with TensorRT
If not using a model with integrated cuEquivariance, knowledge about triangular attention and multiplication operations
Procedure/Steps
- Define the dataset you’d like to compute
Begin by defining the scope of prediction. Because predicting protein complexes can become a combinatorial problem, it’s useful to understand what may be most interesting. In some cases, if your proteomes are small enough, an all-against-all (dimeric) complex prediction might be tractable; however, this could change if you want to predict large datasets of proteomes. Here’s how we decided to go about it:
Homomeric complexes: We selected all proteomes represented in the AFDB and sorted them by perceived importance (e.g., proteomes of human concern or commonly accessed). This allowed us to rank proteomes for computation in a particular order, making execution more manageable.
Heteromeric complexes: This is where things can get complicated, fast. For our heteromeric runs, we decided to focus on complexes originating from several reference proteomes and proteomes included in the WHO list of important proteomes. As there’s an intractable number of combinations of complexes that can be derived from these proteomes, for our runs, we focused on dimers (complexes of two proteins), within the same proteome (no inter-proteome complexes) that had “physical” interaction evidence in STRING. As we sought coverage, we decided to consider all interactions reported in STRING for these proteomes, rather than further filtering. Evidence in the literature suggests that filtering for STRING scores >700 can further reduce the number of inputs while increasing the likelihood of well-predicted complexes.
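The heterodimer selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Interaction` record, `taxon_of`, and `select_dimers` are hypothetical names, and the rows are assumed to have been parsed already from a STRING physical-interaction table in which protein IDs carry the taxon as a prefix (for example `9606.ENSP00000000233`).

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Interaction(NamedTuple):
    """One STRING-style interaction row (hypothetical container)."""
    protein_a: str  # "<taxon>.<protein>" identifier
    protein_b: str
    score: int      # combined score on STRING's 0-1000 scale


def taxon_of(string_id: str) -> str:
    """Return the taxon prefix of a STRING-style identifier."""
    return string_id.split(".", 1)[0]


def select_dimers(
    interactions: Iterable[Interaction],
    score_above: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Keep unique heterodimer pairs that stay within one proteome.

    score_above=None keeps every reported interaction (the coverage-first
    choice described in the text); score_above=700 applies the stricter
    filter suggested by the literature.
    """
    seen = set()
    selected = []
    for row in interactions:
        if row.protein_a == row.protein_b:
            continue  # homomers are handled by the separate homomeric run
        if taxon_of(row.protein_a) != taxon_of(row.protein_b):
            continue  # no inter-proteome complexes
        if score_above is not None and row.score <= score_above:
            continue
        pair = tuple(sorted((row.protein_a, row.protein_b)))
        if pair not in seen:  # A-B and B-A describe the same dimer
            seen.add(pair)
            selected.append(pair)
    return selected
```

Calling `select_dimers(rows)` reproduces the "all reported interactions" choice, while `select_dimers(rows, score_above=700)` trades coverage for a smaller, likelier-to-succeed input set.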
- Decoupling MSA generation from structure prediction
MSA generation and structure inference are both compute-intensive but scale differently, as we recently presented in a white paper. We thus approached these computations as separate steps and implemented separate SLURM pipelines. For optimal use of a node, we set up MSA generation and structure prediction as described below.
MSA generation
We generated MSAs using colabfold_search with the MMseqs2-GPU backend. While MMseqs2-GPU scales across GPUs on a node natively, we chose to spawn one MMseqs2-GPU server process per GPU on a node for easier process management. In colabfold_search, the GPUs are used only for the ungappedfilter stages and not for the subsequent alignment stages (which are multithreaded CPU processes). Therefore, by monitoring the colabfold_search output, we can stack colabfold_search calls and start the next one as soon as the previous one no longer uses the GPU, which reduces GPU idle time. Although this approach oversubscribes CPU resources, in practice we found that on a DGX H100 node, three staggered colabfold_search processes can increase overall throughput by up to 25%, at the expense of slower processing of individual input chunks.
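The staggering idea can be modeled in plain Python. The sketch below is an abstraction under stated assumptions, not the actual launcher: in the real pipeline both stages live inside a single colabfold_search process and the hand-off is detected by watching its output, whereas here `gpu_stage` and `cpu_stage` are hypothetical callables standing in for the GPU prefilter and the CPU alignment work on one GPU.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

Chunk = TypeVar("Chunk")
Hits = TypeVar("Hits")
Result = TypeVar("Result")


def run_staggered(
    chunks: Iterable[Chunk],
    gpu_stage: Callable[[Chunk], Hits],
    cpu_stage: Callable[[Hits], Result],
    max_in_flight: int = 3,
) -> List[Result]:
    """Process chunks so the GPU stage is exclusive but CPU stages overlap.

    At most `max_in_flight` chunks are active at once. Only one of them may
    hold the GPU; as soon as it moves on to its CPU stage, the next chunk
    can start its GPU stage.
    """
    gpu_lock = threading.Lock()  # models a single GPU

    def process(chunk: Chunk) -> Result:
        with gpu_lock:               # GPU-bound part, one chunk at a time
            hits = gpu_stage(chunk)
        return cpu_stage(hits)       # CPU-bound part, may overlap with others

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # pool.map returns results in input order
        return list(pool.map(process, chunks))
```

With `max_in_flight=3` this mirrors the three staggered processes mentioned above: CPU resources are oversubscribed on purpose so that the GPU rarely sits idle.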
On determining reasonable input chunk sizes, there are two factors to consider. Smaller chunk sizes result in more chunks, which means more per-process overheads, such as database loading, which can take a couple of minutes each, even on fast storage. (Pre-staging the databases on the fastest storage available, such as the on-node SSD, helps with throughput as well.) On the other hand, larger chunks take more time to finish. On a SLURM cluster with a job time limit, this results in more unfinished chunks. The sweet spot will depend on the cluster configuration, but for our DGX H100 node with a 4-hour wall time limit, the chunk size of 300 sequences seemed to work well with the staggering colabfold_search approach.
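To make the chunk-size trade-off concrete, here is a minimal sketch that splits input records into chunks of 300 and estimates the cumulative database-loading overhead. The helper names are hypothetical, and `load_minutes` is a placeholder you would measure on your own cluster; the text only says loading takes "a couple of minutes" per process.

```python
import math
from typing import Iterator, List, Sequence, Tuple

Record = Tuple[str, str]  # (header, sequence)


def chunk_records(
    records: Sequence[Record], chunk_size: int = 300
) -> Iterator[List[Record]]:
    """Yield consecutive chunks of at most `chunk_size` records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(records), chunk_size):
        yield list(records[start:start + chunk_size])


def to_fasta(chunk: Sequence[Record]) -> str:
    """Render one chunk as FASTA text, ready to be written to a file."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in chunk)


def database_load_overhead(
    n_sequences: int, chunk_size: int, load_minutes: float
) -> float:
    """Total minutes spent loading databases, one load per chunk."""
    return math.ceil(n_sequences / chunk_size) * load_minutes
```

Halving the chunk size doubles the loading overhead, while doubling it raises the risk that a chunk is still running when the job hits its wall time limit, which is the tension described above.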
Structure prediction
To increase structure prediction throughput, we leveraged optimizations in data handling for JAX-based folding through ColabFold, as well as accelerated tooling developed at NVIDIA, including TensorRT and cuEquivariance, for OpenFold-based folding.
Deep learning inference parameters
First, we selected inference parameters that struck a good balance between accuracy and speed. The setup for all deep learning inference pipelines (ColabFold and OpenFold) thus used:
Weights: 1x weights from AlphaFold Multimer (model_1_multimer_v3)
Four recycles (with early stopping)
No relaxation
MSAs: frozen MSAs generated through ColabFold-search (using MMseqs2-GPU), as described above
Accuracy validation
Homodimer PDB set (125 proteins)
| Model | High (DockQ >0.8) | Medium (DockQ >0.6) | Accept (DockQ >0.3) | Incorr (DockQ >0) | Usable | DockQ |
|---|---|---|---|---|---|---|
| ColabFold | 52 | 37 | 12 | 21 | 89 (72.95%) | 0.637 |
| OpenFold with TensorRT and cuEquivariance | 53 | 39 | 10 | 20 | 92 (75.41%) | 0.647 |
Table 1. A comparison of interface accuracy between ColabFold and OpenFold (accelerated by TensorRT and cuEquivariance) across a benchmark set of 125 homodimer proteins.
As we used different inference pipelines, we performed accuracy validation using a curated benchmark set of 125 X-ray resolved PDB homodimers released after AlphaFold2 was introduced, thus minimizing the potential for information leakage. Predicted complexes for each deep learning implementation were compared against experimental reference structures using DockQ, which evaluates interface accuracy via the fraction of native contacts (Fnat), fraction of non-native contacts (Fnonnat), interface RMSD (iRMS), and ligand RMSD after receptor alignment (LRMS), and assigns standard CAPRI classifications of high, medium, acceptable, or incorrect.
Across the PDB homodimer benchmark, OpenFold accelerated through TensorRT and cuEquivariance reproduces ColabFold interface accuracy, achieving a similar fraction of “high” scoring predictions and comparable mean DockQ scores. This indicates that the accelerated implementations preserve interface-level structural accuracy relative to the ColabFold baseline.
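As a quick consistency check on Table 1, the sketch below recomputes the "Usable" column. It assumes that "Usable" means High plus Medium and that the percentage is taken over the four class counts; both are inferences from the numbers, not statements from the authors. Under that reading each row sums to 122 predictions rather than 125, and the source does not explain the difference.

```python
from typing import Dict, Tuple

# Class counts copied from Table 1
TABLE_1: Dict[str, Dict[str, int]] = {
    "ColabFold": {
        "high": 52, "medium": 37, "acceptable": 12, "incorrect": 21,
    },
    "OpenFold with TensorRT and cuEquivariance": {
        "high": 53, "medium": 39, "acceptable": 10, "incorrect": 20,
    },
}


def usable_summary(counts: Dict[str, int]) -> Tuple[int, int, float]:
    """Return (usable, total, usable percentage) for one row.

    Assumption: usable = high + medium, total = sum of all four classes.
    """
    usable = counts["high"] + counts["medium"]
    total = sum(counts.values())
    return usable, total, round(100.0 * usable / total, 2)


for model, counts in TABLE_1.items():
    usable, total, percent = usable_summary(counts)
    print(f"{model}: {usable}/{total} usable ({percent}%)")
```

The recomputed values match the table (89 and 92 usable predictions, 72.95% and 75.41%), which supports reading the two pipelines as equivalent at the interface level.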
MSA preparation and sequence packing
For ColabFold-based homodimer inferences, higher throughput can be achieved by packing homodimers of equal length into a batch for processing, sorted by their MSA depth in descending order. This reduces the number of JAX recompilations, thereby increasing end-to-end throughput. This trick, however, does not work when processing heterodimers, because the lengths of the individual chains differ.
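A minimal sketch of this packing step, assuming each query is described by its chain length and MSA depth. The `Query` record and `pack_homodimers` are hypothetical names used for illustration; the actual batching code in the pipeline may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Query(NamedTuple):
    """One homodimer to fold (hypothetical container)."""
    name: str
    length: int     # residues per chain
    msa_depth: int  # number of sequences in the MSA


def pack_homodimers(queries: Iterable[Query]) -> List[List[Query]]:
    """Group homodimers of equal length, deepest MSA first in each batch."""
    by_length: Dict[int, List[Query]] = defaultdict(list)
    for query in queries:
        by_length[query.length].append(query)

    batches = []
    for length in sorted(by_length):
        batch = sorted(
            by_length[length], key=lambda q: q.msa_depth, reverse=True
        )
        batches.append(batch)
    return batches
```

Every entry in a batch then shares the same chain length, so consecutive predictions present the compiled JAX function with inputs of matching sequence dimension.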
For OpenFold, whether for homodimers or heterodimers, this packing strategy is not needed, as the method doesn’t require re-compilation. However, given a dependency between sequence length and execution time, reserving longer sequences for individual jobs may be beneficial if operating with specific SLURM runtimes. To further optimize the process, input featurizations (CPU-bound) were performed for the next input query alongside the inference step for the current query (GPU-bound).
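The overlap of CPU-bound featurization with GPU-bound inference can be sketched as a one-step prefetch. This is an illustration under assumptions: `featurize` and `infer` are hypothetical callables, not OpenFold APIs, and a thread is only sufficient if featurization releases the GIL or inference mostly waits on the GPU; otherwise a process pool would be the safer choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

Q = TypeVar("Q")
F = TypeVar("F")
R = TypeVar("R")


def run_pipelined(
    queries: Iterable[Q],
    featurize: Callable[[Q], F],
    infer: Callable[[F], R],
) -> List[R]:
    """Featurize query i+1 in the background while inferring query i."""
    queue = list(queries)
    results: List[R] = []
    if not queue:
        return results

    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(featurize, queue[0])
        for next_query in queue[1:]:
            features = pending.result()                   # wait for current features
            pending = pool.submit(featurize, next_query)  # start the next featurization
            results.append(infer(features))               # runs while it is prepared
        results.append(infer(pending.result()))           # last query, nothing to prefetch
    return results
```

Only one set of features is prepared ahead of time, which keeps host memory bounded even for long sequences.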
Additionally, OpenFold’s throughput was enhanced through the integration of the NVIDIA cuEquivariance library and NVIDIA TensorRT SDK. These modular libraries and SDKs can be leveraged to accelerate operations common in protein structure AI and general inference AI workloads, respectively. We previously described how TensorRT can be leveraged to accelerate OpenFold inference.
- Optimize GPU utilization with SLURM
As alluded to in the previous section, depending on the available hardware, you can increase throughput by “packing” GPUs and nodes. SLURM is a great orchestrator, and we split the inference workflows into SLURM scripts designed to:
Pack multiple predictions per node
Match GPU memory to sequence length
Reduce idle time between jobs
Separate short vs long sequence queues
Our workload was mapped to an H100 DGX Superpod HPC system. We could thus deploy inference across NVIDIA H100 GPUs on multi-node clusters, leveraging exclusive execution on a single node and packing each GPU with as many processes as it took to saturate GPU utilization, for both MSA processing and deep learning inference.
Helpful tips:
Group jobs by total residue length
Monitor GPU memory fragmentation
Use asynchronous I/O to avoid disk bottlenecks
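Grouping by total residue length and separating short from long queues can be sketched as follows. The function names and the `long_cutoff` parameter are hypothetical; the source does not give a cutoff, so the right value depends on your GPU memory and SLURM time limits.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Job = Tuple[str, Sequence[str]]  # (job name, chain sequences of the complex)


def total_residues(chains: Sequence[str]) -> int:
    """Total residue count across all chains of one complex."""
    return sum(len(chain) for chain in chains)


def split_queues(jobs: Iterable[Job], long_cutoff: int) -> Dict[str, List[str]]:
    """Assign job names to a short or long queue, ordered by total length.

    `long_cutoff` is an assumed, cluster-specific threshold in residues.
    """
    queues: Dict[str, List[str]] = {"short": [], "long": []}
    for name, chains in sorted(jobs, key=lambda job: total_residues(job[1])):
        key = "long" if total_residues(chains) > long_cutoff else "short"
        queues[key].append(name)
    return queues
```

Each queue can then be submitted as its own SLURM job array with a time limit and per-GPU packing factor suited to that length range.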
- Making quality predictions accessible to the world
In partnership with EMBL-EBI, the Steineggerlab at Seoul National University, and Google DeepMind, we explored complex structure prediction analysis. We highlight that predicting these biological systems remains challenging. In protein monomer prediction, the predicted Local Distance Difference Test (pLDDT) score can inform overall prediction quality, yielding a balanced amount of plausible predictions; for complexes, assessing interface plausibility is much harder. This is because assessing complexes involves global and per-chain confidence metrics, as well as local confidence metrics at the interface.
Simply put, is the interface between two monomers plausible, and is it predicted in the right pocket? These questions are much harder to answer than more “local” questions about monomer likelihood, given the very limited data available. Therefore, we make available a set of high-confidence structures through the AlphaFold Database, thereby enabling, for the first time, exploration of protein complexes. We intend to refine our approach further and expand the universe of available protein complexes in the AlphaFold Database.
Getting started
Proteome-scale quaternary structure prediction requires more than just running AlphaFold-Multimer at scale. Success depends on:
Evidence-driven interaction selection
Decoupled and optimized compute workflows
GPU-aware job orchestration
Confidence calibration and validation
Dataset health monitoring
By combining STRING-guided selection, MMseqs2-GPU acceleration, and NVIDIA H100-powered multimer inference, this work extends AFDB into a unified, interaction-aware structural resource.
This infrastructure enables:
Variant interpretation at interfaces
Systems-level structural biology
Drug target validation
Generative protein design benchmarking
Resources
Read more about the project here: https://research.nvidia.com/labs/dbr/assets/data/manuscripts/afdb.pdf
Accelerated libraries and SDKs are available here:
MMseqs2-GPU
NVIDIA cuEquivariance
NVIDIA TensorRT
If you wish to deploy MSA search and protein folding easily, you can get accelerated inference pipelines through NVIDIA’s Inference Microservices (NIMs):
MSA Search NIM
OpenFold2 NIM
The predictions from this effort are available through https://alphafold.com