Metaの新AIモデルが画像・音声・音声に対する脳の反応を予測
Metaが開発したAIモデルは、画像・音声・音響に対する人間の脳の反応を予測し、その予測は実際の個人の脳スキャンよりも典型的な脳反応に近い結果を示した。
キーポイント
脳反応予測AIモデルの開発
Metaが、画像・音声・音響に対する人間の脳の反応を予測するAIモデルを構築した。
高い予測精度の実証
テストにおいて、このモデルの予測は、実際の個人の脳スキャン結果よりも、典型的な脳反応により近い一致を示した。
マルチモーダル刺激への対応
モデルは視覚(画像)と聴覚(音声・音響)という複数の感覚モダリティに対する脳の反応を予測対象としている。
影響分析・編集コメントを表示
影響分析
この技術は、神経科学とAIの融合を加速させ、脳活動の理解や脳-コンピュータインターフェースの開発に寄与する可能性がある。また、広告効果測定やコンテンツ最適化など、実用的な応用への道も開く。
編集コメント
個人の脳スキャンよりも「典型的な」反応をより正確に予測できる点が興味深く、脳活動の一般化モデル構築に向けた重要な一歩と言える。実用化への道筋と倫理的課題が今後の焦点となる。

Metaは、画像・音声・発話に対する人間の脳の反応を予測するAIモデルを構築しました。テストにおいて、その予測は、実際の個人の脳スキャン結果よりも、典型的な脳の反応により近い一致を示しました。
本記事「Metaの新AIモデルは、画像・音声・発話に対する脳の反応を予測」は、The Decoderに最初に掲載されました。
原文を表示
A new AI model from Meta predicts how the human brain reacts to images, sounds, and speech. In tests, it often matched the typical brain response better than any single person's scan.
Brain research requires new recordings for every new experiment, making neuroscience studies slow and expensive. AI researchers at Meta's FAIR lab want to skip this bottleneck entirely with an AI model that predicts brain activity instead of measuring it.
The model is called TRIBE v2, and it was trained on more than 1,000 hours of fMRI data from 720 subjects, according to the accompanying paper. Functional magnetic resonance imaging (fMRI) measures brain activity indirectly by tracking changes in blood flow and oxygen levels. Using this data, TRIBE v2 aims to predict how a brain responds to any visual, auditory, or language-based stimulus.
TRIBE v2 takes three types of input: video, audio, and text. Each channel runs through a pre-trained Meta AI model first: Llama 3.2 for text, Wav2Vec-Bert-2.0 for audio, and Video-JEPA-2 for video. These models turn raw data into embeddings that capture what's visible in an image, audible in a sound, or readable in a sentence.
A transformer then processes all three representations together, picking up patterns that hold across different stimuli, tasks, and people. A final person-specific layer translates the output into a brain map with 70,000 voxels, the 3D pixels that make up an fMRI scan.
RIBE v2 takes video, audio, and text as input, runs them through three pre-trained AI models, and predicts brain activity. Below: comparison between measured and predicted brain activity for a movie clip. | Image: Meta
Predictions carry less noise than actual brain scans
Individual fMRI images are inherently noisy. Heartbeat, head movement, and device artifacts all distort the signal. To figure out how a brain typically responds to a given stimulus, researchers have to average many scans together.
TRIBE v2 gets around this by directly predicting an adjusted average response. In testing, this prediction correlated more strongly with the actual group average than most individual subjects' scans did. The effect was strongest in the Human Connectome Project dataset, which was captured with a 7 Tesla scanner, offering much higher signal quality than the standard 3 Tesla machines. On this dataset, TRIBE v2 hit a correlation with the group response twice as high as the median individual subject.
TRIBE v2 generalizes to new subjects without retraining. Green bars show the model's prediction accuracy; gray dots show how well individual brain scans correlate with the group average. Brief fine-tuning (bottom) further improves accuracy. | Image: Meta
Compared to optimized linear models—the previous go-to method for this kind of prediction—TRIBE v2 showed significant improvements across every dataset, according to the paper. The earlier version, TRIBE v1, was trained on just four subjects and predicted only 1,000 voxels instead of 70,000, yet it still won the Algonauts 2025 competition, beating out 263 other teams.
TRIBE v2's prediction accuracy scales steadily with the amount of training data and hasn't plateaued yet. That suggests the model will keep improving as fMRI databases grow, a pattern that echoes the scaling laws of large language models, where more data reliably leads to better performance.
TRIBE v2 predicts brain activity across cortical and subcortical regions. Prediction quality varies by stimulus type and clearly beats the linear baseline. Accuracy scales steadily with more training data (bottom right). | Image: Meta
Decades of lab work replicated on a computer
The researchers tested TRIBE v2 with everyday stimuli like movies and podcasts, where multiple sensory inputs hit the brain at once, as well as with isolated stimuli typical of classical neuroscience. In those controlled setups, a single image might flash on screen for one second to measure a specific brain region's response. The team used test protocols from the Individual Brain Charting dataset, a collection of well-established neuroscience experiments, and had the model predict which brain areas should light up.
In visual experiments, TRIBE v2 identifies known specialized brain regions for faces, places, bodies, and characters. Top: images shown. Middle: model prediction. Bottom: actual measured brain activity. | Image: Meta
In visual experiments with images of faces, places, bodies, and characters, TRIBE v2 correctly pinpointed the known specialized brain regions every time. In language experiments, it localized the language network, distinguished between emotional and physical pain processing, and showed the expected stronger left-hemisphere activation for complete sentences versus word lists.
These results line up with findings from decades of empirical research on real subjects. For neuroscience, the implication seems clear: future experiments could be roughed out on a computer before anyone books expensive lab time.
In language experiments, TRIBE v2 replicates classic neurolinguistic findings like the distinction between speech and silence, emotional and physical pain, or sentences and word lists. Predicted activation patterns match the measured data. | Image: Meta
Mapping which sensory channel activates which brain region
By selectively turning off individual input channels, TRIBE v2 reveals how much each sense drives activity in specific brain regions. The results match existing neuroscience: audio best predicts activity near the auditory cortex, video maps to the visual cortex, and text lights up language areas and parts of the frontal lobe.
In regions where the brain combines input from multiple senses, feeding all three channels delivers the biggest gains. At the junction of the temporal, parietal, and occipital lobes, prediction accuracy jumps by up to 50 percent compared to any single channel alone.
A statistical breakdown of the model's final layer also turned up five patterns that map onto known functional brain networks: the primary auditory cortex, the language network, motion recognition, the default mode network, and the visual system. The default mode network kicks in during daydreaming and self-reflection, among other things.
A passive observer with a limited toolkit
TRIBE v2's limitations are still significant. fMRI only measures brain activity indirectly through blood flow, with a delay of several seconds. The fast-moving dynamics of neural signals in the millisecond range stay hidden. The model also only covers three sensory channels—smell, touch, and balance are all missing.
More fundamentally, TRIBE v2 treats the brain as a passive receiver of sensory input. It doesn't model how the brain actively makes decisions or drives actions. It also can't capture developmental changes or clinical conditions, which the researchers say remains a priority for future versions.
Meta sees three use cases for the model: planning neuroscience experiments, building more brain-like AI architectures, and eventually diagnosing brain diseases. Code, model weights, and an interactive demo are all publicly available.
FAIR, Meta's AI research lab, has been working at the intersection of brains and AI for years. Last year, the team showed that an AI model could reconstruct typed sentences from non-invasive brain scans alone with up to 80 percent accuracy.
関連記事
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み