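Carnegie Mellon University Presents 156 Papers at NeurIPS 2025
Carnegie Mellon University researchers are presenting 156 papers at NeurIPS 2025, showcasing the latest research in neural information processing and contributing to the advancement of AI technology.
Key Points
Carnegie Mellon University is scheduled to participate in NeurIPS 2025
The source is the university's official machine learning blog (ML@CMU)
The article text is incomplete, and the specific research and presentation content are unknown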
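Impact Analysis
At this point the article is merely an announcement of event participation and lacks information on specific research content or technical advances, so its direct impact is limited. That said, Carnegie Mellon University's participation in NeurIPS itself shows the university's continued engagement with the AI research community.
Editorial Comment
Because the article content is incomplete, no specific technical assessment is possible. Further details will have to wait for future announcements.
Carnegie Mellon University at NeurIPS 2025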
CMU researchers are presenting 156 papers at the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025), held from December 2 to December 7 at the San Diego Convention Center. Here is a quick overview of the areas our researchers are working on:

Here are our most frequent collaborator institutions:

Table of Contents
- Oral Papers
- Spotlight Papers
- Poster Papers
- Applications
- Computer Vision
- Data-centric AI
- Deep Learning
- General Machine Learning
- Optimization
- Probabilistic Methods
- Reinforcement Learning
- Social Aspects
- Theory
- Uncategorized
- Tutorials
Oral Papers
Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain
Authors: Trinity Chung (Carnegie Mellon University), Yuchen Shen (Carnegie Mellon University), Nathan Kong (MIT), Aran Nayebi (School of Computer Science, Carnegie Mellon University)
This paper introduces an Encoder–Attender–Decoder (EAD) framework to study task-optimized neural networks for tactile processing using realistic whisker-based simulations. Convolutional recurrent neural networks (ConvRNNs) emerge as the most effective encoders, both for tactile categorization and for producing representations that closely match activity in rodent somatosensory cortex, revealing a linear link between task performance and neural alignment. Notably, self-supervised contrastive ConvRNN models achieve neural fits comparable to supervised training, indicating that label-free learning can capture biologically relevant tactile representations. These findings highlight the importance of recurrent processing for understanding cortical tactile computation and for building robust embodied AI systems.
MaxSup: Overcoming Representation Collapse in Label Smoothing
Authors: Yuxuan Zhou (CISPA Helmholtz Center for Information Security), Heng Li (Carnegie Mellon University), Zhi-Qi Cheng (University of Washington), Xudong Yan (City University of Macao), Yifei Dong (Carnegie Mellon University), Mario Fritz (CISPA Helmholtz Center for Information Security), Margret Keuper (University of Mannheim)
Label Smoothing is commonly used to reduce overconfidence and improve generalization, but it can paradoxically increase confidence in misclassified samples and collapse feature representations. This work analytically decomposes the LS loss, revealing an error-amplification term that strengthens incorrect predictions and drives representation collapse. To overcome this, the authors propose Max Suppression (MaxSup), which regularizes predictions uniformly by penalizing the top-1 logit instead of the ground-truth logit. Experiments show that MaxSup preserves intra-class diversity, improves class separation, and consistently outperforms LS across large-scale classification and downstream tasks.
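To make the contrast concrete, here is a minimal sketch in plain Python. The label-smoothing part is a standard algebraic identity: cross-entropy against smoothed targets equals ordinary cross-entropy plus `alpha * (z_gt - mean(z))`. The `maxsup_loss` function is an assumption based only on the description above (the same penalty applied to the top-1 logit rather than the ground-truth logit); the function names and the default `alpha` are hypothetical and not taken from the authors' code.

```python
import math

def log_sum_exp(logits):
    # Numerically stable log(sum(exp(z))).
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy(logits, target):
    return log_sum_exp(logits) - logits[target]

def label_smoothing_loss(logits, target, alpha=0.1):
    # Cross-entropy against the soft target (1 - alpha) * one_hot + alpha / K.
    k = len(logits)
    lse = log_sum_exp(logits)
    loss = 0.0
    for i, z in enumerate(logits):
        soft = (1.0 - alpha) * (1.0 if i == target else 0.0) + alpha / k
        loss += soft * (lse - z)
    return loss

def ls_penalty(logits, target, alpha=0.1):
    # The term label smoothing adds on top of plain cross-entropy.
    return alpha * (logits[target] - sum(logits) / len(logits))

def maxsup_loss(logits, target, alpha=0.1):
    # Assumed form: penalize the largest logit instead of the ground-truth one.
    penalty = alpha * (max(logits) - sum(logits) / len(logits))
    return cross_entropy(logits, target) + penalty
```

When the prediction is correct the two losses coincide, since the top-1 logit is the ground-truth logit. They differ only on misclassified samples, where this version keeps suppressing the wrong, overconfident logit.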
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
Authors: Liwei Jiang (University of Washington), Yuanjun Chai (University of Washington), Margaret Li (University of Washington), Mickel Liu (University of Washington), Raymond Fok (University of Washington), Nouha Dziri (Allen Institute for AI), Yulia Tsvetkov (Department of Computer Science, University of Washington), Maarten Sap (Carnegie Mellon University), Yejin Choi (UW => Stanford / NVIDIA)
This paper introduces INFINITY-CHAT, a large-scale dataset of 26,000 diverse open-ended user queries and a comprehensive taxonomy of prompt types to evaluate creativity and diversity in language model outputs. Using this resource, the authors identify a pronounced “Artificial Hivemind” effect marked by both repetitive responses within a single model and striking similarities across different models. The dataset also includes over 31,000 human annotations enabling analysis of collective and individual preferences. Results show that existing models and evaluation methods are poorly calibrated to idiosyncratic human judgments, highlighting risks of homogenized AI outputs.
Mean Flows for One-step Generative Modeling
Authors: Zhengyang Geng (CMU), Mingyang Deng (Massachusetts Institute of Technology), Xingjian Bai (Massachusetts Institute of Technology), Zico Kolter (Carnegie Mellon University), Kaiming He (MIT)
The authors introduce MeanFlow, a principled one-step generative modeling framework based on the concept of average velocity rather than the instantaneous velocity used in prior flow-matching methods. The authors derive a formal identity linking average and instantaneous velocities to guide neural network training in a self-contained approach requiring no pretraining, distillation, or curriculum learning. MeanFlow achieves strong results, including a 3.43 FID on ImageNet 256×256 with a single function evaluation, outperforming previous one-step models. These results substantially narrow the performance gap between one-step and multi-step diffusion and flow-based methods.
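The link between average and instantaneous velocity can be sketched as a short derivation. The notation is an assumption for illustration: z_t is the state at time t, v is the instantaneous velocity, u is the average velocity over the interval [r, t] with r < t, and d/dt is the total derivative along the trajectory.

```latex
\begin{aligned}
u(z_t, r, t) &= \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau \\
(t - r)\, u(z_t, r, t) &= \int_r^t v(z_\tau, \tau)\, d\tau \\
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) &= v(z_t, t)
  \quad \text{(differentiate both sides in } t\text{)} \\
u(z_t, r, t) &= v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)
\end{aligned}
```

The last line expresses the average velocity in terms of quantities available at time t, which is what lets it serve as a training target without integrating along the path.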
Spotlight Papers
OpenCUA: Open Foundations for Computer-Use Agents
Authors: Xinyuan Wang (University of Hong Kong), Bowen Wang (University of Hong Kong), Dunjie Lu (SUN YAT-SEN UNIVERSITY), Junlin Yang (Tsinghua University), Tianbao Xie (the University of Hong Kong, University of Hong Kong), Junli Wang (Alibaba Group), Jiaqi Deng (The University of Hong Kong), Xiaole Guo (University of Hong Kong), Yiheng Xu (University of Hong Kong), Chen Wu (Carnegie Mellon University), Zhennan Shen (Shanghai Jiaotong University), Zhuokai Li (University of Hong Kong), Ryan Li (Computer Science Department, Stanford University), Xiaochuan Li (Tsinghua University), Junda Chen (Harbin Institute of Technology), Boyuan Zheng (The University of Hong Kong), Li Peihang (University of Hong Kong), Fangyu Lei (Institute of automation, Chinese academy of science, Chinese Academy of Sciences), Ruisheng Cao (Shanghai Jiaotong University), Yeqiao Fu (University of Hong Kong), Dongchan Shin (University of Hong Kong), Martin Shin (University of Hong Kong), Hu Jiarui (University of Hong Kong), Yuyan Wang (Johns Hopkins University), Jixuan Chen (University of California, San Diego), Yuxiao Ye (The Hong Kong University of Science and Technology), Danyang Zhang (Shanghai Jiao Tong University), Yipu Wang (Institute of automation, Chinese academy of science, Chinese Academy of Sciences), Heng Wang (University of Illinois Urbana-Champaign), Diyi Yang (Stanford University), Victor Zhong (University of Waterloo), Y.Charles (Moonshot AI), Zhilin Yang (Tsinghua University, Tsinghua University), Tao Yu (University of Hong Kong)
This paper introduces OpenCUA, an open-source framework designed to enable transparent research into computer-use agents built with vision–language models. The framework includes an annotation system for collecting human demonstrations, AgentNet, a large-scale dataset spanning three operating systems and 200+ applications, and a scalable pipeline that converts demonstrations into state–action data with reflective chain-of-thought reasoning. End-to-end agent models trained with OpenCUA show strong benchmark performance, with OpenCUA-72B achieving a 45.0% success rate on OSWorld-Verified, setting a new open-source state of the art.
Authors: Jiatong Shi (Carnegie Mellon University), Yifan Cheng (Huazhong University of Science and Technology), Bo-Hao Su (Carnegie Mellon University), Hye-jin Shim (Carnegie Mellon University), Jinchuan Tian (Carnegie Mellon University), Samuele Cornell (Università Politecnica delle Marche), Yiwen Zhao (School of Computer Science, Carnegie Mellon University), Siddhant Arora (Carnegie Mellon University), Shinji Watanabe (Carnegie Mellon University)
This work presents ARECHO, an autoregressive chain-based framework for jointly evaluating multiple speech quality metrics such as PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score), which traditionally differ in scale and assumptions. ARECHO introduces a comprehensive tokenization pipeline, a dynamic classifier chain to model inter-metric dependencies, and a confidence-oriented two-step decoding scheme to improve inference reliability. Experiments show that ARECHO consistently outperforms baseline methods across speech enhancement, generation evaluation, and noisy-speech scenarios. The approach also improves interpretability and flexibility by enabling reference-free evaluation and subset metric queries.
UMA: A Family of Universal Models for Atoms
Authors: Brandon Wood (FAIR at Meta), Misko Dzamba (Facebook), Xiang Fu (Periodic Labs), Meng Gao (Facebook), Muhammed Shuaibi (FAIR, Meta), Luis Barroso-Luque (Facebook), Kareem Abdelmaqsoud (Carnegie Mellon University), Vahe Gharakhanyan (Meta), John Kitchin (Carnegie Mellon University), Daniel Levine (Meta FAIR), Kyle Michel (Meta), Anuroop Sriram (Meta FAIR), Taco Cohen (Meta / FAIR), Abhishek Das (FAIR, Meta AI), Sushree Sahoo (Facebook), Ammar Rizvi (Meta), Zachary Ulissi (FAIR, Meta AI), Larry Zitnick (Fundamental AI Research at Meta AI)
This paper introduces Universal Models for Atoms (UMA), a family of large-scale models designed to rapidly and accurately predict properties from atomic simulations across chemistry and materials science. Trained on over 500 million unique 3D atomic structures spanning molecules, materials, and catalysts, UMA leverages empirical scaling laws and a novel mixture-of-linear-experts architecture to increase capacity without sacrificing speed. Evaluations show that a single UMA model, without fine-tuning, matches or outperforms specialized models across diverse applications.
A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search
Authors: Arnav Kumar Jain (Université de Montréal), Vibhakar Mohta (Nuro Inc.), Subin Kim (Korea Advanced Institute of Science & Technology), Atiksh Bhardwaj (Cornell University), Juntao Ren (Stanford University), Yunhai Feng (Cornell University), Sanjiban Choudhury (Cornell University), Gokul Swamy (Carnegie Mellon University)
This work addresses a key limitation of behavioral cloning (BC) in imitation learning: BC only teaches an agent to mimic expert actions at states the expert visited, leaving it unable to recover from mistakes. To overcome this, the authors propose SAILOR, which leverages learning to search (L2S) by training a world model and a reward model to plan and recover toward expert outcomes even after errors. SAILOR achieves stable and sample-efficient learning without additional human corrections and consistently outperforms state-of-the-art diffusion-policy BC methods across visual manipulation benchmarks. It also demonstrates robustness to nuanced failures and reward hacking, and the performance gap persists even when BC is trained with 5–10x more demonstrations.
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
Authors: Jiajun Shi (Beijing University of Aeronautics and Astronautics), Jian Yang (Alibaba Group), Jiaheng Liu (Nanjing University), Xingyuan Bu (Alibaba Group), Jiangjie Chen (ByteDance Seed), Junting Zhou (Peking University), Kaijing Ma (Tongji University), Zhoufutu Wen (ByteDance Inc.), Bingli Wang (Sichuan Agricultural University), Yancheng He (Alibaba Group), Liang Song (M-A-P), Hualei Zhu (Beijing University of Aeronautics and Astronautics), Shilong Li (Beijing University of Posts and Telecommunications), Xingjian Wang (Shanghai University of Electric Power), Wei Zhang (Beijing University of Aeronautics and Astronautics), Ruibin Yuan (Carnegie Mellon University), Yifan Yao (Beijing University of Posts and Telecommunications), Wenjun Yang (University College London, University of London), Yunli Wang (Kuaishou Technology), Siyuan Fang (Beijing University of Posts and Telecommunications), Siyu Yuan (Fudan University), Qianyu He (Fudan University), Robert Tang (Yale University), Yingshui Tan (Alibaba Group), Wangchunshu Zhou (Guangdong OPPO Mobile Telecommunications Corp.,Ltd.), ZHAO-XIANG ZHANG (Chinese Academy of Sciences, China), Zhoujun Li (Beijing University of Aeronautics and Astronautics), Wenhao Huang (Key Laboratory of Machine Perception), Ge Zhang (University of Michigan – Ann Arbor)
The authors introduce KORGym, a dynamic evaluation platform designed to comprehensively assess the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). Unlike existing domain-specific benchmarks, KORGym offers over 50 interactive games in textual and visual formats, including multi-turn and reinforcement learning scenarios. Experiments on 19 LLMs and 8 VLMs reveal consistent reasoning patterns within model families and highlight the superior performance of closed-source models. The platform also enables analysis of factors such as modality, reasoning strategies, reinforcement learning approaches, and response length, providing a robust tool for advancing reasoning evaluation in complex environments.
Towards Understanding Camera Motions in Any Video
Authors: Zhiqiu Lin (Carnegie Mellon University), Siyuan Cen (University of Massachusetts at Amherst), Daniel Jiang (Carnegie Mellon University), Jay Karhade (CMU, Carnegie Mellon University), Hewei Wang (Carnegie Mellon University), Chancharik Mitra (CMU, Carnegie Mellon University), Yu Tong Tiffany Ling (CMU, Carnegie Mellon University), Yuhan Huang (Carnegie Mellon University), Rushikesh Zawar (Carnegie Mellon University), Xue Bai (Adobe Systems), Yilun Du (Google Deepmind / Harvard), Chuang Gan (IBM), Deva Ramanan (Carnegie Mellon University)
This work presents CameraBench, a large-scale dataset and benchmark for evaluating camera motion understanding, comprising roughly 3,000 diverse videos annotated through a rigorous expert-driven process. A key contribution is a taxonomy of camera motion primitives, developed with cinematographers, which captures motions that require both geometric and semantic understanding. Human studies show that domain expertise and targeted training significantly improve motion recognition, such as distinguishing zoom from forward translation. Evaluations reveal that Structure-from-Motion models struggle with semantic motions, while generative video-language models struggle with geometric ones, and fine-tuning a generative VLM on CameraBench enables strong performance across motion-augmented captioning, video QA, and video-text retrieval tasks.
Enhancing Training Data Attribution with Representational Optimization
Authors: Weiwei Sun (Carnegie Mellon University), Haokun Liu (Department of Computer Science, University of Toronto), Nikhil Kandpal (Department of Computer Science), Colin Raffel (University of Toronto, Vector Institute and Hugging Face), Yiming Yang (CMU)
This paper presents AirRep, a scalable representation-based method for training data attribution (TDA) that learns task-specific, model-aligned representations optimized for measuring how training data affects model predictions. AirRep features a trainable encoder for attribution quality and an attention-based pooling mechanism to estimate group-wise influence accurately. Trained using a ranking objective over subsets labeled by their empirical effect, AirRep matches the performance of gradient-based methods like influence functions while being nearly 100× more efficient at inference.
Checklists Are Better Than Reward Models For Aligning Language Models
Authors: Vijay Viswanathan (Carnegie Mellon University), Yanchao Sun (University of Maryland, College Park), Xiang Kong (Apple), Meng Cao (Apple), Graham Neubig (Carnegie Mellon University), Sherry Wu (Carnegie Mellon University)
This work introduces Reinforcement Learning from Checklist Feedback (RLCF), a method for improving instruction-following in language models using flexible, instruction-specific criteria rather than fixed metrics like helpfulness or harmfulness. RLCF extracts checklists from instructions and evaluates responses against each item using AI judges and verifier programs to compute rewards for reinforcement learning. Applied to models like Qwen2.5-7B-Instruct, RLCF improves performance across five benchmarks, achieving notable gains in hard satisfaction rates and win rates, and can also enhance other models off-policy, such as Llama 3.1 8B Instruct and OLMo 2 7B Instruct. The authors release their WildChecklists dataset, models, and code to support further research in flexible instruction alignment.
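As a rough illustration of turning per-item checklist judgments into a single scalar reward, the sketch below takes a weighted mean of item scores. The function name, the [0, 1] score range, and the optional importance weights are all assumptions for illustration. In the setting described above the scores would come from AI judges and verifier programs, which are not modeled here.

```python
def checklist_reward(item_scores, weights=None):
    """Weighted mean of per-item checklist scores, each expected in [0, 1]."""
    if not item_scores:
        raise ValueError("checklist must contain at least one item")
    if weights is None:
        weights = [1.0] * len(item_scores)
    if len(weights) != len(item_scores):
        raise ValueError("one weight per checklist item is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, item_scores)) / total
```

For example, `checklist_reward([1.0, 0.5, 0.0], weights=[2, 1, 1])` gives more credit to a response that satisfies the most important item.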
Extrapolation by Association: Length Generalization Transfer In Transformers
Authors: Ziyang Cai (Princeton University), Nayoung Lee (University of Wisconsin-Madison), Avi Schwarzschild (Carnegie Mellon University), Samet Oymak (University of Michigan – Ann Arbor), Dimitris Papailiopoulos (University of Wisconsin-Madison)
This paper studies length generalization in transformer language models—the ability to handle longer inputs than seen during training—through the concept of task association. The authors show that training on a longer, related auxiliary task can improve generalization to longer inputs on a target task across algorithmic domains like arithmetic, string manipulation, and maze navigation. They find similar transfer effects in pretrained language models, suggesting pretraining provides reusable computational scaffolding. Mechanistic analysis indicates that this length generalization transfer is linked to the reuse of attention heads between tasks, highlighting how transformers leverage compositional inductive structures.
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
Authors: Xinyu Yang (CMU), Yuwei An (Carnegie Mellon University), Hongyi Liu (Carnegie Mellon University), Tianqi Chen (Carnegie Mellon University), Beidi Chen (CMU / Amazon)
This work introduces Multiverse, a generative model that enables natively parallel generation by internalizing a MapReduce paradigm with Map, Process, and Reduce stages. The approach includes Multiverse Curator for automated data creation, Multiverse Attention for separating parallel reasoning steps, and Multiverse Engine for dynamic sequential-parallel inference. After minimal fine-tuning, Multiverse-32B matches leading autoregressive LLMs in performance while achieving up to 2× speedup and better scaling efficiency. The authors have open-sourced the full Multiverse ecosystem, including models, data, serving systems, and training pipelines.
Thought Communication in Multiagent Collaboration
Authors: Yujia Zheng (Carnegie Mellon University), Zhuokai Zhao (Meta), Zijian Li (Mohamed bin Zayed University of Artificial Intelligence), Yaqi Xie (CMU), Mingze Gao (Meta Inc.), Lizhu Zhang (Meta), Kun Zhang (CMU & MBZUAI)
This work introduces thought communication, a paradigm for multi-agent interaction that goes beyond natural language by enabling agents to share latent, mind-like representations directly. The authors formalize this process as a latent variable model, proving that both shared and private thoughts, as well as the global structure of thought sharing among agents, can be identified and recovered with theoretical guarantees. They develop a framework that extracts and distributes relevant latent thoughts to agents, enhancing collaboration across modalities. Experiments on synthetic and real-world benchmarks validate the approach, showing that thought communication can unlock collaborative advantages beyond what is possible with surface-level language-based exchanges.
Cost-aware LLM-based Online Dataset Annotation
Authors: Eray Can Elumar (CMU, Carnegie Mellon University), Cem Tekin (Bilkent University), Osman Yagan (Carnegie Mellon University)
This paper introduces CaMVo, a method for labeling datasets with large language models (LLMs) while keeping costs low. Instead of querying many LLMs for every example, CaMVo adaptively chooses only a few models based on how confident they are likely to be. It uses ideas from contextual bandits (LinUCB) and a Bayesian confidence estimator to decide which models to query and how to weight their votes—without needing any ground-truth labels. Experiments on MMLU and IMDB show that CaMVo matches or beats full majority voting but with far fewer LLM calls, making it a practical approach for efficient large-scale annotation.
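The vote-weighting idea can be sketched with a classic confidence-weighted majority vote, in which each model's vote counts in proportion to the log-odds of its estimated accuracy. This is a generic stand-in, not the CaMVo algorithm: the accuracy estimates are passed in as plain numbers, whereas the paper derives its estimates from LinUCB and a Bayesian confidence estimator, and the function name is hypothetical.

```python
import math
from collections import defaultdict

def weighted_vote(labels, accuracies):
    """Return the label with the largest total log-odds weight."""
    if not labels or len(labels) != len(accuracies):
        raise ValueError("need one accuracy estimate per label")
    scores = defaultdict(float)
    for label, acc in zip(labels, accuracies):
        # Clamp to avoid log(0) or division by zero for extreme estimates.
        acc = min(max(acc, 1e-6), 1 - 1e-6)
        scores[label] += math.log(acc / (1 - acc))
    return max(scores, key=scores.get)
```

With labels `["pos", "neg", "neg"]` and accuracies `[0.95, 0.6, 0.6]`, the single high-confidence model outweighs two weak ones. This is why a small, well-chosen subset of models can stand in for a full majority vote.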
Conformal Mixed-Integer Constraint Learning with Feasibility Guarantees
Authors: Daniel Ovalle (Carnegie Mellon University), Lorenz Biegler (Carnegie Mellon University), Ignacio Grossmann (CMU, Carnegie Mellon University), Carl Laird (Carnegie Mellon University), Mateo Dulce Rubio (CMU)
The authors introduce C-MICL, a framework for learning constraints in optimization problems while guaranteeing that the resulting solutions remain feasible with high probability. Traditional learned constraints can fail due to model error or limited data, but C-MICL uses conformal prediction to add uncertainty-aware adjustments that ensure feasibility at a user-specified confidence level. The method works for both regression- and classification-based constraint learning and avoids the heavy computational overhead of ensemble approaches. Experiments show that C-MICL reliably meets feasibility targets, preserves strong optimization performance, and is significantly more efficient, offering a principled way to blend machine learning with safe decision-making.
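The uncertainty-aware adjustment can be illustrated with standard split conformal prediction. The sketch assumes a one-sided constraint of the form "true value <= threshold" and calibration scores defined as true value minus predicted value on held-out points. Both function names are hypothetical, and how C-MICL embeds the adjustment inside a mixed-integer program is not shown.

```python
import math

def conformal_margin(calibration_scores, alpha=0.1):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this confidence level.
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def is_conformally_feasible(predicted_value, threshold, margin):
    # Tightened constraint: the prediction plus the margin must stay under the threshold.
    return predicted_value + margin <= threshold
```

Under the usual exchangeability assumption, the true value exceeds the prediction plus this margin with probability at most `alpha`. Enforcing the tightened constraint therefore yields feasibility at the chosen confidence level.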
SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
Authors: Gabriele Oliaro (Carnegie Mellon University), Zhihao Jia (School of Computer Science, Carnegie Mellon University), Daniel Campos (Zipf AI), Aurick Qiao (Snowflake)
The authors present SuffixDecoding, a new speculative decoding method tailored for emerging AI workloads like LLM-based agents, which generate long, repetitive, and predictable sequences. Unlike existing speculative decoding approaches designed for diverse, independent requests, SuffixDecoding uses suffix trees to efficiently cache and reuse long stretches of past tokens from prompts and model outputs. It adaptively adjusts how many tokens to speculate—expanding aggressively when predictions are likely to be accepted and backing off when uncertainty is higher. Experiments on agent-style tasks such as SWE-Bench and Text-to-SQL show that SuffixDecoding can deliver up to 3.9× speedups, making it well suited for fast, iterative agentic inference.
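The core lookup can be sketched without a suffix tree: find the longest suffix of the current context that already occurred in earlier tokens, then propose whatever followed it as the draft. This is a deliberately simplified linear scan, not the authors' data structure. The rule that the draft budget grows with the match length (`draft_per_match`) is an illustrative assumption, as are all parameter names.

```python
def propose_draft(history, context, max_suffix=8, draft_per_match=2, max_draft=16):
    """Propose draft tokens by matching the longest suffix of `context` in `history`."""
    for length in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-length:]
        # Scan right to left so the most recent occurrence wins; stop one short
        # of the end so at least one continuation token exists.
        for start in range(len(history) - length - 1, -1, -1):
            if history[start:start + length] == suffix:
                # Longer matches justify speculating further ahead.
                budget = min(max_draft, draft_per_match * length)
                follow = start + length
                return history[follow:follow + budget]
    return []
```

A real system would verify the proposed tokens with the target model and keep only the accepted prefix. A suffix tree makes the lookup far cheaper than this scan.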
Horizon Reduction Makes RL Scalable
Authors: Seohong Park (UC Berkeley), Kevin Frans (UC Berkeley), Deepinder Mann (UC Berkeley), Benjamin Eysenbach (Princeton), Aviral Kumar (Carnegie Mellon University), Sergey Levine (UC Berkeley)
This paper examines why offline reinforcement learning (RL) often fails to scale, even when given massive datasets, large models, and ample compute. The authors find that long decision horizons—the number of steps required to propagate rewards—are a key bottleneck that prevents standard offline RL algorithms from improving with more data. Through extensive experiments, they show that reducing the effective horizon dramatically improves scalability and performance on challenging tasks. Building on this insight, they introduce SHARSA, a simple horizon-reduction method that achieves the strongest scaling behavior and best asymptotic performance across their benchmarks.
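One generic way to shorten the horizon over which values must be propagated is an n-step return: sum n real rewards, then bootstrap once. A task of length H then needs roughly H/n bootstrapped backups instead of H. The sketch below shows only this textbook target. It is not SHARSA, whose details are not given in the summary above, and the function name is hypothetical.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of the given rewards plus a bootstrapped tail value."""
    target = bootstrap_value
    # Fold from the last reward backwards: target = r + gamma * target.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target
```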
To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable RL
Authors: Yuda Song (Carnegie Mellon University), Dhruv Rohatgi (Massachusetts Institute of Technology), Aarti Singh (CMU), J. Bagnell (Carnegie Mellon University)
This paper studies when it’s better to distill privileged expert policies—which have access to latent state information during training—versus directly learning from partial observations in reinforcement learning. Using a simple theoretical model (the perturbed Block MDP) and controlled locomotion experiments, the authors show that the trade-off depends strongly on how stochastic the underlying latent dynamics are. When the latent state is easy to infer, distillation works well, but when it is highly stochastic, imitating the latent optimal policy can actually hurt performance. The results provide practical guidance: the best latent policy isn’t always the best one to distill, and deciding when to distill versus directly learning depends on the underlying uncertainty structure of the task.
Authors: Alexander Goldberg (Computer Science Department, School of Computer Science), Giulia Fanti (CMU), Nihar Shah (CMU)
MERIT is a principled framework for using randomized selection in settings like peer review or grant funding, where evaluations are noisy and uncertainty can make deterministic rankings unreliable. Instead of relying on ad-hoc randomization, MERIT uses interval estimates (e.g., confidence intervals) to model uncertainty and then optimizes for the worst-case expected number of true top-k items selected. The authors develop a polynomial-time algorithm that scales to large datasets and show that MERIT satisfies desirable fairness and robustness properties that existing methods lack. Experiments on synthetic peer-review data show that MERIT matches prior probabilistic methods in expected performance while providing stronger guarantees in worst-case scenarios.
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Authors: Thomas Kuntz (EPFL – EPF Lausanne), Agatha Duzan (EPFL – EPF Lausanne), Hao Zhao (EPFL – EPF Lausanne), Francesco Croce (University of Tübingen), Zico Kolter (Carnegie Mellon University), Nicolas Flammarion (EPFL), Maksym Andriushchenko (ELLIS Institute Tübingen and MPI-IS)
OS-Harm is a benchmark for evaluating the safety of LLM-based computer use agents that interact directly with operating system interfaces. OS-Harm tests agents across three harm categories—deliberate misuse, prompt injection attacks, and model misbehavior—using 150 tasks spanning applications like email, browsers, and code editors. An automated judge evaluates both task performance and safety, achieving strong agreement with human annotations. Evaluations of leading agents reveal that models often comply with unsafe commands, are vulnerable to prompt injections, and sometimes take unsafe actions, highlighting the need for robust safety measures in these systems.
Can We Infer Confidential Properties of Training Data from LLMs?
Authors: Pengrun Huang (University of California, San Diego), Chhavi Yadav (CMU), Kamalika Chaudhuri (FAIR, Meta and UCSD), Ruihan Wu (University of California, San Diego)
PropInfer is a benchmark designed to evaluate whether large language models (LLMs) can leak sensitive properties of the datasets used for fine-tuning, particularly in domains like healthcare. It tests property inference under both question-answering and chat-completion setups. Two tailored attacks—a prompt-based generation attack and a shadow-model attack leveraging word frequency—are proposed to extract dataset-level information. Empirical results show that these attacks can succeed across multiple pretrained LLMs, revealing an important and previously underexplored privacy risk.
Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?
Authors: Hyeong Kyu Choi (University of Wisconsin-Madison, Computer Sciences), Jerry Zhu (Carnegie Mellon University), Sharon Li (University of Wisconsin-Madison)
Multi-Agent Debate (MAD) improves large language model performance by having multiple agents reason collaboratively, but its key drivers were unclear. By separating Majority Voting from inter-agent debate, experiments across seven NLP benchmarks show that most gains come from majority voting rather than the debate itself. A theoretical analysis models debate as a stochastic process, revealing that debate alone doesn’t improve expected correctness, though targeted interventions that bias belief updates can enhance its impact. These results suggest that while MAD has potential, simple ensembling methods often remain a more reliable and effective approach.
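For reference, the majority-voting baseline that accounts for most of the gains is only a few lines. This is a minimal sketch, assuming each agent returns a final answer as a hashable value. The tie-breaking rule (earliest answer wins) is an arbitrary choice for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; ties go to the answer that appears first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer
```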
The Complexity of Symmetric Equilibria in Min-Max Optimization and Team Zero-Sum Games
Authors: Ioannis Anagnostides (Carnegie Mellon University), Ioannis Panageas (UC Irvine), Tuomas Sandholm (CMU, Strategy Robot, Optimized Markets, Strategic Machine), Jingming Yan (University of California, Irvine)
The study analyzes the complexity of computing equilibria in team-based zero-sum games and symmetric min-max optimization. It shows that finding epsilon-Nash equilibria in 3-player adversarial team games (2 vs. 1) is CLS-complete, resolving an open question about such games. Additionally, computing symmetric equilibria in symmetric min-max problems is PPAD-complete, even for quadratic objectives, and this extends to 6-player team games (3 vs. 3), implying that common symmetric dynamics cannot reliably converge. Finally, computing non-symmetric equilibria with polynomial precision is FNP-hard, highlighting the fundamental difficulty of equilibrium computation in these settings.
Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning
Authors: Emile Anand (Georgia Institute of Technology and Cognition Labs), Ishani Karmarkar (Stanford University), Guannan Qu (Carnegie Mellon University)
Scaling multi-agent reinforcement learning (MARL) is difficult because the joint state and action spaces grow exponentially with the number of agents. SUBSAMPLE-MFQ combines agent subsampling with mean-field Q-learning and a decentralized randomized policy, allowing efficient learning for any subset of k agents. The algorithm's runtime scales polynomially in k, not the total number of agents n, making it practical for large systems. Theoretical guarantees show that the learned policy converges to the optimal policy at a rate of roughly 1/√k, independent of the total agent count.
On the Hardness of Conditional Independence Testing In Practice
Authors: Zheng He (University of British Columbia), Roman Pogodin (Google), Yazhe Li (Microsoft), Namrata Deka (Carnegie Mellon University), Arthur Gretton (Google Deepmind / UCL), Danica J. Sutherland (University of British Columbia + Amii)
Conditional independence (CI) tests are central to tasks like causal discovery and fairness evaluation, but they often fail in practice despite theoretical guarantees. Focusing on the Kernel-based Conditional Independence (KCI) test, the work shows that many recent CI tests are special cases of a Generalized Covariance Measure. Practical performance is largely driven by errors in estimating the conditional mean, which affect Type I error, and by the choice of conditioning kernel, which influences test power but can also inflate false positives. These insights clarify why popular CI tests often underperform and highlight how careful kernel and estimation choices are crucial for reliable results.
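To show where conditional-mean estimation enters, here is a minimal sketch of a Generalized Covariance Measure statistic for scalar X, Y, and Z. It regresses X and Y on Z, multiplies the residuals, and normalizes their mean. Using ordinary least squares for the regressions is an assumption made to keep the example self-contained. Kernel-based tests replace it with more flexible estimators, and any error in that regression step leaks directly into the statistic.

```python
import math

def _ols_residuals(target, z):
    # Residuals of a least-squares fit of `target` on scalar `z` with intercept.
    n = len(z)
    mean_z = sum(z) / n
    mean_t = sum(target) / n
    var_z = sum((a - mean_z) ** 2 for a in z)
    cov = sum((a - mean_z) * (b - mean_t) for a, b in zip(z, target))
    slope = cov / var_z if var_z > 0 else 0.0
    return [b - (mean_t + slope * (a - mean_z)) for a, b in zip(z, target)]

def gcm_statistic(x, y, z):
    """sqrt(n) * mean(R) / std(R), where R_i is the product of the two residuals."""
    rx = _ols_residuals(x, z)
    ry = _ols_residuals(y, z)
    products = [a * b for a, b in zip(rx, ry)]
    n = len(products)
    mean = sum(products) / n
    var = sum(p * p for p in products) / n - mean ** 2
    if var <= 0:
        raise ValueError("residual products have zero variance")
    return math.sqrt(n) * mean / math.sqrt(var)
```

Under conditional independence and accurate regressions, the statistic is approximately standard normal, so a large absolute value is evidence of dependence.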
Projection-based Lyapunov method for fully heterogeneous weakly-coupled MDPs
Authors: Xiangcheng Zhang (Tsinghua), Yige Hong (Carnegie Mellon University), Weina Wang (Computer Science Department, Carnegie Mellon University)
Heterogeneity creates major challenges in large-scale decision-making, especially in weakly-coupled Markov decision processes (WCMDPs) where each subproblem has distinct dynamics. In the fully heterogeneous setting, the authors show that an efficiently computable policy can achieve an O(1/√N) optimality gap in long-run average reward per subproblem as the number of subproblems N grows. This work provides the first asymptotic optimality guarantee for fully heterogeneous average-reward WCMDPs. Key to this result is a novel use of projection-based Lyapunov functions that ensure convergence of rewards and costs even under complete heterogeneity.
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Authors: Hyungjoo Chae (Georgia Institute of Technology), Seonghwan Kim (Yonsei University), Junhee Cho (Yonsei University), Seungone Kim (Carnegie Mellon University), Seungjun Moon (Yonsei University), Gyeom Hwangbo (University of Seoul), Dongha Lim (Korea Advanced Institute of Science & Technology), Minjin Kim (Yonsei University), Yeonjun Hwang (Yonsei University), Minju Gwak (Yonsei University), Dongwook Choi (Chung-Ang University), Minseok Kang (Yonsei University), Gwanhoon Im (Yonsei University), ByeongUng Cho (Yonsei University), Hyojun Kim (Yonsei University), Jun Han (Yonsei University), Taeyoon Kwon (Yonsei University), Minju Kim (Yonsei University), Beong-woo Kwak (Yonsei University), Dongjin Kang (Yonsei University), Jinyoung Yeo (Yonsei University)
Web navigation poses a long-horizon sequential decision-making challenge that goes beyond typical multimodal LLM tasks, but step-level reward models have been lacking. Web-Shepherd, the first process reward model (PRM) for web navigation, evaluates trajectories at each step, enabling both training and test-time assessment. The approach is supported by the WebPRM Collection, a 40K step-level dataset with annotated preference pairs, and WebRewardBench, a benchmark for evaluating PRMs. Experiments show Web-Shepherd outperforms GPT-4o by ~30 points on WebRewardBench and improves policy performance on WebArena-lite by 10.9 points while reducing verification cost by 10×, demonstrating a practical and efficient solution for web navigation tasks.
Fair Cooperation in Mixed-Motive Games via Conflict-Aware Gradient Adjustment
Authors: Woojun Kim (Carnegie Mellon University), Katia Sycara (Carnegie Mellon University)
Mixed-motive multi-agent reinforcement learning requires balancing individual incentives with collective goals, which are often in conflict. The proposed adaptive conflict-aware gradient adjustment method dynamically balances policy gradients from individual and collective objectives, promoting cooperation while preserving fairness in task-specific rewards. Theoretical analysis guarantees monotonic improvement in both collective and individual outcomes, ensuring fairness across agents. Experiments in sequential social dilemma environments show that this approach outperforms baselines in social welfare while maintaining equitable outcomes for all agents.
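As a rough illustration of what it means for two gradients to conflict, the sketch below uses a generic projection rule in the style of PCGrad-like gradient surgery. It is not the adaptive adjustment proposed in the paper, whose exact rule is not given in this summary. Gradients are plain lists of floats, and all names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out_conflict(g, ref):
    """If g conflicts with ref (negative inner product), remove g's component along ref."""
    d = dot(g, ref)
    if d >= 0:
        return list(g)  # no conflict: leave g unchanged
    scale = d / dot(ref, ref)
    return [gi - scale * ri for gi, ri in zip(g, ref)]


def combined_update(g_individual, g_collective):
    """Sum the individual gradient with a collective gradient made non-conflicting."""
    adjusted = project_out_conflict(g_collective, g_individual)
    return [a + b for a, b in zip(g_individual, adjusted)]


if __name__ == "__main__":
    g_individual = [1.0, 0.0]   # toy gradient of an agent's own objective
    g_collective = [-1.0, 1.0]  # toy gradient of the shared objective
    update = combined_update(g_individual, g_collective)
    print(update, dot(update, g_individual) >= 0)
```

In this fixed-rule variant, the combined update never has a negative inner product with the individual gradient. The method summarized above is described as adaptive, so it should be expected to differ from this projection.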
Poster Papers
Applications
MLZero: A Multi-Agent System for End-to-end Machine Learning Automation
Authors: Haoyang Fang (AWS), Boran Han (AWS), Nick Erickson (Amazon Web Services), Xiyuan Zhang (AWS AI), Su Zhou (Carnegie Mellon University), Anirudh Dagar (AWS), Jiani Zhang (Google), Caner Turkmen (Amazon Web Services), Tony Hu (AWS AI), Huzefa Rangwala (George Mason University), Ying Nian Wu (University of California, Los Angeles), Yuyang (Bernie) Wang (AWS AI), George Karypis (University of Minnesota, Minneapolis)
Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex
Authors: Muquan Yu (Chinese University of Hong Kong), Mu Nan (University of Hong Kong), Hossein Adeli (Columbia University), Jacob Prince (Harvard University), John A. Pyles (University of Washington), Leila Wehbe (Carnegie Mellon University), Maggie Henderson (Carnegie Mellon University), Michael Tarr (Carnegie Mellon University), Andrew Luo (University of Hong Kong)
Topology-Aware Conformal Prediction for Stream Networks
Authors: Jifan Zhang (Northwestern University), Fangxin Wang (University of Illinois at Chicago), Zihe Song (University of Illinois at Chicago), Philip S Yu (UIC), Kaize Ding (Northwestern University), Shixiang Zhu (Carnegie Mellon University)
ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions
Authors: Yue Huang (University of Notre Dame), Zhengzhe Jiang (Sichuan University), Xiaonan Luo (University of Notre Dame), Kehan Guo (University of Notre Dame), Haomin Zhuang (University of Notre Dame), Yujun Zhou (University of Notre Dame), Zhengqing Yuan (University of Notre Dame), Xiaoqi Sun (Massachusetts Institute of Technology), Jules Schleinitz (California Institute of Technology), Yanbo Wang (Mohamed bin Zayed University of Artificial Intelligence), Shuhao Zhang (Carnegie Mellon University), Mihir Surve (University of Notre Dame), Nitesh Chawla (University of Notre Dame), Olaf Wiest (University of Notre Dame), Xiangliang Zhang (University of Notre Dame)
LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
Authors: Yang Xiao (Hong Kong Polytechnic University), Jiashuo WANG (HKPU), Ruifeng Yuan (Hong Kong Polytechnic University), Chunpu Xu (Hong Kong Polytechnic University), Kaishuai Xu (Hong Kong Polytechnic University), Wenjie Li (The Hong Kong Polytechnic University), Pengfei Liu (Carnegie Mellon University)
Retrieval is Not Enough: Enhancing RAG through Test-Time Critique and Optimization
Authors: Jiaqi Wei (Zhejiang University), Hao Zhou (South China University of Technology), Xiang Zhang (University of British Columbia), Di Zhang (Shanghai Artificial Intelligence Laboratory), Zijie Qiu (Fudan University), Noah Wei (Carnegie Mellon University), Jinzhe Li (Fudan University), Wanli Ouyang (Shanghai AI Lab), Siqi Sun (Fudan University)
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Authors: Ziyang Ma (Shanghai Jiao Tong University), Yinghao Ma (Centre for Digital Music, Queen Mary University of London), Yanqiao Zhu (Shanghai Jiaotong University), Chen Yang (Shanghai Jiaotong University), Yi-Wen Chao (Nanyang Technological University), Ruiyang Xu (Shanghai Jiaotong University), Wenxi Chen (Shanghai Jiaotong University), Yuanzhe Chen (ByteDance Inc.), Zhuo Chen (ByteDance Inc.), Jian Cong (ByteDance Inc.), Kai Li (Tsinghua University), Keliang Li (Chinese Academy of Sciences), Siyou Li (Queen Mary University of London), Xinfeng Li (Nanyang Technological University), Xiquan Li (Shanghai Jiaotong University), Zheng Lian (Institute of Automation, Chinese Academy of Sciences), Yuzhe Liang (Shanghai Jiaotong University), Minghao Liu (2077AI), Zhikang Niu (Shanghai Jiaotong University), Tianrui Wang (Tianjin University), Wang Yuping (University of Science and Technology of China), Yuxuan Wang (ByteDance), Yihao Wu (Nanyang Technological University), Guanrou Yang (Shanghai Jiaotong University), Jianwei Yu (Microsoft), Ruibin Yuan (Carnegie Mellon University), Zhisheng Zheng (University of Texas at Austin), Ziya Zhou (Hong Kong University of Science and Technology), Haina Zhu (Shanghai Jiaotong University), Wei Xue (Hong Kong University of Science and Technology), Emmanouil Benetos (Queen Mary University of London), Kai Yu (Shanghai Jiao Tong University), Eng-Siong Chng (Nanyang Technological University), Xie Chen (Shanghai Jiaotong University)
A Generalist Intracortical Motor Decoder
Authors: Joel Ye (Carnegie Mellon University), Fabio Rizzoglio (Northwestern University), Xuan Ma (Northwestern University), Adam Smoulder (Carnegie Mellon University), Hongwei Mao (University of Pittsburgh), Gary Blumenthal (University of Pittsburgh), William Hockeimer (University of Pittsburgh), Nicolas Kunigk (University of Pittsburgh), Dalton Moore (University of Chicago), Patrick Marino (Phantom Neuro), Raeed Chowdhury, J. Patrick Mayo (University of Pittsburgh), Aaron Batista (University of Pittsburgh), Steven Chase, Michael Boninger (University of Pittsburgh), Charles Greenspon (University of Chicago), Andrew B Schwartz (University of Pittsburgh), Nicholas Hatsopoulos (University of Chicago), Lee Miller (Northwestern University at Chicago), Kristofer Bouchard (Lawrence Berkeley National Laboratory), Jennifer Collinger (University of Pittsburgh), Leila Wehbe (Carnegie Mellon University), Robert Gaunt (University of Pittsburgh)
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia
Authors: Chandler Smith (Oxford University), Marwa Abdulhai (University of California, Berkeley), Manfred Díaz (Mila, Quebec), Marko Tesic (University of Cambridge), Rakshit Trivedi (Massachusetts Institute of Technology), Sasha Vezhnevets (DeepMind), Lewis Hammond (University of Oxford / Cooperative AI Foundation), Jesse Clifton (Center on Long-Term Risk), Minsuk Chang (Google DeepMind), Edgar Duenez-Guzman (Google DeepMind), John Agapiou (Google DeepMind), Jayd Matyas (DeepMind), Danny Karmon (Google DeepMind), Beining Zhang (University of Southampton), Jim Dilkes (University of Southampton), Akash Kundu (Heritage Institute of Technology), Hieu Minh Nguyen (Apart Research), Emanuel Tewolde (Carnegie Mellon University), Jebish Purbey (Tribhuvan University), Ram Mohan Rao Kadiyala, Siddhant Gupta (Indian Institute of Technology, Roorkee), Aliaksei Korshuk (Coframe), Buyantuev Alexander (Higher School of Economics), Ilya Makarov (AIRI & ISP RAS), Gang Zhao (Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University), Rolando Fernandez (University of Texas at Austin), Zhihan Wang (University of Texas at Austin), Caroline Wang (The University of Texas at Austin | Google DeepMind), Jiaxun Cui (Meta), Lingyun Xiao (University of Texas at Austin), Di Shi (University of Texas at Austin), Yoonchang Sung (Nanyang Technological University), Muhammad Arrasy Rahman (The University of Texas at Austin), Peter Stone (The University of Texas at Austin, Sony AI), Yipeng Kang (National Key Laboratory of General Artificial Intelligence), Hyeonggeun Yun (Companoid Labs), Ananya Ananya (Stanford University), Taehun Cha (Korea University), Zhiqiang Wu (Tongji University), Elizaveta Tennant (University College London), Olivia Macmillan-Scott (UCL), Marta Segura (University College London, Uni