ニューラルアルゴリズム推論
The Gradient の記事は、現代の深層学習が抱える不確実性や一般化の欠如という課題に対し、古典的アルゴリズムの「証明可能性」「汎用性」「解釈可能性」という強みをニューラルネットワークに組み込む必要性を説いている。
キーポイント
古典的アルゴリズムの優位性
最短経路探索やソートなどの古典的計算は、証明可能な正しさ、リソース保証、そして入力規模や分布が変わっても機能する強力な汎用性を備えている。
深層学習の限界との対比
現代のディープニューラルネットワークは、精度の保証が難しく、分布外入力に対して脆弱であり、ブラックボックス化して解釈や組み合わせ(構成性)が困難であるという欠点がある。
人間にとって有用な AI の要件
AI が人間に概念を教えるなど実用的になるためには、入力の詳細に依存しない出力品質と、新たな状況への一般化能力が不可欠であり、これは古典的アルゴリズムの特性そのものである。
計算のニューラルネットワークによる捉え方
本記事では、これらの古典的な計算プロセスをどのようにして深層ニューラルネットワークでキャプチャ(模倣・統合)するかという研究課題への導入を行っている。
AI の有用性と一般化能力の重要性
AI が人間に対して説得力を持って概念を教えるためには、入力の一貫した詳細に依存せず、新しい状況にもその概念を一般化する能力が不可欠である。
古典的アルゴリズムの実行学習のベンチマーク価値
ニューラルネットワークが古典的アルゴリズムを実行できるかを調べることは、無限のデータソースと明確な目標関数を持つため、モデルの評価や解釈性分析に最適な環境を提供する。
アルゴリズム整合性と一般化性能の関係
「What Can Neural Networks Reason About?」という論文は、アーキテクチャが特定のタスクに対してより良いアルゴリズム的整合性を持つほど、サンプル効率(必要な学習データ数)が高まり、結果として一般化能力も向上することを数学的に示した。
影響分析・編集コメントを表示
影響分析
この記事は、単なる技術解説を超え、現在の AI ブームにおける根本的な欠陥(ブラックボックス化と一般化の失敗)を指摘し、古典的計算理論への回帰や融合の重要性を提唱しています。これは、将来的に信頼性が高く説明可能な AI システムを構築するための重要な研究指針となり、特に安全性が求められる分野でのニューラルネットワークの応用可能性を広げる示唆に富む内容です。
編集コメント
深層学習の「黒箱」問題に対する古典的アルゴリズムの視点からのアプローチは、信頼性 AI の文脈で非常に重要な議論です。技術的な詳細よりも、なぜその特性が AI に必要なのかという本質的な問いかけが含まれており、研究者にとって示唆に富む内容です。
しかし、私の問いは、これらの前提が真であることを何ら要求するものではありません。第一に、現実世界はしばしば信じられないほどノイズが多く動的であり、そのような抽象化された入力を提供することはほとんどありません。例えば、タスクは現実世界の交通ネットワーク内の2地点間を移動する最適な方法に関わる可能性があり、これはリアルタイムの交通速度を推定するためにノイズの多い複雑なデータを処理することを必要とする、難しい経路探索問題です――これは私がGoogle Mapsにグラフニューラルネットワーク(GNN)を導入する作業をした際に、個人的に学んだ教訓です[29]。第二に、なぜ「最適」が最短でなければならないのでしょうか?交通経路探索の文脈では、具体的な状況や目標に応じて、「最適」は最も費用対効果が高い、最も汚染が少ないなどを意味する可能性が十分にあります。これらのすべてのバリエーションは、ダイクストラ法の直接的な適用を困難にし、実際には複数のアルゴリズムの組み合わせを必要とするかもしれません。
これらの両方の問題は、複雑な現実世界のデータと、対象アルゴリズムを実行するのに適した入力との間で、しばしば困難なマッピングを行う必要があることを浮き彫りにしています。歴史的に、このマッピングは人間によって、手動で、または専門的なヒューリスティクスを介して行われてきました。これは当然、次の疑問を呼び起こします:人間は、いつの日か
原文を表示
imageIn this article, we will talk about classical computation: the kind of computation typically found in an undergraduate Computer Science course on Algorithms and Data Structures [1]. Think shortest path-finding, sorting, clever ways to break problems down into simpler problems, incredible ways to organise data for efficient retrieval and updates. Of course, given The Gradient’s focus on Artificial Intelligence, we will not stop there; we will also investigate how to capture such computation with deep neural networks.
Why capture classical computation?
Maybe it’s worth starting by clarifying where my vested interest in classical computation comes from. Competitive programming—the art of solving problems by rapidly writing programs that need to terminate in a given amount of time, and within certain memory constraints—was a highly popular activity in my secondary school. For me, it was truly the gateway into Computer Science, and I trust the story is similar for many machine learning practitioners and researchers today. I have been able to win several medals at international programming competitions, such as the Northwestern Europe Regionals of the ACM-ICPC, the top-tier Computer Science competition for university students. Hopefully, my successes in competitive programming also give me some credentials to write about this topic.
While this should make clear why I care about classical computation, why should we all care? To arrive at this answer, let us ponder some of the key properties that classical algorithms have:
They are provably correct, and we can often have strong guarantees about the resources (time or memory) required for the computation to terminate.
They offer strong generalisation: while algorithms are often devised by observing several small-scale example inputs, once implemented, they will work without fault on inputs that are significantly larger, or distributionally different than such examples.
By design, they are interpretable and compositional: their (pseudo)code representation makes it much easier to reason about what the computation is actually doing, and one can easily recompose various computations together through subroutines to achieve different capabilities.
Looking at all of these properties taken together, they seem to be exactly the issues that plague modern deep neural networks the most: you can rarely guarantee their accuracy, they often collapse on out-of-distribution inputs, and they are very notorious as black boxes, with compounding errors that can hinder compositionality.
However, it is exactly those skills that are important for making AI instructive and useful to humans! For example, to have an AI system that reliably and instructively teaches a concept to a human, the quality of its output should not depend on minor details of the input, and it should be able to generalise that concept to novel situations. Arguably, these skills are also a missing key step on the road to building generally-intelligent agents. Therefore, if we are able to make any strides towards capturing traits of classical computation in deep neural networks, this is likely to be a very fruitful pursuit.
First impressions: Algorithmic alignment
My journey into the field started with an interesting, but seemingly inconsequential question:
Can neural networks learn to execute classical algorithms?
This can be seen as a good way to benchmark to what extent are certain neural networks capable of behaving algorithmically; arguably, if a system can produce the outputs of a certain computation, it has “captured” it. Further, learning to execute provides a uniquely well-built environment for evaluating machine learning models:
An infinite data source—as we can generate arbitrary amounts of inputs;
They require complex data manipulation—making it a challenging task for deep learning;
They have a clearly specified target function—simplifying interpretability analyses.
When we started studying this area in 2019, we really did not think more of it than a very neat benchmark—but it was certainly becoming a very lively area. Concurrently with our efforts, a team from MIT tried tackling a more ambitious, but strongly related problem:
What makes a neural network better (or worse) at fitting certain (algorithmic) tasks?
The landmark paper, “What Can Neural Networks Reason About?” [2] established a mathematical foundation for why an architecture might be better for a task (in terms of sample complexity: the number of training examples needed to reduce validation loss below epsilon). The authors’ main theorem states that better algorithmic alignment leads to better generalisation. Rigorously defining algorithmic alignment is out of scope of this text, but it can be very intuitively visualised:
Here, we can see a visual decomposition of how a graph neural network (GNN) [3] aligns with the classical Bellman-Ford [4] algorithm for shortest path-finding. Specifically, Bellman-Ford maintains its current estimate of how far away each node is from the source node: a distance variable (du) for each node u in a graph. At every step, for every neighbour v of node u, an update to du is proposed: a combination of (optimally) reaching v, and then traversing the edge connecting v and u (du + wvu). The distance variable is then updated as the optimal out of all the proposals. Operations of a graph neural network can naturally decompose to follow the data flow of Bellman-Ford:
The distance variables correspond to the node features maintained by the GNN;
Adding the edge distance to a distance variable corresponds to computing the GNN’s message function;
Choosing the optimal neighbour based on this measure corresponds to the GNN’s permutation-invariant aggregation function, ⨁.
Generally, it can be proven that, the more closely we can decompose our neural network to follow the algorithm’s structure, the more favourable sample complexity we can expect when learning to execute such algorithms. Bellman-Ford is a typical instance of a dynamic programming (DP) algorithm [5], a general-purpose problem-solving strategy that breaks the problem down into subproblems, and then recombines their solutions to find the final solution.
The MIT team made the important observation that GNNs appear to algorithmically align with DP, and since DP can itself be used to express many useful forms of classical computation, GNNs should be a very potent general-purpose model for learning to execute. This was validated by several carefully constructed DP execution benchmarks, where relational models like GNNs clearly outperformed architectures with weaker inductive biases. GNNs have been a long-standing interest of mine [6], so the time was right to release our own contribution to the area:
Neural Execution of Graph Algorithms
In this paper [7], concurrently released with Xu et al. [2], we conducted a thorough empirical analysis of learning to execute with GNNs. We found that while algorithmic alignment is indeed a powerful tool for model class selection, it unfortunately does not allow us to be reckless. Namely, we cannot just apply any expressive GNN to an algorithmic execution task and expect great results, especially out-of-distribution—which we previously identified as a key setting in which “true” reasoning systems should perform well. Much like other neural networks, GNNs can very easily overfit to the characteristics of the distribution of training inputs, learning “clever hacks” and sidestepping the actual procedure that they are attempting to execute.
We hence identify three key observations on careful inductive biases to use, to improve the algorithmic alignment to certain path-finding problems even further and allow for generalising to 5x larger inputs at test time:
Most traditional deep learning setups involve a stack of layers with unshared parameters. This fundamentally limits the amount of computation the neural network can perform: if, at test time, an input much larger than the ones in the training data arrives, it would be expected that more computational steps are needed—yet, the unshared GNN has no way to support that. To address this, we adopt the encode-process-decode paradigm [8]: a single shared processor GNN is iterated for many steps, and this number of steps can be variable at both training and inference time. Such an architecture also allows a neat way to algorithmically align to iterative computation, as most algorithms involve repeatedly applying a certain computation until convergence.
Since most path-finding algorithms (including Bellman-Ford) require “local” optimisation (i.e. choosing exactly one optimal neighbour at every step), we opted to use max aggregation to combine the messages sent in GNNs. While this may seem like a very intuitive idea, it went strongly against the folklore of the times, as max-GNNs were known to be theoretically inferior to sum-GNNs at distinguishing non-isomorphic graphs [9]. (We now have solid theoretical evidence [10] for why this is a good idea.)
Lastly, while most deep learning setups only require producing an output given an input, we found that this misses out on a wealth of ways to instruct the model to align to the algorithm. For example, there are many interesting invariants that algorithms have that can be explicitly taught to a GNN. In the case of Bellman-Ford, after k iterations are executed, we should be expected to be able to recover all shortest paths that are no more than k hops away from the source node. Accordingly, we use this insight to provide step-wise supervision to the GNN at every step. This idea appears to be gaining traction in Large Language Model design in recent months [11, 12].
All three of the above changes make for stronger algorithmically-aligned GNNs.
Playing the alignment game
It must be stressed that the intuitive idea of algorithmic alignment—taking inspiration from Computer Science concepts in architecture design—is by no means novel! The text-book example of this are the neural Turing machine (NTM) and differentiable neural computer (DNC) [13, 14]. These works are decisively influential; in fact, in its attempt to make random-access memory compatible with gradient-based optimisation, the NTM gave us one of the earliest forms of content-based attention, three years before Transformers [15]!
However, despite their influence, these architectures are nowadays virtually never used in practice. There are many possible causes for this, but in my opinion it is because their design was too brittle: trying to introduce many differentiable components into the same architecture at once, without a clear guidance on how to compose them, or a way to easily debug them once they failed to show useful signal on a new task. Our line of work still wants to build something like an NTM—but make it more successfully deployed, by using algorithmic alignment to more carefully prototype each building block in isolation, and have a more granular view of which blocks benefit execution of which target algorithms.
Our approach of “playing the algorithmic alignment game” appears to have yielded a successful line of specialised (G)NNs, and we now have worthy ‘fine-grained’ solutions for learning to execute linearithmic sequence algorithms [16], iterative algorithms [17], pointer-based data structures [18], as well as persistent auxiliary memory [19]. Eventually, these insights also carried over to more fine-grained theory as well. In light of our NEGA paper [7], the inventors of algorithmic alignment refined their theory into what is now known as linear algorithmic alignment [10], providing justification for, among other things, our use of the max aggregation. More recent insights show that understanding algorithmic alignment may require causal reasoning [20,21], properly formalising it may require category theory [22], and properly describing it may require analysing asynchronous computation [23]. Algorithmic alignment is therefore turning into a very exciting area for mathematical approaches to deep learning in recent years.
Why not just run the target algorithm?...and rebuttals
While it appears a lot of useful progress has been made towards addressing our initial “toy question”, the idea of learning to execute is not one that easily passes peer review. My personal favourite reviewer comment I received—quoted in full—was as follows: “This paper will certainly accelerate research in building algorithmic models, and there’s certainly a lot of researchers that would make advantage of it, I am just not sure that this research should even exist”.
This is clearly not the nicest thing to receive as the first author of a paper. But still, let’s try to put my ego aside, and see what can be taken away from such reviews. There is a clear sense in which such reviews are raising an entirely valid point: tautologically, the target algorithm will execute itself better (or equally good) than any GNN we’d ever learn over it. Clearly, if we want wide-spread recognition of these ideas, we need to show how learning to execute can be usefully applied beyond the context of “pure” execution.
Our exploration led us to many possible ideas, and in the remainder of this article, I will show three ideas that saw the most impact.
First, algorithmically aligned models can accelerate science. And if you want clear evidence of this, look no further than the cover of Nature [24]. In our work, we train (G)NNs to fit a mathematical dataset of interest to a mathematician, and then use simple gradient saliency methods to signal to the mathematician which parts of the inputs to focus their attention at. While such signals are often remarkably noisy, they do allow a mathematician to study only the most salient 20-30 nodes in a graph that would otherwise have hundreds or thousands of nodes, making pattern discovery much easier. The discovered patterns can then form the basis of novel theorems, and/or be used to derive conjectures.
With this simple approach, we were able to drive independent contributions to two highly disparate areas of math: knot theory [25] and representation theory [26], both subsequently published in their areas’ respective top-tier journals, hence earning us the Nature accolade. But, while our approach is simple in principle, a question arose especially in the representation theory branch: which (G)NN to use? Standard expressive GNNs did not yield clearly interpretable results.
This is where algorithmic alignment helped us: Geordie Williamson, our representation theory collaborator, provided an algorithm that would compute the outputs we care about, if we had access to privileged information. We achieved our best results with a GNN model that was explicitly aligning its components to this target algorithm.
More generally: in this case, a target algorithm existed, but executing it was inapplicable (due to privileged inputs). Algorithmic alignment allowed us to embed “priors” from it anyway.
Second, algorithmically aligned models are fast heuristics. In recent work with computer networking and machine learning collaborators from ETH Zürich, we studied the applicability of neural algorithmic reasoning in computer networking [27]. Specifically, we sought to expedite the challenging task of configuration synthesis: based on a given specification of constraints a computer network should satisfy, produce a corresponding network configuration (a graph specifying the routers in a network and their connections). This configuration must satisfy all of the specifications, once a network protocol has been executed over it.
Producing these configurations is known to be a very challenging NP-hard problem—in practice, it is usually solved with slow SMT solvers, which can often require doubly-exponential complexity. Instead, we choose to use ideas from algorithmic reasoning to generate configurations by inverting the protocol (which can be seen as a graph algorithm). Specifically, we generate many random network configurations, execute the protocol over them, and collect all true facts about the resulting network to extract corresponding specifications. This gives us all we need to generate a graph-based dataset from specifications to configurations, and fit an algorithmically-aligned GNN to this dataset.
Naturally, by virtue of just requiring a forward pass of a machine learning model, this approach is substantially faster than SMT solvers at inference time: for certain configurations, we have observed over 490x speedup over the prior state-of-the-art. Of course, there is no free lunch: the price we pay for this speedup are occasional inaccuracies in the produced configurations at test-time. That being said, on all the relevant distributions we evaluated, the average number of constraints satisfied has consistently been over 90%, which makes our method already applicable for downstream human-in-the-loop use—and it is likely to accelerate human designers, as very often, the initial configurations are unsatisfiable, meaning SMT solvers will spend a lot of effort only to say a satisfying configuration cannot be found. During this time, a fast forward pass of a GNN could have allowed for far more rapid iteration.
More generally: in this case, a target algorithm is only being approximated, but such that it provides a fast and reasonably accurate heuristic, enabling rapid human-in-the-loop iteration.
A core problem with applying classical algorithms
To set us up for the third and final idea, let me pose a motivating task: “Find me the optimal path from A to B”. How do you respond to this prompt?
Chances are, especially if you come from a theoretical Computer Science background like me, that you will respond to this question in a very singular way. Namely, you will subtly assume that I am providing you with a weighted graph and asking you for the shortest path between two specific vertices in this graph. We can then diligently apply our favourite path-finding algorithm (e.g. Dijkstra’s algorithm [28]) to resolve this query. I should highlight that, at least at the time of writing this text, the situation is not very different with most of today’s state-of-the-art AI chatbots—when prompted with the above task, while they will often seek further information, they will promptly assume that there is an input weighted graph provided!
However, there’s nothing in my question that requires either of these assumptions to be true. Firstly, the real-world is often incredibly noisy and dynamic, and rarely provides such abstractified inputs. For example, the task might concern the optimal way to travel between two places in a real-world transportation network, which is a challenging routing problem that relies on processing noisy, complex data to estimate real-time traffic speeds—a lesson I’ve personally learnt, when I worked on deploying GNNs within Google Maps [29]. Secondly, why must “optimal” equal shortest? In the context of routing traffic, and depending on the specific contexts and goals, “optimal” may well mean most cost-efficient, least polluting, etc. All of these variations make a straightforward application of Dijkstra’s algorithm difficult, and may in practice require a combination of several algorithms.
Both of these issues highlight that we often need to make a challenging mapping between complex real-world data and an input that will be appropriate for running a target algorithm. Historically, this mapping is performed by humans, either manually or via specialised heuristics. This naturally invites the following question: Can humans ever hope to
関連記事
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み