<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://danielhuangjiakang.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://danielhuangjiakang.github.io/" rel="alternate" type="text/html" /><updated>2026-04-01T07:18:28+00:00</updated><id>https://danielhuangjiakang.github.io/feed.xml</id><title type="html">Jiakang Huang’s Website</title><subtitle>Jiakang Huang&apos;s personal academic and research website</subtitle><author><name>Jiakang Huang</name><email>jhuang74@student.ubc.ca</email></author><entry xml:lang="zh"><title type="html">PyTorch Inductor 中 speedup_by_fusion 深度解析</title><link href="https://danielhuangjiakang.github.io/zh/blog/speedup-by-fusion-pytorch-inductor/" rel="alternate" type="text/html" title="PyTorch Inductor 中 speedup_by_fusion 深度解析" /><published>2026-03-31T00:00:00+00:00</published><updated>2026-03-31T00:00:00+00:00</updated><id>https://danielhuangjiakang.github.io/zh/blog/speedup-by-fusion-pytorch-inductor-CN</id><content type="html" xml:base="https://danielhuangjiakang.github.io/zh/blog/speedup-by-fusion-pytorch-inductor/"><![CDATA[<p><strong>作者：</strong> Jiakang Huang</p>

<figure class="post-feature-image">
  <img src="/images/speedup-by-fusion-cover.png" alt="PyTorch Inductor 中 speedup_by_fusion 的封面图" />
</figure>

<p>上篇文章我们分析了 Inductor 中 <code class="language-plaintext highlighter-rouge">fuse_nodes</code> 的整体架构和工作流程（详见：<a href="/zh/blog/fuse-nodes-pytorch-inductor/">PyTorch Inductor 中 fuse_nodes 融合流程深度解析</a>）。本篇我们将聚焦其中一个有趣的配置项 <code class="language-plaintext highlighter-rouge">speedup_by_fusion</code>，从开启方式、运行机制、实际日志到局限性展开讨论。</p>

<h2 id="1-如何开启-speedup_by_fusion">1. 如何开启 speedup_by_fusion</h2>

<p><code class="language-plaintext highlighter-rouge">speedup_by_fusion</code> 是 <code class="language-plaintext highlighter-rouge">torch._inductor.config</code> 中的一个配置项。开启后，Inductor 在融合决策阶段会通过实际 benchmark 来判断两个算子融合后是否真的更快，而不仅仅依赖启发式打分。</p>

<p>可以通过以下方式开启：</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch._inductor.config</span> <span class="k">as</span> <span class="n">config</span>
<span class="n">config</span><span class="p">.</span><span class="n">benchmark_fusion</span> <span class="o">=</span> <span class="bp">True</span>
</code></pre></div></div>

<p>或者通过 <code class="language-plaintext highlighter-rouge">torch.compile</code> 的 <code class="language-plaintext highlighter-rouge">options</code> 参数传入：</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">compiled_model</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="s">"inductor"</span><span class="p">,</span> <span class="n">options</span><span class="o">=</span><span class="p">{</span><span class="s">"benchmark_fusion"</span><span class="p">:</span> <span class="bp">True</span><span class="p">})</span>
</code></pre></div></div>

<h2 id="2-开启后做了什么">2. 开启后做了什么</h2>

<p>在默认模式下，Inductor 的融合决策完全基于启发式规则——通过 <code class="language-plaintext highlighter-rouge">can_fuse</code> 检查合法性，通过 <code class="language-plaintext highlighter-rouge">score_fusion</code> 打分排序，然后贪心地执行融合。</p>

<p>开启 <code class="language-plaintext highlighter-rouge">benchmark_fusion</code> 后，流程增加了一个关键步骤：<strong>对候选融合对进行实际 GPU benchmark</strong>。具体来说，系统会分别计时：</p>

<ul>
  <li>两个算子<strong>独立运行</strong>的总耗时</li>
  <li>两个算子<strong>融合后</strong>作为一个 kernel 的耗时</li>
</ul>

<p>只有当融合后确实更快时，才执行该融合。</p>
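
<p>这个决策规则可以用几行示意代码表达（仅为说明性示例，并非 Inductor 的真实实现；真实逻辑位于 scheduler 的 benchmark 路径中）：</p>

```python
# 示意代码：只有当融合 kernel 的耗时低于两个 kernel 串行耗时之和
# （其中也隐含省掉了一次 launch overhead）时才接受融合。
def should_fuse(ms_node1, ms_node2, ms_fused):
    return ms_fused < ms_node1 + ms_node2

def speedup(ms_node1, ms_node2, ms_fused):
    # 大致对应日志中 "2.462x speedup" 这样的比值
    return (ms_node1 + ms_node2) / ms_fused

print(should_fuse(0.30, 0.25, 0.40))        # True
print(round(speedup(0.30, 0.25, 0.40), 3))  # 1.375
```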

<h2 id="3-日志中的-speedup-示例">3. 日志中的 Speedup 示例</h2>

<p>开启后，在 Inductor 的 fusion 日志中可以看到类似如下的输出：</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>V0312 02:40:20.816000 3795204 scheduler.py:4396] [0/0] [__fusion]
  can fuse (benchmark): fusing OrderedSet(['buf17']) with OrderedSet(['buf18'])
  cause 2.462x speedup
</code></pre></div></div>

<p>这条日志表明 <code class="language-plaintext highlighter-rouge">buf17</code> 和 <code class="language-plaintext highlighter-rouge">buf18</code> 经过实际 benchmark 测试后，融合带来了 <strong>2.462 倍</strong>的加速，因此决定执行融合。</p>

<h2 id="4-局限性与-register-spilling-问题">4. 局限性与 Register Spilling 问题</h2>

<p>开启 <code class="language-plaintext highlighter-rouge">speedup_by_fusion</code> 虽然看起来更加“科学”，但实际使用中存在两个值得讨论的问题。</p>

<h3 id="41-贪心融合的全局最优性问题">4.1 贪心融合的全局最优性问题</h3>

<p>benchmark 测试的是<strong>两个算子</strong>融合前后的性能对比。但这个局部最优并不一定意味着全图在 GPU 上运行时也是最优的。贪心算法的固有缺陷在于：局部最优决策的累积不一定导向全局最优。</p>

<h3 id="42-register-spilling-导致的融合拒绝">4.2 Register Spilling 导致的融合拒绝</h3>

<p>在实际 benchmark 过程中，可能出现融合后的 kernel 因为 <strong>register spilling</strong> 而被拒绝融合的情况。日志示例如下：</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>V0312 02:40:31.500000 3795204 scheduler.py:1776] [0/0] [__fusion]
  cannot fuse op1_op6_op11_op2_op7_op12 with op16_op17_op18:
  register spilling of the fused kernel
</code></pre></div></div>

<p><strong>什么是 Register Spilling？</strong> GPU 的每个线程有有限数量的寄存器。当一个 kernel 需要的寄存器数量超出硬件限制时，多余的变量会被“溢出”到较慢的 local memory 中。这就是 register spilling。它会导致显著的性能下降，因为 local memory 的访问延迟远高于寄存器访问。</p>

<p>当前实现中，一旦检测到 register spilling，就<strong>直接拒绝该融合</strong>，不再进一步评估。这带来了一个重要疑问：</p>

<blockquote>
  <p><strong>即使发生了 register spilling，融合带来的 launch overhead 减少是否有可能超过 spilling 的性能损失？</strong></p>
</blockquote>

<p>换句话说，当前的实现可能因为 register spilling 而过于保守地拒绝了一些实际上有益的融合。</p>
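
<p>下面用一段假设性的示意代码对比两种策略（函数名与参数均为虚构，并非 Inductor 的真实 API）：</p>

```python
# 示意代码：当前的 "硬否决" 策略 vs. 继续 benchmark 的替代策略。
def current_policy(n_spills, ms_separate, ms_fused):
    if n_spills > 0:
        return False  # 硬否决：只要检测到 spilling 就直接拒绝融合
    return ms_fused < ms_separate

def benchmark_through(n_spills, ms_separate, ms_fused):
    return ms_fused < ms_separate  # 无论是否 spilling，都让实测数据说话

# 一个虽然 spilling、但整体仍然更快的融合 kernel：
print(current_policy(4, 0.55, 0.48))     # False（当前策略会拒绝）
print(benchmark_through(4, 0.55, 0.48))  # True
```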

<h2 id="5-实验数据">5. 实验数据</h2>

<p>为验证上述假设，我在 <strong>RTX 5090</strong> 上基于一个合成 workload 做了对比实验。实验环境为 PyTorch 2.10.0+cu128。</p>

<ul>
  <li><strong>benchmark_fusion_0</strong>：关闭 <code class="language-plaintext highlighter-rouge">benchmark_fusion</code>（纯启发式）</li>
  <li><strong>benchmark_fusion_1</strong>：开启 <code class="language-plaintext highlighter-rouge">benchmark_fusion</code></li>
</ul>

<h3 id="模型-20hubconflictroundopt">模型 20：HubConflictRoundOpt</h3>

<p>该模型具有共享 hub tensor 和多分支竞争结构，包含多种 reduction 和 transcendental 运算。</p>

<table>
  <thead>
    <tr>
      <th>指标</th>
      <th>关闭 (fusion_0)</th>
      <th>开启 (fusion_1)</th>
      <th>变化</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>编译后运行时间 (ms)</td>
      <td>0.817</td>
      <td>0.964</td>
      <td>+17.9% (变慢)</td>
    </tr>
    <tr>
      <td>Eager 运行时间 (ms)</td>
      <td>79.36</td>
      <td>60.10</td>
      <td>-24.3%</td>
    </tr>
    <tr>
      <td>编译加速比 vs Eager</td>
      <td>97.1x</td>
      <td>62.4x</td>
      <td>-35.8%</td>
    </tr>
    <tr>
      <td>FX 编译耗时 (s)</td>
      <td>7.22</td>
      <td>20.00</td>
      <td>+176.8%</td>
    </tr>
    <tr>
      <td>融合轮数</td>
      <td>3</td>
      <td>2</td>
      <td>-1</td>
    </tr>
    <tr>
      <td>节点缩减数</td>
      <td>67</td>
      <td>62</td>
      <td>-5</td>
    </tr>
    <tr>
      <td>Benchmark 决策次数</td>
      <td>0</td>
      <td>62</td>
      <td>+62</td>
    </tr>
  </tbody>
</table>

<h3 id="数据分析">数据分析</h3>

<p>查看该 workload 的完整日志后可以确认：<strong>所有少融合的节点，都是因为开启 benchmark 后检测到 register spilling 而被拒绝的。</strong></p>

<p>在这个模型上，开启 <code class="language-plaintext highlighter-rouge">benchmark_fusion</code> 后，融合轮数减少、节点缩减数减少，最终编译后运行时间反而<strong>变慢了 17.9%</strong>。这说明在这个 workload 中，<strong>因 register spilling 而少融合节点所带来的额外 launch overhead，很可能比融合后可能出现的 spilling 成本更大。</strong></p>

<p>更值得注意的是，开启 benchmark 后 FX 编译时间从 <strong>7.22s</strong> 增加到 <strong>20.00s</strong>，增幅约 <strong>176.8%</strong>，因为每个候选对都需要实际在 GPU 上跑一遍。</p>

<h2 id="6-思考">6. 思考</h2>

<p>使用真实 benchmark 来决定两个节点是否应该融合，这无疑是一个聪明的做法——它直接用数据说话，避免了启发式规则可能的误判。</p>

<p>但当前对 register spilling 的处理方式过于简单粗暴：<strong>一旦检测到 spilling，直接拒绝融合，不再进行 benchmark 评估。</strong> 即使只看这一个 workload，这种策略也可能过于保守。</p>

<p>个人认为，即使出现了 register spilling，也应该继续运行 benchmark，让实际的运行数据来决定是否融合。毕竟 register spilling 的影响程度取决于溢出量和访问模式，并非所有 spilling 都会导致不可接受的性能下降。</p>

<p>当然，我对 benchmark 的具体实现方式了解有限，也许存在更好的方法来判断融合前后的性能差异。欢迎大家通过邮件与我讨论。</p>]]></content><author><name>Jiakang Huang</name><email>jhuang74@student.ubc.ca</email></author><summary type="html"><![CDATA[详解 PyTorch Inductor 的 speedup_by_fusion 配置：开启方式、工作原理、benchmark 日志示例，以及 register spilling 带来的融合决策争议。]]></summary></entry><entry xml:lang="en"><title type="html">Deep Dive into speedup_by_fusion in PyTorch Inductor</title><link href="https://danielhuangjiakang.github.io/blog/speedup-by-fusion-pytorch-inductor/" rel="alternate" type="text/html" title="Deep Dive into speedup_by_fusion in PyTorch Inductor" /><published>2026-03-31T00:00:00+00:00</published><updated>2026-03-31T00:00:00+00:00</updated><id>https://danielhuangjiakang.github.io/blog/speedup-by-fusion-pytorch-inductor</id><content type="html" xml:base="https://danielhuangjiakang.github.io/blog/speedup-by-fusion-pytorch-inductor/"><![CDATA[<p><strong>Author:</strong> Jiakang Huang</p>

<figure class="post-feature-image">
  <img src="/images/speedup-by-fusion-cover.png" alt="Cover illustration for speedup_by_fusion in PyTorch Inductor" />
</figure>

<p>In the previous post, we walked through the overall architecture of <code class="language-plaintext highlighter-rouge">fuse_nodes</code> in PyTorch Inductor (see: <a href="/blog/fuse-nodes-pytorch-inductor/">Deep Dive into fuse_nodes in PyTorch Inductor</a>). Today we zoom in on a particularly interesting configuration within the fusion pipeline: <code class="language-plaintext highlighter-rouge">speedup_by_fusion</code> (exposed as <code class="language-plaintext highlighter-rouge">benchmark_fusion</code>). We will cover how to enable it, what it does under the hood, what the logs look like, and a critical limitation around register spilling that may lead to suboptimal fusion decisions.</p>

<h2 id="1-enabling-benchmark_fusion">1. Enabling benchmark_fusion</h2>

<p><code class="language-plaintext highlighter-rouge">benchmark_fusion</code> is a config flag in <code class="language-plaintext highlighter-rouge">torch._inductor.config</code>. When turned on, Inductor uses actual GPU benchmarks—rather than heuristics alone—to decide whether fusing two operators is worthwhile.</p>

<p>You can enable it in two ways:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch._inductor.config</span> <span class="k">as</span> <span class="n">config</span>
<span class="n">config</span><span class="p">.</span><span class="n">benchmark_fusion</span> <span class="o">=</span> <span class="bp">True</span>
</code></pre></div></div>

<p>Or via the <code class="language-plaintext highlighter-rouge">options</code> dict passed to <code class="language-plaintext highlighter-rouge">torch.compile</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">compiled_model</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">backend</span><span class="o">=</span><span class="s">"inductor"</span><span class="p">,</span> <span class="n">options</span><span class="o">=</span><span class="p">{</span><span class="s">"benchmark_fusion"</span><span class="p">:</span> <span class="bp">True</span><span class="p">})</span>
</code></pre></div></div>

<h2 id="2-what-happens-when-it-is-enabled">2. What Happens When It Is Enabled</h2>

<p>In default mode, Inductor’s fusion decisions are purely heuristic: <code class="language-plaintext highlighter-rouge">can_fuse</code> checks legality, <code class="language-plaintext highlighter-rouge">score_fusion</code> ranks candidates, and fusions are applied greedily.</p>

<p>With <code class="language-plaintext highlighter-rouge">benchmark_fusion</code> enabled, an additional step is inserted: <strong>each candidate fusion pair is actually benchmarked on the GPU</strong>. The system times:</p>

<ul>
  <li>The <strong>separate execution</strong> of both operators</li>
  <li>The <strong>fused execution</strong> as a single kernel</li>
</ul>

<p>A fusion is only committed if the fused kernel is measurably faster.</p>
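
<p>The decision rule can be sketched as a simple timing comparison (illustrative code, not Inductor's actual implementation; the real logic lives in the scheduler's benchmark path):</p>

```python
# Illustrative sketch: accept the fusion only if the fused kernel beats
# running both kernels back-to-back (which also saves one launch overhead).
def should_fuse(ms_node1, ms_node2, ms_fused):
    return ms_fused < ms_node1 + ms_node2

def speedup(ms_node1, ms_node2, ms_fused):
    # roughly the ratio reported as "2.462x speedup" in the fusion logs
    return (ms_node1 + ms_node2) / ms_fused

print(should_fuse(0.30, 0.25, 0.40))        # True
print(round(speedup(0.30, 0.25, 0.40), 3))  # 1.375
```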

<h2 id="3-what-the-logs-look-like">3. What the Logs Look Like</h2>

<p>With benchmark fusion enabled, the Inductor fusion log emits entries like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>V0312 02:40:20.816000 3795204 scheduler.py:4396] [0/0] [__fusion]
  can fuse (benchmark): fusing OrderedSet(['buf17']) with OrderedSet(['buf18'])
  cause 2.462x speedup
</code></pre></div></div>

<p>This tells us that <code class="language-plaintext highlighter-rouge">buf17</code> and <code class="language-plaintext highlighter-rouge">buf18</code> were actually benchmarked, and the fused kernel ran <strong>2.462x faster</strong>, so the fusion was accepted.</p>

<h2 id="4-limitations-and-the-register-spilling-problem">4. Limitations and the Register Spilling Problem</h2>

<p>While benchmark-driven fusion sounds strictly better than heuristics, there are two issues worth examining.</p>

<h3 id="41-greedy-fusion-is-not-globally-optimal">4.1 Greedy Fusion Is Not Globally Optimal</h3>

<p>The benchmark evaluates a <strong>single pair</strong> of operators in isolation. Even if fusing A and B is locally faster, it does not guarantee that the resulting full graph is globally optimal. This is an inherent limitation of greedy algorithms: a sequence of locally optimal decisions may not compose into a globally optimal solution.</p>

<h3 id="42-register-spilling-causes-premature-rejection">4.2 Register Spilling Causes Premature Rejection</h3>

<p>During benchmarking, the fused kernel may trigger <strong>register spilling</strong>, at which point the current implementation immediately rejects the fusion without measuring the actual performance impact. Here is an example from the logs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>V0312 02:40:31.500000 3795204 scheduler.py:1776] [0/0] [__fusion]
  cannot fuse op1_op6_op11_op2_op7_op12 with op16_op17_op18:
  register spilling of the fused kernel
</code></pre></div></div>

<p><strong>What is register spilling?</strong> Each GPU thread has a limited number of registers. When a kernel requires more registers than the hardware provides per thread, the excess variables are “spilled” to local memory, which resides in much slower off-chip storage. This increases memory traffic and can degrade performance significantly.</p>

<p>The current implementation treats register spilling as a hard rejection signal. But this raises an important question:</p>

<blockquote>
  <p><strong>Could the reduction in kernel launch overhead from fusion outweigh the performance cost of register spilling?</strong></p>
</blockquote>

<p>In other words, the current policy may be too conservative, rejecting fusions that would still be net beneficial despite some spilling.</p>
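
<p>To make the contrast concrete, here is a hypothetical sketch of the two policies (the function and parameter names are illustrative, not Inductor APIs):</p>

```python
# Sketch: the current hard-veto policy vs. a benchmark-through alternative.
def current_policy(n_spills, ms_separate, ms_fused):
    if n_spills > 0:
        return False  # hard veto: any spilling rejects the fusion outright
    return ms_fused < ms_separate

def benchmark_through(n_spills, ms_separate, ms_fused):
    return ms_fused < ms_separate  # let the measured timing decide, spills or not

# A fused kernel that spills but is still faster overall:
print(current_policy(4, 0.55, 0.48))     # False (rejected today)
print(benchmark_through(4, 0.55, 0.48))  # True
```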

<h2 id="5-experimental-results">5. Experimental Results</h2>

<p>To investigate, I ran a controlled experiment on an <strong>RTX 5090</strong> with PyTorch 2.10.0+cu128, comparing two settings on one synthetic workload:</p>

<ul>
  <li><strong>benchmark_fusion_0</strong>: benchmark fusion <strong>off</strong> (heuristic-only)</li>
  <li><strong>benchmark_fusion_1</strong>: benchmark fusion <strong>on</strong></li>
</ul>

<h3 id="model-20-hubconflictroundopt">Model 20: HubConflictRoundOpt</h3>

<p>A synthetic model with a shared hub tensor feeding six competing branches, mixing reductions across different axes and transcendental operations (<code class="language-plaintext highlighter-rouge">tanh</code>, <code class="language-plaintext highlighter-rouge">sin*cos</code>, <code class="language-plaintext highlighter-rouge">relu</code>).</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Off (fusion_0)</th>
      <th>On (fusion_1)</th>
      <th>Change</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Compiled runtime (ms)</td>
      <td>0.817</td>
      <td>0.964</td>
      <td>+17.9% (slower)</td>
    </tr>
    <tr>
      <td>Eager runtime (ms)</td>
      <td>79.36</td>
      <td>60.10</td>
      <td>-24.3%</td>
    </tr>
    <tr>
      <td>Compiled speedup vs Eager</td>
      <td>97.1x</td>
      <td>62.4x</td>
      <td>-35.8%</td>
    </tr>
    <tr>
      <td>FX compile time (s)</td>
      <td>7.22</td>
      <td>20.00</td>
      <td>+176.8%</td>
    </tr>
    <tr>
      <td>Fusion rounds</td>
      <td>3</td>
      <td>2</td>
      <td>-1</td>
    </tr>
    <tr>
      <td>Net node reduction</td>
      <td>67</td>
      <td>62</td>
      <td>-5</td>
    </tr>
    <tr>
      <td>Benchmark decisions</td>
      <td>0</td>
      <td>62</td>
      <td>+62</td>
    </tr>
  </tbody>
</table>

<h3 id="analysis">Analysis</h3>

<p>After reviewing the full logs for this workload, I can confirm that <strong>every fusion rejected in the benchmark-on run was rejected due to register spilling</strong>—not because the benchmark showed a slowdown.</p>

<p>For this model, turning on <code class="language-plaintext highlighter-rouge">benchmark_fusion</code> reduced the number of fusion rounds, reduced net node elimination, and made the compiled runtime <strong>17.9% slower</strong>. That pattern suggests that <strong>the extra launch overhead from keeping more kernels separate (due to spilling-based rejections) outweighed the cost of the spilling that the fused kernels might have incurred.</strong></p>

<p>The compile-time cost is also substantial: FX compile time increased from <strong>7.22s to 20.00s</strong> (<strong>+176.8%</strong>), since each candidate pair has to be compiled and profiled on the GPU.</p>

<h2 id="6-discussion">6. Discussion</h2>

<p>Using real benchmarks to validate fusion decisions is a smart idea—it replaces speculation with measurement. However, the current handling of register spilling is arguably too blunt: <strong>spilling is treated as a hard veto, bypassing the benchmark entirely.</strong></p>

<p>This single workload already suggests that the policy may be overly conservative. A more nuanced approach would be to let the benchmark run even when spilling is detected, and let the actual timing data determine whether the fusion is worthwhile. After all, the severity of register spilling depends heavily on the amount of spilling and memory access patterns—not all spilling leads to unacceptable performance degradation.</p>

<p>I am not fully familiar with the internals of the benchmark implementation, and there may well be better ways to evaluate pre- and post-fusion performance. If you have thoughts or ideas on this topic, I would love to hear from you—feel free to reach out by email.</p>]]></content><author><name>Jiakang Huang</name><email>jhuang74@student.ubc.ca</email></author><summary type="html"><![CDATA[A benchmark-driven analysis of PyTorch Inductor's speedup_by_fusion config, its runtime logs, and why register spilling can reject fusions that still help.]]></summary></entry><entry xml:lang="zh"><title type="html">PyTorch Inductor 中 fuse_nodes 融合流程深度解析</title><link href="https://danielhuangjiakang.github.io/zh/blog/fuse-nodes-pytorch-inductor/" rel="alternate" type="text/html" title="PyTorch Inductor 中 fuse_nodes 融合流程深度解析" /><published>2026-03-29T00:00:00+00:00</published><updated>2026-03-29T00:00:00+00:00</updated><id>https://danielhuangjiakang.github.io/zh/blog/fuse-nodes-pytorch-inductor-CN</id><content type="html" xml:base="https://danielhuangjiakang.github.io/zh/blog/fuse-nodes-pytorch-inductor/"><![CDATA[<p><strong>作者：</strong> Jiakang Huang，Xueyan Zhang</p>

<figure class="post-feature-image">
  <img src="/images/fuse-nodes-pytorch-inductor-cover.png" alt="PyTorch Inductor fuse_nodes 融合流程示意图" />
</figure>

<h2 id="总览">总览</h2>

<p>下图展示了 <code class="language-plaintext highlighter-rouge">fuse_nodes</code> 的完整调用链。整个过程可以概括为一句话：<strong>在节点图上反复寻找可融合的节点对，按优先级打分排序，然后依次尝试真正的融合，直到图不再缩小为止。</strong></p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fuse_nodes(nodes)
│
└─► fuse_nodes_once()  ×最多10轮，节点数不变或=1时提前停止
    │
    ├─ 1. get_possible_fusions()  ─────────────────── 枚举所有候选融合对
    │   │
    │   ├─ [Loop 1] 按 buffer_name 分组
    │   │   对每个 fusable node，按其读写的 buffer 归入 dict
    │   │
    │   ├─ [Loop 2] 在每个 buffer 组内 check_all_pairs()
    │   │   │  ► 窗口优化：只看前后各64个邻居 → O(64n) 而非 O(n²)
    │   │   │
    │   │   └─► can_fuse(n1, n2)  ──────────────────── 8大类门控检查
    │   │       │  ① 自身判等           ⑤ 顺序/拓扑依赖
    │   │       │  ② 特殊节点拦截       ⑥ 数据类型兼容
    │   │       │  ③ Template快速放行   ⑦ 内存/尺寸约束
    │   │       │  ④ Grouped节点禁入    ⑧ 其他后端限制
    │   │       │
    │   │       └─ 若失败 &amp; node2 是 template/foreach
    │   │          → 反转方向再试 can_fuse(n2, n1)
    │   │            (容器节点可以"吸收"其他节点)
    │   │
    │   ├─ [Loop 3] aggressive_fusion 模式
    │   │   按 node.group 再分一次组，组内再 check_all_pairs()
    │   │
    │   └─► get_possible_fusions_with_highest_priority() ── 去重 &amp; 选优
    │       │
    │       ├─ get_backend(device).get_fusion_pair_priority(n1, n2)
    │       │   后端接口：CPU/CUDA 各自决定融合方式的优先级
    │       │
    │       └─ 同一 pair 可能来自不同分组路径 → 只保留最高优先级的那条
    │
    ├─ 2. score_fusion_key()  ─────────────────────── 对候选对打分排序
    │   │
    │   └─► V.choices.score_fusion()
    │       基于三个维度：
    │         • 融合类型 (template / reduction / ...)
    │         • 预估节省的内存带宽
    │         • 原始图中的拓扑距离（越近越优先）
    │
    └─ 3. _try_fusion_pairs()  ────────────────────── 按排序顺序逐对尝试融合
        排序至关重要：若先融合 (A,B)，则 (B,C) 自动作废
</code></pre></div></div>

<h2 id="阶段一寻找候选对---get_possible_fusions">阶段一：寻找候选对 - <code class="language-plaintext highlighter-rouge">get_possible_fusions</code></h2>

<p>这一步的目标是从整张图中筛出所有“有可能且值得”被融合的节点对。</p>

<h3 id="buffer-分组---融合的前提">Buffer 分组 - 融合的前提</h3>

<p>代码首先遍历所有 fusable node，按照节点读写的 <code class="language-plaintext highlighter-rouge">buffer_name</code> 建立一个分组字典。背后的直觉很简单：如果两个节点不共享任何 buffer，融合它们大概率没有收益，既不能省掉中间 buffer 的分配，也不能减少内存搬运。因此只在同一个 buffer 组内部做配对检查。</p>

<h3 id="窗口优化---控制搜索空间">窗口优化 - 控制搜索空间</h3>

<p>在每个 buffer 组内调用 <code class="language-plaintext highlighter-rouge">check_all_pairs</code> 做两两配对。这里有一个关键优化：PyTorch 默认只在节点列表的<strong>前后各 64 个邻居</strong>之间检查。对于长度为 <code class="language-plaintext highlighter-rouge">n</code> 的节点列表，候选对数量上界是 <code class="language-plaintext highlighter-rouge">64 * n</code>，而非朴素的 <code class="language-plaintext highlighter-rouge">n^2</code>。这让融合搜索在大型图上依然可控。</p>
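
<p>窗口配对的效果可以用如下示意代码说明（仅为说明性示例，真实逻辑在 Scheduler 的相关方法中；这里只向前枚举，每个无序对恰好出现一次，与“前后各 64 个邻居”等价）：</p>

```python
# 窗口配对示意：每个节点只与其后最多 64 个邻居配对，
# 候选对数量上界约为 64 * n，而非 n^2。
WINDOW = 64

def windowed_pairs(nodes):
    for i in range(len(nodes)):
        for n2 in nodes[i + 1 : i + 1 + WINDOW]:
            yield nodes[i], n2

pairs = list(windowed_pairs(list(range(200))))
print(len(pairs))  # 10720，远小于朴素两两配对的 19900
```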

<h3 id="can_fuse---8-大类门控"><code class="language-plaintext highlighter-rouge">can_fuse</code> - 8 大类门控</h3>

<p>每一对候选都必须通过 <code class="language-plaintext highlighter-rouge">can_fuse(node1, node2)</code> 的严格审查。检查项至少包括：</p>

<ol>
  <li><strong>自身判等</strong>：<code class="language-plaintext highlighter-rouge">node1 == node2</code>，直接跳过。</li>
  <li><strong>特殊节点拦截</strong>：<code class="language-plaintext highlighter-rouge">FusedMixOrderReductions</code> 等已融合节点不允许再次融合。</li>
  <li><strong>Template 快速放行</strong>：template 节点有专门的短路判定通道。</li>
  <li><strong>Grouped 节点禁入</strong>：<code class="language-plaintext highlighter-rouge">GroupedSchedulerNode</code> 已被分组调度，不再参与融合。</li>
  <li><strong>顺序依赖检查</strong>：最重要的一项，确保融合不会打破数据流的拓扑顺序。</li>
  <li>以及数据类型兼容、内存和尺寸约束、后端限制等更多细粒度校验。</li>
</ol>

<p>一个有趣的细节：如果 <code class="language-plaintext highlighter-rouge">can_fuse(n1, n2)</code> 判定失败，但 <code class="language-plaintext highlighter-rouge">n2</code> 是 template 或 foreach 节点，代码会<strong>反转方向</strong>再试一次 <code class="language-plaintext highlighter-rouge">can_fuse(n2, n1)</code>。原因在于 template 和 foreach 本质上是“容器节点”，它们可以把别的节点“吸收”进来，所以方向不同，融合语义也不同。</p>
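
<p>这个反转重试的逻辑大致如下（示意代码，辅助函数均为假设，真实检查在 Scheduler 的 <code class="language-plaintext highlighter-rouge">can_fuse</code> 中）：</p>

```python
# 示意代码：先按原方向检查，失败且 n2 是容器节点时反转再试。
def try_fuse_pair(can_fuse, is_container, n1, n2):
    if can_fuse(n1, n2):
        return (n1, n2)
    if is_container(n2) and can_fuse(n2, n1):
        return (n2, n1)  # 容器节点把另一个节点"吸收"进来
    return None

# 玩具规则：只有容器节点可以出现在左侧。
can = lambda a, b: a.startswith("tmpl")
is_tmpl = lambda n: n.startswith("tmpl")
print(try_fuse_pair(can, is_tmpl, "buf3", "tmpl1"))  # ('tmpl1', 'buf3')
```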

<h3 id="激进模式">激进模式</h3>

<p>当 <code class="language-plaintext highlighter-rouge">config.aggressive_fusion</code> 开启时，代码会额外按 <code class="language-plaintext highlighter-rouge">node.group</code> 再做一轮分组。调度器认为同一 group 内的节点属于同一个更大的逻辑单元，值得更积极地尝试融合。</p>

<h2 id="阶段二去重与打分">阶段二：去重与打分</h2>

<h3 id="去重---get_possible_fusions_with_highest_priority">去重 - <code class="language-plaintext highlighter-rouge">get_possible_fusions_with_highest_priority</code></h3>

<p>同一对 <code class="language-plaintext highlighter-rouge">(node1, node2)</code> 可能从不同的分组路径被重复选出，一次来自 buffer 组，一次来自 node group。不同路径意味着不同的融合方式，而我们只需要保留最优的那一种。</p>

<p>去重的核心依据来自后端接口 <code class="language-plaintext highlighter-rouge">get_backend(device).get_fusion_pair_priority(node1, node2)</code>。这是一个动态分派调用，先根据 device 找到对应的后端，例如 CPU 或 CUDA，再调用该后端自己的优先级评估逻辑。基类默认返回 <code class="language-plaintext highlighter-rouge">0</code>，但各后端可以覆写。</p>

<h3 id="打分---score_fusion_key">打分 - <code class="language-plaintext highlighter-rouge">score_fusion_key</code></h3>

<p>去重后的候选对会经过 <code class="language-plaintext highlighter-rouge">V.choices.score_fusion()</code> 打分。打分维度包括：</p>

<ul>
  <li><strong>融合类型</strong>：template 融合、reduction 融合等不同类型权重不同。</li>
  <li><strong>预估节省的内存带宽</strong>：融合后能少搬多少数据，这是最核心的收益指标。</li>
  <li><strong>原始图中的拓扑距离</strong>：距离越近的节点对越优先融合。</li>
</ul>

<p>所有候选对按分数<strong>从高到低排序</strong>，排序结果直接决定融合的先后顺序。</p>
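
<p>打分排序可以用一个排序键来示意（维度与数值均为虚构示例，并非真实打分函数）：</p>

```python
# 示意代码：节省带宽越多越优先，拓扑距离越近越优先。
candidates = [
    (("A", "B"), 4096, 1),  # (节点对, 预估节省字节数, 拓扑距离)
    (("B", "C"), 4096, 5),
    (("C", "D"), 1024, 2),
]
ranked = sorted(candidates, key=lambda c: (-c[1], c[2]))
print([c[0] for c in ranked])  # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```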

<h2 id="阶段三尝试融合---_try_fusion_pairs">阶段三：尝试融合 - <code class="language-plaintext highlighter-rouge">_try_fusion_pairs</code></h2>

<p>这是真正执行融合的地方。<strong>排序至关重要</strong>：候选对按分数从高到低依次尝试，一旦某个节点已被融合，包含该节点的其他候选对自动作废。</p>

<p>举例来说，假设候选列表中有 <code class="language-plaintext highlighter-rouge">(A, B)</code> 和 <code class="language-plaintext highlighter-rouge">(B, C)</code>，且 <code class="language-plaintext highlighter-rouge">(A, B)</code> 分数更高。那么 <code class="language-plaintext highlighter-rouge">(A, B)</code> 会先被融合，之后 <code class="language-plaintext highlighter-rouge">(B, C)</code> 就不再可行，因为 <code class="language-plaintext highlighter-rouge">B</code> 已经消失在融合节点 <code class="language-plaintext highlighter-rouge">AB</code> 中了。</p>
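
<p>这种候选作废机制可以用几行示意代码表达（假设性实现）：</p>

```python
# 贪心融合示意：按分数顺序依次尝试，节点一旦被融合，
# 包含它的后续候选对自动作废。
def greedy_fuse(ranked_pairs):
    fused, applied = set(), []
    for n1, n2 in ranked_pairs:
        if n1 in fused or n2 in fused:
            continue  # 某个节点已被融合，该候选对作废
        applied.append((n1, n2))
        fused.update((n1, n2))
    return applied

print(greedy_fuse([("A", "B"), ("B", "C"), ("C", "D")]))
# [('A', 'B'), ('C', 'D')]
```

<p>真实实现中，融合会产生一个新的融合节点，并在下一轮 <code class="language-plaintext highlighter-rouge">fuse_nodes_once</code> 中继续参与融合；这里为简化而省略。</p>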

<p>这种贪心策略加上前面精心设计的打分函数，使得 Inductor 能在合理的时间内找到一个高质量的融合方案。</p>

<h2 id="小结">小结</h2>

<p><code class="language-plaintext highlighter-rouge">fuse_nodes</code> 的设计体现了几个工程上的权衡：</p>

<ul>
  <li><strong>窗口优化</strong>把搜索复杂度从 <code class="language-plaintext highlighter-rouge">O(n^2)</code> 压到接近 <code class="language-plaintext highlighter-rouge">O(n)</code> 的实践表现，让大型图也可行。</li>
  <li><strong>多路分组</strong>通过 buffer 组、node group 和 aggressive 模式，在不同粒度上捕捉融合机会。</li>
  <li><strong>后端分派</strong>让 CPU 和 CUDA 可以各自定义融合偏好。</li>
  <li><strong>贪心排序</strong>用一个简单但有效的策略，在候选对之间做出取舍。</li>
</ul>

<p>整体来看，这是一个“宽搜索到窄筛选到贪心决策”的经典优化流程。</p>]]></content><author><name>Jiakang Huang</name><email>jhuang74@student.ubc.ca</email></author><summary type="html"><![CDATA[系统梳理 PyTorch Inductor 如何在 fuse_nodes 中枚举候选对、执行打分排序，并以贪心策略推进图融合。]]></summary></entry><entry xml:lang="en"><title type="html">Deep Dive into fuse_nodes in PyTorch Inductor</title><link href="https://danielhuangjiakang.github.io/blog/fuse-nodes-pytorch-inductor/" rel="alternate" type="text/html" title="Deep Dive into fuse_nodes in PyTorch Inductor" /><published>2026-03-29T00:00:00+00:00</published><updated>2026-03-29T00:00:00+00:00</updated><id>https://danielhuangjiakang.github.io/blog/fuse-nodes-pytorch-inductor</id><content type="html" xml:base="https://danielhuangjiakang.github.io/blog/fuse-nodes-pytorch-inductor/"><![CDATA[<p><strong>Authors:</strong> Jiakang Huang, Xueyan Zhang</p>

<figure class="post-feature-image">
  <img src="/images/fuse-nodes-pytorch-inductor-cover.png" alt="Cover illustration for the PyTorch Inductor fuse_nodes workflow" />
</figure>

<h2 id="overview">Overview</h2>

<p>The diagram below shows the full call chain of <code class="language-plaintext highlighter-rouge">fuse_nodes</code>. The entire process boils down to one sentence: <strong>repeatedly find fusable node pairs in the graph, score and rank them by priority, then greedily attempt the actual fusions until the graph stops shrinking.</strong></p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fuse_nodes(nodes)
│
└─► fuse_nodes_once()  ×up to 10 rounds; early exit if size unchanged or =1
    │
    ├─ 1. get_possible_fusions()  ─────────────────── enumerate candidate pairs
    │   │
    │   ├─ [Loop 1] Group nodes by buffer_name
    │   │   For each fusable node, bucket it by the buffers it reads/writes
    │   │
    │   ├─ [Loop 2] check_all_pairs() within each buffer group
    │   │   │  ► Window optimization: only check ±64 neighbors → O(64n) not O(n²)
    │   │   │
    │   │   └─► can_fuse(n1, n2)  ──────────────────── 8 categories of gate checks
    │   │       │  ① Identity check          ⑤ Topological dependency
    │   │       │  ② Special node block       ⑥ Dtype compatibility
    │   │       │  ③ Template fast-path       ⑦ Memory / size constraints
    │   │       │  ④ Grouped node ban         ⑧ Other backend limits
    │   │       │
    │   │       └─ If failed &amp; node2 is template/foreach
    │   │          → retry reversed: can_fuse(n2, n1)
    │   │            (container nodes can "absorb" other nodes)
    │   │
    │   ├─ [Loop 3] aggressive_fusion mode
    │   │   Re-group by node.group, then check_all_pairs() within each group
    │   │
    │   └─► get_possible_fusions_with_highest_priority() ── deduplicate &amp; select
    │       │
    │       ├─ get_backend(device).get_fusion_pair_priority(n1, n2)
    │       │   Backend interface: CPU/CUDA each decide fusion-method priority
    │       │
    │       └─ Same pair may arrive from different grouping paths
    │          → keep only the highest-priority entry
    │
    ├─ 2. score_fusion_key()  ─────────────────────── score &amp; sort candidates
    │   │
    │   └─► V.choices.score_fusion()
    │       Based on three dimensions:
    │         • Fusion type (template / reduction / ...)
    │         • Estimated memory bandwidth saved
    │         • Topological distance in the original graph (closer = better)
    │
    └─ 3. _try_fusion_pairs()  ────────────────────── attempt fusions in rank order
        Order is critical: fusing (A,B) first invalidates (B,C)
</code></pre></div></div>

<h2 id="phase-1-finding-candidate-pairs---get_possible_fusions">Phase 1: Finding Candidate Pairs - <code class="language-plaintext highlighter-rouge">get_possible_fusions</code></h2>

<p>The goal here is to sift through the entire graph and produce every node pair that is both <em>possible</em> and <em>worthwhile</em> to fuse.</p>

<h3 id="buffer-grouping---the-prerequisite-for-fusion">Buffer Grouping - The Prerequisite for Fusion</h3>

<p>The code first iterates over all fusable nodes and buckets each one by the <code class="language-plaintext highlighter-rouge">buffer_name</code> values it reads or writes, building a grouping dictionary. The intuition is straightforward: if two nodes share no buffers, fusing them is unlikely to yield any benefit because there is no intermediate buffer to eliminate and no memory traffic to save. So pair-checking is restricted to nodes within the same buffer group.</p>

<h3 id="window-optimization---taming-the-search-space">Window Optimization - Taming the Search Space</h3>

<p>Within each buffer group, <code class="language-plaintext highlighter-rouge">check_all_pairs</code> enumerates pairwise candidates. A key optimization keeps this tractable: PyTorch only checks nodes within a <strong>window of plus or minus 64 neighbors</strong> in the node list. For a list of length <code class="language-plaintext highlighter-rouge">n</code>, this caps the number of candidate pairs at <code class="language-plaintext highlighter-rouge">64 * n</code> rather than the naive <code class="language-plaintext highlighter-rouge">n^2</code>. This makes the fusion search feasible even on very large graphs.</p>
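
<p>The effect of the window can be sketched as follows (illustrative only; the real logic lives in the scheduler, and enumerating forward-only yields each unordered pair exactly once, equivalent to the ±64 window):</p>

```python
# Window sketch: each node is paired with at most its next 64 neighbors,
# capping candidates at roughly 64 * n instead of n^2.
WINDOW = 64

def windowed_pairs(nodes):
    for i in range(len(nodes)):
        for n2 in nodes[i + 1 : i + 1 + WINDOW]:
            yield nodes[i], n2

pairs = list(windowed_pairs(list(range(200))))
print(len(pairs))  # 10720, far below the naive 19900 pairs
```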

<h3 id="can_fuse---eight-categories-of-gate-checks"><code class="language-plaintext highlighter-rouge">can_fuse</code> - Eight Categories of Gate Checks</h3>

<p>Every candidate pair must survive the gauntlet of <code class="language-plaintext highlighter-rouge">can_fuse(node1, node2)</code>. The checks include at least:</p>

<ol>
  <li><strong>Identity</strong>: If <code class="language-plaintext highlighter-rouge">node1 == node2</code>, the pair is skipped immediately.</li>
  <li><strong>Special node block</strong>: Nodes like <code class="language-plaintext highlighter-rouge">FusedMixOrderReductions</code> that have already been fused cannot fuse again.</li>
  <li><strong>Template fast-path</strong>: Template nodes have a dedicated short-circuit that can approve fusion quickly.</li>
  <li><strong>Grouped node ban</strong>: <code class="language-plaintext highlighter-rouge">GroupedSchedulerNode</code> instances are already group-scheduled and barred from further fusion.</li>
  <li><strong>Topological dependency</strong>: The most critical check, ensuring fusion will not violate data-flow ordering.</li>
  <li>The remaining categories: <strong>dtype compatibility</strong>, <strong>memory and size constraints</strong>, and <strong>backend-specific limits</strong>, plus other implementation guards.</li>
</ol>

<p>An interesting detail: if <code class="language-plaintext highlighter-rouge">can_fuse(n1, n2)</code> fails but <code class="language-plaintext highlighter-rouge">n2</code> is a <strong>template or foreach node</strong>, the code retries in the <strong>reversed direction</strong> with <code class="language-plaintext highlighter-rouge">can_fuse(n2, n1)</code>. The reason is that template and foreach nodes are effectively container nodes that can absorb other nodes into themselves, so the fusion direction matters.</p>
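<p>The gate sequence and the reversed-direction retry can be modeled with a toy check. Everything below is a hypothetical miniature: the real <code class="language-plaintext highlighter-rouge">can_fuse</code> runs many more checks, and the dict-based nodes are stand-ins.</p>

```python
def can_fuse(n1, n2):
    """Toy gate check: identity, an already-fused block, and a
    direction-sensitive rule where only 'template' nodes may absorb
    their partner."""
    if n1 is n2:
        return False                         # identity gate
    if n1.get("fused") or n2.get("fused"):
        return False                         # already-fused nodes are barred
    return n1["kind"] == "template"          # only containers absorb others

def try_both_directions(n1, n2):
    """If the forward direction fails and n2 is a template/foreach
    container, retry with the operands reversed."""
    if can_fuse(n1, n2):
        return (n1, n2)
    if n2["kind"] in ("template", "foreach") and can_fuse(n2, n1):
        return (n2, n1)
    return None

pointwise = {"kind": "pointwise"}
template = {"kind": "template"}
# can_fuse(pointwise, template) fails, but the reversed direction
# succeeds because the template can absorb the pointwise node.
```

<p>The retry only fires for container-like partners, which matches the intuition in the text: direction matters precisely because one side does the absorbing.</p>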

<h3 id="aggressive-mode">Aggressive Mode</h3>

<p>When <code class="language-plaintext highlighter-rouge">config.aggressive_fusion</code> is enabled, an additional grouping pass runs based on <code class="language-plaintext highlighter-rouge">node.group</code>. The scheduler considers nodes in the same group to be part of a larger logical unit, making them prime candidates for more aggressive fusion attempts.</p>
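<p>Conceptually, aggressive mode just adds a second bucketing pass keyed by <code class="language-plaintext highlighter-rouge">node.group</code>. The sketch below mirrors that idea only; the tuple representation and function name are invented:</p>

```python
from collections import defaultdict

def group_by_node_group(nodes, aggressive_fusion=False):
    """Extra grouping pass that only runs in aggressive mode, bucketing
    nodes by their `group` key. `nodes` is a list of (name, group)
    tuples."""
    groups = defaultdict(list)
    if aggressive_fusion:
        for name, grp in nodes:
            groups[grp].append(name)
    return groups

nodes = [("a", "g1"), ("b", "g1"), ("c", "g2")]
# Off by default: no extra candidates are generated.
assert group_by_node_group(nodes) == {}
# With aggressive_fusion on, "a" and "b" become candidates via "g1".
```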

<h2 id="phase-2-deduplication-and-scoring">Phase 2: Deduplication and Scoring</h2>

<h3 id="deduplication---get_possible_fusions_with_highest_priority">Deduplication - <code class="language-plaintext highlighter-rouge">get_possible_fusions_with_highest_priority</code></h3>

<p>The same pair <code class="language-plaintext highlighter-rouge">(node1, node2)</code> may be discovered through different grouping paths, for example once from a buffer group and once from a node group. Different paths may imply different fusion strategies, but we only want the best one.</p>

<p>The arbiter is the backend interface <code class="language-plaintext highlighter-rouge">get_backend(device).get_fusion_pair_priority(node1, node2)</code>. This is dynamic dispatch: the code first resolves the backend for the current device, such as CPU or CUDA, and then asks that backend to evaluate the pair priority. The base class returns <code class="language-plaintext highlighter-rouge">0</code> by default, but each backend is free to override this.</p>
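<p>A deduplication pass in that spirit might look like the following. The convention that a smaller number means higher priority is an assumption of this sketch, as is the toy backend override:</p>

```python
def dedup_with_priority(discovered, priority_fn):
    """Collapse duplicate (node1, node2) discoveries, keeping the copy
    whose priority is best (smallest, by this sketch's convention)."""
    best = {}
    for pair in discovered:
        key = frozenset(pair)
        if key not in best or priority_fn(*pair) < priority_fn(*best[key]):
            best[key] = pair
    return list(best.values())

# A toy backend override: the template-first orientation wins.
def toy_priority(a, b):
    return 0 if a == "template" else 1

discovered = [("x", "template"), ("template", "x"), ("y", "z")]
deduped = dedup_with_priority(discovered, toy_priority)
```

<p>With the base backend returning <code class="language-plaintext highlighter-rouge">0</code> for every pair, the priority comparison is a tie everywhere and the first-seen copy survives; an overriding backend changes which orientation is kept.</p>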

<h3 id="scoring---score_fusion_key">Scoring - <code class="language-plaintext highlighter-rouge">score_fusion_key</code></h3>

<p>After deduplication, each remaining candidate pair is scored via <code class="language-plaintext highlighter-rouge">V.choices.score_fusion()</code>. The scoring dimensions are:</p>

<ul>
  <li><strong>Fusion type</strong>: Template fusions, reduction fusions, and other categories carry different weights.</li>
  <li><strong>Estimated memory bandwidth saved</strong>: The core payoff metric, measuring how much data movement can be eliminated.</li>
  <li><strong>Topological distance in the original graph</strong>: Closer pairs are preferred.</li>
</ul>

<p>All candidates are sorted from highest to lowest score. That ordering directly determines the sequence in which fusions are attempted.</p>
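<p>The scoring-then-sorting step can be sketched with a toy score over the three dimensions above. The tuple weights are invented for illustration and do not reflect Inductor's actual <code class="language-plaintext highlighter-rouge">score_fusion</code>:</p>

```python
def score_fusion(candidate):
    """Toy score: a fusion-type weight, estimated bytes of memory
    traffic saved, and proximity (closer is better, so distance is
    negated). Tuples compare lexicographically, so type dominates."""
    type_weight, bytes_saved, distance = candidate["features"]
    return (type_weight, bytes_saved, -distance)

candidates = [
    {"name": "AB", "features": (1, 4096, 1)},
    {"name": "BC", "features": (1, 1024, 3)},
    {"name": "CD", "features": (2, 512, 1)},   # e.g. a template fusion
]
# Highest score first: this ordering is the attempt order in phase 3.
ranked = sorted(candidates, key=score_fusion, reverse=True)
```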

<h2 id="phase-3-attempting-fusions---_try_fusion_pairs">Phase 3: Attempting Fusions - <code class="language-plaintext highlighter-rouge">_try_fusion_pairs</code></h2>

<p>This is where fusions actually happen. <strong>The sorted order is paramount</strong>: candidates are tried from highest score to lowest, and once a node has been consumed by a fusion, any other candidate pair involving that node is automatically invalidated.</p>

<p>For example, suppose the candidate list contains <code class="language-plaintext highlighter-rouge">(A, B)</code> and <code class="language-plaintext highlighter-rouge">(B, C)</code>, with <code class="language-plaintext highlighter-rouge">(A, B)</code> scoring higher. <code class="language-plaintext highlighter-rouge">(A, B)</code> will be fused first, after which <code class="language-plaintext highlighter-rouge">(B, C)</code> becomes infeasible because <code class="language-plaintext highlighter-rouge">B</code> has been absorbed into the fused node <code class="language-plaintext highlighter-rouge">AB</code>.</p>
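<p>The greedy loop with invalidation boils down to a few lines. This is a distilled sketch of the strategy, not the real <code class="language-plaintext highlighter-rouge">_try_fusion_pairs</code>:</p>

```python
def greedy_fuse(ranked_pairs):
    """Attempt fusions from best score to worst; once a node has been
    consumed by a fusion, skip every later pair mentioning it."""
    consumed = set()
    applied = []
    for a, b in ranked_pairs:
        if a in consumed or b in consumed:
            continue                  # invalidated by an earlier fusion
        applied.append((a, b))
        consumed.update((a, b))
    return applied

# With (A, B) ranked above (B, C), B is absorbed first and the
# (B, C) candidate is dropped, exactly as in the example above.
result = greedy_fuse([("A", "B"), ("B", "C")])
```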

<p>This greedy strategy, combined with the carefully designed scoring function, allows Inductor to find a high-quality fusion plan in reasonable time.</p>

<h2 id="takeaways">Takeaways</h2>

<p>The design of <code class="language-plaintext highlighter-rouge">fuse_nodes</code> reflects several engineering trade-offs:</p>

<ul>
  <li><strong>Window optimization</strong> reduces search complexity from <code class="language-plaintext highlighter-rouge">O(n^2)</code> to <code class="language-plaintext highlighter-rouge">O(n)</code> behavior in practice, keeping large graphs tractable.</li>
  <li><strong>Multi-path grouping</strong> with buffer groups, node groups, and aggressive mode captures fusion opportunities at different granularities.</li>
  <li><strong>Backend dispatch</strong> lets CPU and CUDA define their own fusion preferences independently.</li>
  <li><strong>Greedy ordering</strong> uses a simple but effective strategy to arbitrate between competing candidate pairs.</li>
</ul>

<p>At a high level, this is a classic optimization pipeline: <strong>broad search to narrow filtering to greedy decision-making</strong>.</p>]]></content><author><name>Jiakang Huang</name><email>jhuang74@student.ubc.ca</email></author><summary type="html"><![CDATA[A structured walkthrough of how PyTorch Inductor enumerates, scores, and greedily applies graph fusion candidates in fuse_nodes.]]></summary></entry></feed>