Fusion-R1: Empower PyTorch Graph Fusion via LLM with Reinforcement Learning

Authors: Xueyan Zhang, Jiakang Huang, YiTian Ding, Shuhao Guan, Yining Wang, Yuchen Li, Jinman Zhao, Gerald Penn

Status: Under review at NeurIPS 2026
Primary Area: SysML infrastructure
Submission Number: 5708

Abstract: The fusion pass in PyTorch 2.0 TorchInductor is efficient and reliable, but its local pairwise rule-based heuristics may miss globally beneficial operator fusion groups. This paper studies whether Large Language Models (LLMs) can learn graph-level fusion policies rather than merely generate low-level kernel code. To support this study, we construct GraphFusion6K and GraphFusionBench, a training and evaluation suite built from real PyTorch workloads and public kernel benchmarks, covering diverse graph sizes, topologies, routing complexity, tensor ranks, and operator compositions. We further propose Fusion-R1, an LLM-based fusion proposal layer for TorchInductor. Fusion-R1 serializes scheduler graphs into a compact AdjL text format and generates machine-parseable JSONL fusion groups; before being applied to the compiler, these proposals are checked by legality validation and cycle detection. The model is first trained with supervised fine-tuning for cold start, and then optimized with GRPO using rewards based on output-format correctness and legality-preserving node/kernel reduction. On GraphFusionBench, Fusion-R1 improves the speedup over PyTorch eager mode from 1.74× with rule-based TorchInductor to 1.83×, while reducing the average kernel count from 99.10 to 77.18. Fusion-R1 also outperforms general LLM baselines on GraphFusionBench, TorchBench, and TIMM, suggesting that LLMs can serve as a practical graph-level optimization layer for deep learning compilers.
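As an illustration of the validation step described in the abstract, the following is a minimal sketch of how JSONL fusion proposals might be checked for cycles before being handed to the compiler. It is not the authors' implementation: the function names `parse_groups` and `validate_fusion`, the `"group"` key, and the edge-list graph representation are all assumptions made for this example, and TorchInductor's real legality checks cover more than cycle detection. The idea is to contract each proposed group into one super-node and run a topological sort; if the sort cannot visit every super-node, the fusion would introduce a cyclic dependency.

```python
import json
from collections import defaultdict, deque


def parse_groups(jsonl_text):
    """Parse one JSON object per line.

    Assumed (hypothetical) schema: {"group": ["node_a", "node_b", ...]}.
    """
    groups = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            groups.append(list(json.loads(line)["group"]))
    return groups


def validate_fusion(edges, nodes, groups):
    """Return (ok, reason) for a set of proposed fusion groups.

    edges: iterable of (producer, consumer) pairs in the original DAG.
    nodes: set of all node names.
    groups: list of lists of node names proposed for fusion.
    """
    # Map each grouped node to its group; reject unknown or duplicated nodes.
    owner = {}
    for gid, group in enumerate(groups):
        for n in group:
            if n not in nodes:
                return False, f"unknown node {n}"
            if n in owner:
                return False, f"node {n} appears in more than one group"
            owner[n] = ("g", gid)

    def rep(n):
        # Ungrouped nodes act as their own super-node.
        return owner.get(n, ("n", n))

    # Build the contracted graph over super-nodes, dropping intra-group edges.
    supers = {rep(n) for n in nodes}
    succ = defaultdict(set)
    indeg = defaultdict(int)
    for u, v in edges:
        ru, rv = rep(u), rep(v)
        if ru != rv and rv not in succ[ru]:
            succ[ru].add(rv)
            indeg[rv] += 1

    # Kahn's algorithm: a cycle exists iff some super-node is never visited.
    queue = deque(s for s in supers if indeg[s] == 0)
    visited = 0
    while queue:
        s = queue.popleft()
        visited += 1
        for t in succ[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                queue.append(t)

    if visited != len(supers):
        return False, "fusion would create a cycle"
    return True, "ok"


# Toy graph: a -> b -> c, plus a skip edge a -> c.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
nodes = {"a", "b", "c"}
good = validate_fusion(edges, nodes, parse_groups('{"group": ["a", "b"]}'))
bad = validate_fusion(edges, nodes, parse_groups('{"group": ["a", "c"]}'))
```

In the toy graph, fusing `a` with `b` is accepted, while fusing `a` with `c` is rejected because the unfused node `b` would both consume from and feed into the fused group.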
