Move loop ordering after fusion #126255

shunting314 · 2024-05-15T01:09:27Z

🚀 The feature, motivation and pitch

Here is some brief context:
Right now inductor decides loop ordering before fusion. That can lose some fusion opportunities. E.g. if node1 and node2 pick incompatible loop orders before fusion, we can not fuse them. But if we delay the final decision of loop orders and manage to pick consistent loop orders for node1 and node2, we would be able to fuse them.

I create this issue since I've seen people showing interests about inductor supporting loop ordering after fusion previously. I'm wondering if people can provide examples that they found loop ordering after fusion will be helpful. That can help me more effectively develop this feature.

Right now I have some manually constructed examples. But would be nice to develop based on real use cases.

WIP PR: #126254 .

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @Chillee @lezcano @jansel

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

lezcano · 2024-05-15T06:06:25Z

@isuruf @vfdev-5 hit a number of cases when working on decompositions of pooling operations. See if they can repro any.

shunting314 · 2024-05-16T21:19:52Z

@isuruf @vfdev-5 can you share the examples that inductor does not fuse kernels due to loop ordering? I'll make sure they can fuse with my PR

isuruf · 2024-05-17T16:11:28Z

Running

python benchmarks/dynamo/timm_models.py --performance --amp -dcuda -n50 --no-skip --dashboard --batch-size 128 --threads 1 --only adv_inception_v3 --inference --freezing --timeout 9000 --backend=inductor

with the branch https://github.com/isuruf/pytorch/tree/debug_loop_fusion you'll see code of the form

triton_per_fused_avg_pool2d_11.run(buf12, buf22, 30105600, 9, grid=grid(30105600), stream=stream0)
buf23 = buf12; del buf12  # reuse
# Source Nodes: [branch_pool], Original ATen: [aten.avg_pool2d]
triton_poi_fused_avg_pool2d_12.run(buf22, buf23, 30105600, grid=grid(30105600), stream=stream0)

where the two kernels should have been fused. Full output at https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5

Is this okay or do you prefer a smaller reproducer?

shunting314 · 2024-05-17T22:49:31Z

Is this okay or do you prefer a smaller reproducer?

This is already very helpful to show loop ordering issue in real use cases. Although a smaller reproducer will be even more useful for faster iteration and adding as unit test.

shunting314 · 2024-05-29T01:11:54Z

where the two kernels should have been fused. Full output at https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5

I took a look at this example and I think it's different to 'normal' loop ordering issues we considered. The normal loop ordering issues we considered are

kernel A and B both pick the optimal loop orders by just looking at the memory access patterns in itself
A and B can not fuse because they pick different loop orders according to step 1.

But in this avg_pool2d kernel: https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5#file-triton-py-L729-L735 , the kernel should pick a different loop order so that index expression x2 + (192*x8) + (235200*x3) will be flattened to a single index. If inductor does that, then this kernel would be able get fused with the next kernel.

So the main issue here is what cause inductor to pick a 'sub-optimal' loop order for this kernel

shunting314 self-assigned this May 15, 2024

shunting314 added the module: inductor label May 15, 2024

pytorch-bot bot added the oncall: pt2 label May 15, 2024

xmfan added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move loop ordering after fusion #126255

Move loop ordering after fusion #126255

shunting314 commented May 15, 2024 •

edited by pytorch-bot bot

lezcano commented May 15, 2024

shunting314 commented May 16, 2024

isuruf commented May 17, 2024

shunting314 commented May 17, 2024

shunting314 commented May 29, 2024 •

edited

Move loop ordering after fusion #126255

Move loop ordering after fusion #126255

Comments

shunting314 commented May 15, 2024 • edited by pytorch-bot bot

🚀 The feature, motivation and pitch

lezcano commented May 15, 2024

shunting314 commented May 16, 2024

isuruf commented May 17, 2024

shunting314 commented May 17, 2024

shunting314 commented May 29, 2024 • edited

shunting314 commented May 15, 2024 •

edited by pytorch-bot bot

shunting314 commented May 29, 2024 •

edited