Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move loop ordering after fusion #126255

Open
shunting314 opened this issue May 15, 2024 · 5 comments
Open

Move loop ordering after fusion #126255

shunting314 opened this issue May 15, 2024 · 5 comments
Assignees
Labels
module: inductor oncall: pt2 triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@shunting314
Copy link
Contributor

shunting314 commented May 15, 2024

馃殌 The feature, motivation and pitch

Here is some brief context:
Right now inductor decides loop ordering before fusion. That can lose some fusion opportunities. E.g. if node1 and node2 pick incompatible loop orders before fusion, we can not fuse them. But if we delay the final decision of loop orders and manage to pick consistent loop orders for node1 and node2, we would be able to fuse them.

I create this issue since I've seen people showing interests about inductor supporting loop ordering after fusion previously. I'm wondering if people can provide examples that they found loop ordering after fusion will be helpful. That can help me more effectively develop this feature.

Right now I have some manually constructed examples. But would be nice to develop based on real use cases.

WIP PR: #126254 .

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @Chillee @lezcano @jansel

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

@lezcano
Copy link
Collaborator

lezcano commented May 15, 2024

@isuruf @vfdev-5 hit a number of cases when working on decompositions of pooling operations. See if they can repro any.

@xmfan xmfan added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 15, 2024
@shunting314
Copy link
Contributor Author

@isuruf @vfdev-5 can you share the examples that inductor does not fuse kernels due to loop ordering? I'll make sure they can fuse with my PR

@isuruf
Copy link
Collaborator

isuruf commented May 17, 2024

Running

python benchmarks/dynamo/timm_models.py --performance --amp -dcuda -n50 --no-skip --dashboard --batch-size 128 --threads 1 --only adv_inception_v3 --inference --freezing --timeout 9000 --backend=inductor

with the branch https://github.com/isuruf/pytorch/tree/debug_loop_fusion you'll see code of the form

triton_per_fused_avg_pool2d_11.run(buf12, buf22, 30105600, 9, grid=grid(30105600), stream=stream0)
buf23 = buf12; del buf12  # reuse
# Source Nodes: [branch_pool], Original ATen: [aten.avg_pool2d]
triton_poi_fused_avg_pool2d_12.run(buf22, buf23, 30105600, grid=grid(30105600), stream=stream0)

where the two kernels should have been fused. Full output at https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5

Is this okay or do you prefer a smaller reproducer?

@shunting314
Copy link
Contributor Author

Is this okay or do you prefer a smaller reproducer?

This is already very helpful to show loop ordering issue in real use cases. Although a smaller reproducer will be even more useful for faster iteration and adding as unit test.

@shunting314
Copy link
Contributor Author

shunting314 commented May 29, 2024

where the two kernels should have been fused. Full output at https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5

I took a look at this example and I think it's different to 'normal' loop ordering issues we considered. The normal loop ordering issues we considered are

  1. kernel A and B both pick the optimal loop orders by just looking at the memory access patterns in itself
  2. A and B can not fuse because they pick different loop orders according to step 1.

But in this avg_pool2d kernel: https://gist.github.com/isuruf/f0256920468be327e4bf30d0a5503df5#file-triton-py-L729-L735 , the kernel should pick a different loop order so that index expression x2 + (192*x8) + (235200*x3) will be flattened to a single index. If inductor does that, then this kernel would be able get fused with the next kernel.

So the main issue here is what cause inductor to pick a 'sub-optimal' loop order for this kernel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: inductor oncall: pt2 triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

No branches or pull requests

4 participants