Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XPK update for hybridsim #102

Closed
wants to merge 2 commits into from
Closed

XPK update for hybridsim #102

wants to merge 2 commits into from

Conversation

tonyjohnchen
Copy link
Collaborator

@tonyjohnchen tonyjohnchen commented Mar 28, 2024

Fixes / Features

For AOT+Hybridsim integration, we need to pull Hybridsim docker image, so we need to mount /var/run/docker.sock (https://screenshot.googleplex.com/BD3SMwL57tP5cQY)
For running Hybridsim, we need to mount /tmp/xla_dump/ so the Hybridsim docker image can access to HLO graph (https://screenshot.googleplex.com/64Fwrjwt6g55orn)

Testing / Documentation

Tested AOT+Hybridsim e2e on a GKE node: https://screenshot.googleplex.com/5LQwBNBp4p6iyLq

This should be a combo PR of Maxtext and XPK:
google/maxtext#564
#102

  • [ y ] Tests pass
  • [ y ] Appropriate changes to documentation are included in the PR

@@ -123,10 +123,20 @@
volumeMounts:
- mountPath: /dev/shm
name: dshm-2
- mountPath: /var/run/docker.sock
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to talk about the architecture, this doesn't make sense to me to hard code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to avoiding hardcoding this in the common path for TPUs/ CPUs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like both GPU and Pathway mount /tmp folder

https://github.com/google/xpk/blob/main/xpk.py#L325-L327
https://github.com/google/xpk/blob/main/xpk.py#L271-L272

Maybe I should mount /tmp folder instead of /tmp/xla_dump/?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPU / Pathways has specific workload create files (also a sad state of duplication) but that's why it is ok for those mounts. The issue here is we are mounting something that doesn't always exist.

@Obliviour
Copy link
Collaborator

I assume this PR is not needed anymore? @tonyjohnchen let me know if we can close it!

@tonyjohnchen
Copy link
Collaborator Author

I assume this PR is not needed anymore? @tonyjohnchen let me know if we can close it!

Yeah, I can close it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants