
[Feature Request] [stdlib] [proposal] Add unsafe transition from DTypePointer to SIMD (maybe even from List[DType]) #2729

Open
martinvuyk opened this issue May 18, 2024 · 2 comments

@martinvuyk

What is your request?

Say I have a List[UInt8] that I want to process. Let it have 16 items and say it holds UTF-8 digit bytes (XORing each byte with 0x30 yields the numeric value).

I have found no simple, intuitive way to cast a DTypePointer[DType.uint8] to a SIMD vector and XOR it.

The first related issue I raised was #2381, because I had no entry point into SIMD and didn't really understand it.

In issue #2695 I tried

var ptr = list_unsafe_ptr.bitcast[DType.uint64]()
for offset in range(2):
    ptr[offset] ^= 0x3030303030303030

but it doesn't edit the buffer.

What is proposed?

var read: List[UInt8] = file.read_bytes(16)
var items = read.unsafe_simd[16]()
var res = items ^ 0x30

Ways to get there

List[DType]

fn unsafe_simd[list_size: Int, T: DType](owned self: List[Scalar[T]]) -> SIMD[T, list_size]:
    return DTypePointer(self.unsafe_ptr()).unsafe_simd[list_size]()

DTypePointer

fn unsafe_simd[size: Int, T: DType](owned self: DTypePointer[T]) -> SIMD[T, size]:
    # somehow steal data into SIMD
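Absent true ownership stealing, the closest existing approximation is presumably a plain SIMD load, which copies the elements rather than stealing them. A minimal sketch (a standalone function for illustration, not an existing API, assuming load accepts the requested width as in the comment below):

fn unsafe_simd_load[size: Int, T: DType](ptr: DTypePointer[T]) -> SIMD[T, size]:
    # A SIMD load copies the first `size` elements into a vector;
    # it is a copy, not an ownership transfer, so the buffer stays valid.
    return ptr.load[size](0)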

What is your motivation for this change?

Right now many interfaces use List[DType] for many operations. If we provide an intuitive API to go from there to SIMD vectors, it will be much easier to achieve higher performance, since people will actually use SIMD instead of iterating over a List for everything.

Also, I'm not sure, but I think __contains__ methods could become much faster if a DTypePointer could be turned into a SIMD vector and searched in a vectorized loop instead of element by element. Though there would be copy overhead, unless len(iterable) is large or the iterable can be consumed when used.
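For illustration, such a vectorized membership test over a DTypePointer-backed buffer could look roughly like this sketch (vectorized_contains and the width of 16 are hypothetical; the scalar tail handles lengths that are not a multiple of the width):

fn vectorized_contains[T: DType](ptr: DTypePointer[T], size: Int, value: Scalar[T]) -> Bool:
    alias width = 16
    var i = 0
    while i + width <= size:
        # compare `width` elements at once; reduce_or() is True if any lane matches
        if (ptr.load[width](i) == value).reduce_or():
            return True
        i += width
    while i < size:
        # scalar tail for the remaining elements
        if ptr.load(i) == value:
            return True
        i += 1
    return False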

Any other details?

No response

@martinvuyk martinvuyk added enhancement New feature or request mojo-repo Tag all issues with this label labels May 18, 2024
@LJ-9801 (Contributor) commented May 19, 2024

@martinvuyk Not sure if that is what you are referring to, but if you want to process elements of a list in chunks using SIMD, you can divide the list into equal portions and use SIMD operations.

var a = List[UInt8](1, 2, 3, 4, ...)  # list of size 16
for i in range(4):
    # load a SIMD vector of width 4
    var tmp = a.data.load[4](i * 4)
    # do some operation using SIMD
    # ......
    a.data.store[4](i * 4, tmp)

I don't think it makes sense to convert an entire pointer array to SIMD, since SIMD registers on a CPU usually only have a width of 4 (a different story for GPUs, but that's a different programming model and Mojo GPU support is not available yet). The compiler will probably break your SIMD size down into ISA-compatible SIMD widths, but using small SIMD widths for parallelized operations still seems to be the better practice.
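Applying that chunked pattern to the 0x30 XOR from the issue body would look something like this sketch:

var digits = List[UInt8](
    0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38,
    0x39, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36,
)  # 16 UTF-8 digit bytes
for i in range(4):
    # process 4 bytes per iteration
    var tmp = digits.data.load[4](i * 4)
    tmp ^= 0x30  # digit byte -> numeric value
    digits.data.store[4](i * 4, tmp)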

@martinvuyk (Author) commented

@martinvuyk Not sure if that is what you are referring to, but if you want to process elements of a list in chunks using SIMD, you can divide the list into equal portions and use SIMD operations.

var a = List[UInt8](1, 2, 3, 4, ...)  # list of size 16
for i in range(4):
    # load a SIMD vector of width 4
    var tmp = a.data.load[4](i * 4)
    # do some operation using SIMD
    # ......
    a.data.store[4](i * 4, tmp)

I didn't know you could do a vectorized load from a pointer like that, pretty neat.

I don't think it makes sense to convert an entire pointer array to SIMD, since SIMD registers on a CPU usually only have a width of 4 (a different story for GPUs, but that's a different programming model and Mojo GPU support is not available yet). The compiler will probably break your SIMD size down into ISA-compatible SIMD widths, but using small SIMD widths for parallelized operations still seems to be the better practice.

I think the stdlib itself should allow using huge SIMD vectors regardless of the underlying architecture, send the function itself to the CPU/accelerator, and let the compiler optimize there.

This still requires a for loop and index access. What I meant was to do the equivalent of C's memcpy from one buffer to the other directly, without any loops or index access. I have no idea whether the underlying memory layout of DTypePointer's pointee is the same as the SIMD vector's, so a simple API to unsafely go from one to the other would be useful IMO.
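For the 16-byte case, the loop-free equivalent would presumably be a single wide load and store (a sketch reusing file.read_bytes from the proposal above, and assuming load/store accept the full width):

var read: List[UInt8] = file.read_bytes(16)
var ptr = DTypePointer(read.unsafe_ptr())
# one 16-lane load, one XOR, one store: no per-element loop
var items = ptr.load[16](0)
ptr.store[16](0, items ^ 0x30)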
