
[Feature Request] [stdlib] [proposal] Add unsafe transition from DTypePointer to SIMD (maybe even from List[DType]) #2729

Open
martinvuyk opened this issue May 18, 2024 · 2 comments

@martinvuyk

What is your request?

Say I have a List[UInt8] that I want to process. Let it have 16 items and say it holds UTF-8 digit bytes (XORing each byte with 0x30 yields the numeric value).

I have found no simple, intuitive way to cast a DTypePointer[DType.uint8] to a SIMD vector and XOR it.

The first related issue I raised was #2381, because I had no entry point into SIMD and didn't really understand it.

In issue #2695 I tried

var ptr = list_unsafe_ptr.bitcast[DType.uint64]()
for offset in range(2):
    ptr[offset] ^= 0x3030303030303030

but it doesn't edit the buffer.

What is proposed?

var read: List[UInt8] = file.read_bytes(16)
var items = read.unsafe_simd[16]()
var res = items ^ 0x30

Ways to get there

List[DType]

fn unsafe_simd[list_size: Int, T: DType](owned self: List[Scalar[T]]) -> SIMD[T, list_size]:
    return DTypePointer(self.unsafe_ptr()).unsafe_simd[list_size]()

DTypePointer

fn unsafe_simd[size: Int, T: DType](owned self: DTypePointer[T]) -> SIMD[T, size]:
    # somehow steal data into SIMD
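Absent true ownership stealing, the closest existing approximation is presumably a plain SIMD load, which copies the elements rather than stealing them. A minimal sketch (a standalone function for illustration, not an existing API, assuming load accepts the requested width as in the comment below):

fn unsafe_simd_load[size: Int, T: DType](ptr: DTypePointer[T]) -> SIMD[T, size]:
    # A SIMD load copies the first `size` elements into a vector;
    # it is a copy, not an ownership transfer, so the buffer stays valid.
    return ptr.load[size](0)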

What is your motivation for this change?

Right now many interfaces use List[DType] for many operations. If we provide an intuitive API to go from there to SIMD vectors, it will be much easier to achieve higher performance, since people will actually use SIMD instead of iterating over a List for everything.

Also, I'm not sure, but I think __contains__ methods could become much faster if a DTypePointer could be turned into a SIMD vector and searched in a vectorized loop instead of element by element. Though there would be copy overhead, unless len(iterable) is large or the iterable can be consumed when used.
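For illustration, such a vectorized membership test over a DTypePointer-backed buffer could look roughly like this sketch (vectorized_contains and the width of 16 are hypothetical; the scalar tail handles lengths that are not a multiple of the width):

fn vectorized_contains[T: DType](ptr: DTypePointer[T], size: Int, value: Scalar[T]) -> Bool:
    alias width = 16
    var i = 0
    while i + width <= size:
        # compare `width` elements at once; reduce_or() is True if any lane matches
        if (ptr.load[width](i) == value).reduce_or():
            return True
        i += width
    while i < size:
        # scalar tail for the remaining elements
        if ptr.load(i) == value:
            return True
        i += 1
    return False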

Any other details?

No response

@martinvuyk martinvuyk added enhancement New feature or request mojo-repo Tag all issues with this label labels May 18, 2024
@LJ-9801 (Contributor) commented May 19, 2024

@martinvuyk Not sure if that is what you are referring to, but if you want to process elements of a list in chunks using SIMD, you can divide the list into equal portions and use SIMD operations.

var a = List[UInt8](1, 2, 3, 4, ...)  # list of size 16
for i in range(4):
    # load a SIMD vector of width 4
    var tmp = a.data.load[4](i * 4)
    # do some operation using SIMD
    # ......
    a.data.store[4](i * 4, tmp)

I don't think it makes sense to convert an entire pointer array to SIMD, since SIMD registers on a CPU usually only have a width of 4 (a different story for GPUs, but that's a different programming model and Mojo GPU support is not available yet). The compiler will probably break your SIMD size down into ISA-compatible SIMD widths, but using small SIMD widths for parallelized operations still seems to be the better practice.
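Applying that chunked pattern to the 0x30 XOR from the issue body would look something like this sketch:

var digits = List[UInt8](
    0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38,
    0x39, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36,
)  # 16 UTF-8 digit bytes
for i in range(4):
    # process 4 bytes per iteration
    var tmp = digits.data.load[4](i * 4)
    tmp ^= 0x30  # digit byte -> numeric value
    digits.data.store[4](i * 4, tmp)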

@martinvuyk (Author) commented

@martinvuyk Not sure if that is what you are referring to, but if you want to process elements of a list in chunks using SIMD, you can divide the list into equal portions and use SIMD operations.

var a = List[UInt8](1, 2, 3, 4, ...)  # list of size 16
for i in range(4):
    # load a SIMD vector of width 4
    var tmp = a.data.load[4](i * 4)
    # do some operation using SIMD
    # ......
    a.data.store[4](i * 4, tmp)

I didn't know you could do a vectorized load from a pointer like that, pretty neat.

I don't think it makes sense to convert an entire pointer array to SIMD, since SIMD registers on a CPU usually only have a width of 4 (a different story for GPUs, but that's a different programming model and Mojo GPU support is not available yet). The compiler will probably break your SIMD size down into ISA-compatible SIMD widths, but using small SIMD widths for parallelized operations still seems to be the better practice.

I think the stdlib itself should allow using huge SIMD vectors regardless of the underlying architecture, send the function itself to the CPU/accelerator, and let the compiler optimize there.

This still requires a for loop and index access. What I meant was to do the equivalent of C's memcpy from one buffer to the other directly, without any loops or index access. I have no idea whether the underlying memory layout of DTypePointer's pointee is the same as the SIMD vector's, so a simple API to unsafely go from one to the other would be useful IMO.
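For the 16-byte case, the loop-free equivalent would presumably be a single wide load and store (a sketch reusing file.read_bytes from the proposal above, and assuming load/store accept the full width):

var read: List[UInt8] = file.read_bytes(16)
var ptr = DTypePointer(read.unsafe_ptr())
# one 16-lane load, one XOR, one store: no per-element loop
var items = ptr.load[16](0)
ptr.store[16](0, items ^ 0x30)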
