Double occurrence of parameters in kernels generated with the CLI tool #557

ThrudPrimrose opened this issue Oct 8, 2023 · 0 comments

ThrudPrimrose commented Oct 8, 2023

If I generate a kernel with the following command in the CLI tool:
taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor

In the generated code, the launcher function compute takes the parameter C twice. Here is the generated code:


// Generated by the Tensor Algebra Compiler (tensor-compiler.org)

__global__
void computeDeviceKernel0(taco_tensor_t * __restrict__ A, taco_tensor_t * __restrict__ B, taco_tensor_t * __restrict__ C, taco_tensor_t * __restrict__ w){
  int A2_dimension = (int)(A->dimensions[1]);
  int A3_dimension = (int)(A->dimensions[2]);
  float* __restrict__ A_vals = (float*)(A->vals);
  int B2_dimension = (int)(B->dimensions[1]);
  int B3_dimension = (int)(B->dimensions[2]);
  int B4_dimension = (int)(B->dimensions[3]);
  float* __restrict__ B_vals = (float*)(B->vals);
  int C1_dimension = (int)(C->dimensions[0]);
  int C2_dimension = (int)(C->dimensions[1]);
  int C3_dimension = (int)(C->dimensions[2]);
  float* __restrict__ C_vals = (float*)(C->vals);
  int w1_dimension = (int)(w->dimensions[0]);
  int w2_dimension = (int)(w->dimensions[1]);
  float* __restrict__ w_vals = (float*)(w->vals);

  int32_t i161 = blockIdx.x;
  int32_t i162 = (threadIdx.x % (256));
  if (threadIdx.x >= 256) {
    return;
  }

  int32_t i = i161 * 256 + i162;
  if (i >= C1_dimension)
    return;

  for (int32_t j = 0; j < C2_dimension; j++) {
    int32_t jC = i * C2_dimension + j;
    for (int32_t b = 0; b < C3_dimension; b++) {
      int32_t bC = jC * C3_dimension + b;
      float tl_val = 0.0;
      for (int32_t l = 0; l < B3_dimension; l++) {
        int32_t jA = l * A2_dimension + j;
        int32_t bA = jA * A3_dimension + b;
        float tk_val = 0.0;
        for (int32_t k = 0; k < w1_dimension; k++) {
          int32_t kB = i * B2_dimension + k;
          int32_t lB = kB * B3_dimension + l;
          int32_t bB = lB * B4_dimension + b;
          int32_t bw = k * w2_dimension + b;
          tk_val = tk_val + (A_vals[bA] * B_vals[bB]) * w_vals[bw];
        }
        tl_val = tl_val + tk_val;
      }
      C_vals[bC] = C_vals[bC] + tl_val;
    }
  }
}

int compute(taco_tensor_t *C, taco_tensor_t *A, taco_tensor_t *B, taco_tensor_t *w, taco_tensor_t *C) {
  int C1_dimension = (int)(C->dimensions[0]);

  computeDeviceKernel0<<<(C1_dimension + 255) / 256, 256>>>(A, B, C, w);
  cudaDeviceSynchronize();
  return 0;
}
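For reference, here is what I expect the expression to compute, written as a NumPy sketch (small stand-in dimensions instead of the real 32/32/32/25866 ones; the sum runs over the indices k and l, which appear only on the right-hand side):

```python
import numpy as np

# Small stand-in dimensions; the real ones are 32, 32, 32, 25866.
I, J, K, L, BATCH = 4, 4, 4, 4, 5

rng = np.random.default_rng(0)
A = rng.random((L, J, BATCH)).astype(np.float32)  # A(l, j, b)
B = rng.random((I, K, L, BATCH)).astype(np.float32)  # B(i, k, l, b)
w = rng.random((K, BATCH)).astype(np.float32)  # w(k, b)
C = rng.random((I, J, BATCH)).astype(np.float32)  # C(i, j, b)

# C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b),
# reducing over k and l, with b acting as the batch index.
C_out = C + np.einsum("ljb,iklb,kb->ijb", A, B, w)
```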

I built taco from source with gcc-11 (optimization flags -fPIC -O3), with pybind11 and the CUDA toolkit bundled with nvhpc 23.9. The commit hash (output of git log -n 1) is:
git log -n 1
commit 2b8ece4

One more question: I wanted to use the index "b" to express batched tensor contractions here, and I hoped the kernel would distribute the workload over the last dimension of C, since it is by far the largest one; that is also why I passed it on the command line. What am I doing wrong? Do I need to provide a schedule, and if so, what should it look like?

I tried:

taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor -s="parallelize(b, GPUBlock, NoRaces)"
// Generated by the Tensor Algebra Compiler (tensor-compiler.org)

terminate called after throwing an instance of 'taco::TacoException'
  what():  Compiler bug at /home/primrose/Installed/taco/src/codegen/codegen_cuda.cpp:374 in visit
Please report it to developers
 Condition failed: blockIDVars.size() == threadIDVars.size()
 No matching GPUThread parallelize 
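From the error message I guess that the CUDA backend requires every GPUBlock parallelize to be paired with a GPUThread parallelize. Is something like the following schedule the intended usage? (Untested sketch; the split factor 256, the variable names b0/b1, and the NoRaces strategy are my assumptions.)

```shell
taco "C(i, j, b) = C(i, j, b) + A(l, j, b) * B(i, k, l, b) * w(k, b)" -cuda \
  -d=A:32,32,25866 -d=B:32,32,32,25866 -d=C:32,32,25866 -d=w:32,25866 \
  -t=A:float -t=B:float -t=C:float -t=w:float -print-nocolor \
  -s="split(b,b0,b1,256),parallelize(b0,GPUBlock,NoRaces),parallelize(b1,GPUThread,NoRaces)"
```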
