Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add bitpack encoding for LanceV2 #2333

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

albertlockett
Copy link
Contributor

Work in progress

TODO

  • improve tests
  • support signed types
  • handle case where buffer is all 0s
  • handle case where num compressed bits = num uncompressed bits

@github-actions github-actions bot added the enhancement New feature or request label May 13, 2024
Copy link

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Nice work so far. This looks like the correct general approach to me. Still some details to work out but nothing looks out of place.

_all_null: &mut bool,
) {
// TODO -- not sure if this is correct
buffers[0].0 = self.uncompressed_bits_per_value / 8 * num_rows as u64;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works as long as uncompressed_bits_per_value is a multiple of 8 and, for now, it should always be so. If we have to start handling cases where it isn't we will need to update this.

}

fn decode_into(&self, rows_to_skip: u32, num_rows: u32, dest_buffers: &mut [BytesMut]) {
let mut bytes_to_skip = rows_to_skip as u64 * self.bits_per_value / 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rows_to_skip * self.bits_per_value isn't always going to be a multiple of 8. What happens when it isn't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this logic wasn't correct. Reworked the decode_into method

Comment on lines 111 to 314
// pre-add enough capacity to the buffer to hold all the values we're about to put in it
let capacity_to_add = dst.capacity() as i64 - dst.len() as i64 + num_rows as i64;
if capacity_to_add > 0 {
let bytes_to_add =
capacity_to_add as usize * self.uncompressed_bits_per_value as usize / 8;
dst.extend((0..bytes_to_add).into_iter().map(|_| 0));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need to do this. As long as update_capacity is returning a valid value then you should be able to safely assume the capacity is already there.

That being said, it doesn't really hurt to have this code. Maybe simpler to just put a debug_assert checking that there is enough capacity.

Comment on lines +120 to +325
let mut mask = 0u64;
for _ in 0..self.bits_per_value {
mask = mask << 1 | 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this means you have a limit of 64 bits per value. This is probably fine but you should add a debug_assert somewhere verifying this.

Comment on lines 196 to 428
fn num_buffers(&self) -> u32 {
// TODO ask weston what this is about
1
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 is correct. There are some cases (e.g. dictionary encoding) where we encode 1 input buffer into 2 output buffers.


// additional metadata that should be present if bitpacking is used
optional BitpackMeta bitpack_meta = 4;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: I think of bitpacking less as an extension of Flat and more as it's own encoding that has another array encoding inside of it (like fixed_size_list). I don't know of any concrete reason that's better but I like thinking of these as small composable pieces rather than one piece with lots of options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, made this change

let mut dest = vec![BytesMut::new()];
unit.decode_into(0, 7, &mut dest);

println!("{:?}", dest);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: convert to an assert when ready to move out of draft.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted this.. we have other tests covering this code path

Comment on lines 92 to 107
let mut packed_arrays = vec![];
for arr in arrays {
let packed = pack_array(arr.clone(), num_bits)?;
packed_arrays.push(packed.into());
}

let data_type = arrays[0].data_type();
let bits_per_value = 8 * data_type.byte_width() as u64;

Ok(EncodedBuffer {
bits_per_value: num_bits,
parts: packed_arrays,
bitpack_meta: Some(pb::BitpackMeta {
uncompressed_bits_per_value: bits_per_value,
}),
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to conditionally bitpack based on the whether num_bits is less than "native num bits" if that makes sense? E.g. if a number is using the full range then don't bitpack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure -- made this change

T: ArrowPrimitiveType,
T::Native: PrimInt + AsPrimitive<u64>,
{
let max = arrow::compute::bit_or(arr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well this is convenient :)

Comment on lines 155 to 161
let buffers = data.buffers();
let mut packed_buffers = vec![];
for buffer in buffers {
let packed_buffer = pack_bits(&buffer, num_bits, byte_len);
packed_buffers.push(packed_buffer);
}
packed_buffers.concat()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only want to pack the values buffer, I think this will also try and pack the validity buffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're actually OK here. This gets passed the result of array.to_data() here:

match arr.data_type() {
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => Ok(
pack_buffers(arr.to_data(), num_bits, arr.data_type().byte_width()),
),

And the validity buffer doesn't get included in that result. For example:

  let arr = UInt16Array::from(vec![Some(1), None, Some(2)]);
  let data = arr.to_data();
  let buffers = data.buffers();
  for buffer in buffers {
      println!("{:?}", buffer);
  }  

prints:

Buffer { data: Bytes { ptr: 0x124704e80, len: 6, data: [1, 0, 0, 0, 2, 0] }, ptr: 0x124704e80, length: 6 }

@albertlockett albertlockett changed the title feat: Add bitpac encoding for LanceV2 feat: Add bitpack encoding for LanceV2 May 13, 2024
// if if there's a partial byte at the end, we need to allocate one more byte
if (src_items * num_bits as usize) % 8 != 0 {
dst_bytes_total += 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do a divide_round_up here? like (src_items * num_bits as usize + 7) / 8;

// we wrote a partial byte
if bit_len % num_bits != 0 {
partial_bytes_written += 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we can have a general divide_round_up function

@albertlockett albertlockett changed the title feat: Add bitpack encoding for LanceV2 feat: add bitpack encoding for LanceV2 May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants