
How to integrate DuckDB for text-based search & filtering #71

Open
davidnmora opened this issue Feb 28, 2023 · 3 comments


davidnmora commented Feb 28, 2023

Adding my Slack convo with @bmschmidt here for public reference

Ben's general guidance:

What I would do is create a new float column of 1s and 0s in duckdb for every tile in the dataset.

That happens by applying a transformation to the data, but you hopefully won’t need to do it manually.
But there are a couple of ways: if the percentage of matches in the dataset is pretty low, you could use add_sparse_identifiers to add a set of matching ids.

You could also manually define a transformation to do it. add_tiled_column does something quite similar: it defines a new transformation on the Dataset that consumes a tile. As with that function, you probably don’t want to run a different duckdb search for every tile; instead, run a single search, then write some logic that creates an object called `records` mapping each tile identifier to a JS-native Float32Array (usually not too hard to pull out of an Arrow Float32 column). One way to build `records` is sketched after the snippet below:

    this.transformations[field_name] = function (tile) {
      const { key } = tile;
      const length = tile.record_batch.numRows;
      // an object you could create called `records` which keys from the tile identifier to a JS native Float32 Array
      const array = records[key];
      return bind_column(tile.record_batch, field_name, array);
    };
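
For concreteness, a minimal sketch of how that `records` object might be built from a single search. This is illustrative only, not deepscatter API: it assumes each tile exposes a `key` and a `record_batch` with an `ix` column of row ids, and that `matching_ids` is a Set of ids returned by one duckdb query.

    // Hypothetical helper: build `records` (tile key -> Float32Array) from the
    // result of a single search. Assumes an `ix` column of row ids on each tile,
    // and `matching_ids`, a Set of ids returned by one duckdb query.
    function build_records(tiles, matching_ids) {
      const records = {};
      for (const tile of tiles) {
        const length = tile.record_batch.numRows;
        const array = new Float32Array(length); // zero-filled: 0.0 = no match
        // Note: depending on the Arrow type of `ix`, get(i) may return a BigInt.
        const ids = tile.record_batch.getChild('ix');
        for (let i = 0; i < length; i++) {
          if (matching_ids.has(ids.get(i))) {
            array[i] = 1; // encode a match as a float
          }
        }
        records[tile.key] = array;
      }
      return records;
    }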

This is a new part of the codebase and not documented yet, so let me know if that’s not enough to go on.

The point of defining transformations at the tile level like this is that it staggers the load of updating the WebGL buffers across multiple animation ticks; it would also be possible to pull back the abstraction a little bit and simply update the record batches in place by adding a column for each one all at the same time when the duckdb results came back. For less than a million rows, it’s probably about the same--but at higher numbers of points, it would get painful waiting for transformations to be applied way down the chain.

All this requires no changes to the core deepscatter code. But integrating duckdb might require getting a little into the weeds with Arrow, since on the JS side the core unit in deepscatter is basically not the ‘point’ but actually the tile, which corresponds to a single Arrow record batch.

Only in the last couple months have I settled on a strategy for altering these record batches, which is this new ‘transformations’ binding.

@bmschmidt (Collaborator)

There's also some movement here I should mention.

Transformations on a working branch are allowed to return promises. Since duckdb-wasm queries are async, that means you could package an Arrow record batch on a tile, send it to duckdb with arbitrary SQL, and then get a record batch back.

@bmschmidt (Collaborator)

So this is a little more straightforward now, although the API is still in flux. Here's what the steps would be on this new branch.

  1. Define a transformation function that works on Tile objects with a signature (tile : Tile) => Promise<Float32Array>, where the array is the same length as the tile. For example, you could do:
[...]
scatterplot._root.transformations['has dog in it'] = async function(tile) {
  const output = new Float32Array(tile.record_batch.numRows);
  // The column being searched.
  const all_rows = tile.record_batch.getChild("full_text");
  // The reason to use duckdb is that this step--deserializing a lot of text
  // from Arrow UTF-8 to JavaScript UTF-16 strings--is *extremely* slow.
  const all_strings = all_rows.toArray();
  let i = 0;
  for (const string of all_strings) {
    if (string.match(/dog/)) {
      output[i] = 1; // Store a match as a float.
    }
    i++;
  }
  return output; // This array will be attached to the record batch at render-time, lazily.
}

(It's allowed for the transformation not to be async, which is what you'd want in this simple JS regex case. But with duckdb you would ship the array of full_text to the db, and then get back the results of a db query as Arrow.)
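
A rough sketch of what that duckdb round trip could look like. This is illustrative, not a documented API path: it assumes `conn` is an already-open duckdb-wasm connection, `arrow` is the apache-arrow package, the temporary table name is made up, and the exact Table constructor varies across Arrow JS versions.

// Sketch only: ship the tile's record batch to duckdb-wasm, run SQL, get Arrow back.
// Assumes `conn` (an open duckdb-wasm connection) and `import * as arrow from 'apache-arrow'`.
scatterplot._root.transformations['has dog in it'] = async function(tile) {
  // Package the tile's record batch as a temporary duckdb table.
  await conn.insertArrowTable(new arrow.Table(tile.record_batch), { name: 'tile_tmp' });
  // Arbitrary SQL; the result comes back as an Arrow table.
  const result = await conn.query(
    "SELECT CAST(full_text LIKE '%dog%' AS FLOAT) AS match FROM tile_tmp"
  );
  await conn.query('DROP TABLE tile_tmp');
  // Hand back a Float32Array the same length as the tile.
  return Float32Array.from(result.getChild('match').toArray());
};

In practice you'd want a per-tile table name, since transformations on different tiles can run concurrently, and you'd reuse a single connection across tiles.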

  2. Then actually plot it, which causes the transformation to be run on tiles as they're needed:

scatterplot.plotAPI({
  encoding: {
    foreground: {
      field: 'has dog in it',
      op: 'eq',
      a: 1,
    },
  },
});

@bmschmidt (Collaborator)

If anyone's following along at home, David and I got a prototype of this running over the weekend. https://observablehq.com/d/cae8e4a3a8b7d4db The hardest part turned out to be misalignment of Arrow versions among arrow-js, duckdb, and deepscatter.
