feat(python): incremental indexing for fts index #761

Draft
wants to merge 157 commits into base: main

Conversation

changhiskhan
Contributor

  • append
  • delete
  • merge
  • update
  • compaction
  • checkout
  • restore
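
A minimal sketch of how these operations look from the lancedb Python API (method names reflect my reading of the public API at the time and are meant only to illustrate which writes an incremental FTS index has to track, not this PR's implementation):

```python
import lancedb

db = lancedb.connect("./.lancedb")
tbl = db.create_table("docs", data=[{"id": 1, "text": "hello world"}])
tbl.create_fts_index("text")  # initial full-text index

# Each of these table-level writes should be reflected in the FTS index
# incrementally instead of requiring a full rebuild:
tbl.add([{"id": 2, "text": "another document"}])                 # append
tbl.delete("id = 1")                                             # delete
tbl.update(where="id = 2", values={"text": "updated document"})  # update
# merge, compaction, checkout, and restore similarly need the index to be
# brought back in sync with the table data.
```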

AyushExel and others added 30 commits October 8, 2023 23:11
Co-authored-by: Lance Release <lance-dev@lancedb.com>
Co-authored-by: Rob Meng <rob.xu.meng@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: rmeng <rob@lancedb.com>
Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
If you run the README JavaScript example in TypeScript, it complains
that the type of `limit` is a function and cannot be set to a number.
Add `to_list` to return query results as a list of Python dicts (so we're
not too pandas-centric). Closes #555

Add `to_pandas` API and add deprecation warning on `to_df`. Closes #545

Co-authored-by: Chang She <chang@lancedb.com>
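
A quick usage sketch of the two result formats (the table name, column names, and query vector here are placeholders):

```python
import lancedb

db = lancedb.connect("./.lancedb")
tbl = db.create_table("vectors", data=[{"vector": [0.1, 0.2], "item": "a"}])

# Results as a list of plain Python dicts (not pandas-centric):
rows = tbl.search([0.1, 0.2]).limit(10).to_list()

# Results as a pandas DataFrame; to_df() still works but now warns that it
# is deprecated in favor of to_pandas():
df = tbl.search([0.1, 0.2]).limit(10).to_pandas()
```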
A little verbose, but better than being non-discoverable 
![Screenshot from 2023-10-11 16-26-02](https://github.com/lancedb/lancedb/assets/15766192/9ba539a7-0cf8-4d9e-94e7-ce5d37c35df0)
This PR adds an overview of the embeddings docs:
- 2 ways to vectorize your data using LanceDB - explicit & implicit (sketched below)
- Explicit - manually vectorize your data using the `with_embeddings` function
- Implicit - automatically vectorize your data as it comes in, by ingesting
your embedding function details as table metadata
- Multi-modal example w/ disappearing embedding function
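
A rough sketch of the two paths (the embedding model, registry key, and column names here are placeholders, and the exact signatures of `with_embeddings` and the registry objects may differ from what's shown):

```python
import lancedb
import numpy as np
import pandas as pd
from lancedb.embeddings import with_embeddings, get_registry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("./.lancedb")

# Explicit: vectorize the data yourself (here with a toy embedding
# function) before creating the table.
def embed(texts):
    return np.array([[float(len(t)), 1.0] for t in texts])

data = with_embeddings(embed, pd.DataFrame({"text": ["hello", "world"]}))
db.create_table("explicit", data=data)

# Implicit: register an embedding function; its details are stored as
# table metadata and applied automatically as data is ingested.
model = get_registry().get("sentence-transformers").create()

class Doc(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

tbl = db.create_table("implicit", schema=Doc)
tbl.add([{"text": "hello"}, {"text": "world"}])
```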
Add more APIs to remote table for Node SDK
* `add` rows
* `overwrite` table with rows
* `create` table

This has been tested against the dev stack
Fix broken link to embedding functions

testing: verified after a local docs build that the broken link has been
repaired

---------

Co-authored-by: Chang She <chang@lancedb.com>
Allows creation of funnels and user journeys
Sets things up for this -> #579
- Just separates out the registry/ingestion code from the function
implementation code
- adds a `get_registry` util
- package name "open-clip" -> "open-clip-torch"
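
For example, the new util gives a one-liner for grabbing an embedding function from the registry (the registry key shown is the one whose package was renamed above; `create()` arguments are left at their defaults):

```python
from lancedb.embeddings import get_registry

# Fetch the OpenCLIP embedding function; its implementation now depends on
# the "open-clip-torch" package rather than "open-clip".
registry = get_registry()
clip = registry.get("open-clip").create()
```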
To include latest v0.8.6

Co-authored-by: Chang She <chang@lancedb.com>
closes #564

---------

Co-authored-by: Chang She <chang@lancedb.com>
wjones127 and others added 29 commits December 20, 2023 13:06
This brings in some important bugfixes related to `take` and aarch64
Linux. See changes at:
https://github.com/lancedb/lance/releases/tag/v0.9.1
This forces the user to replace the whole FTS directory when re-creating
the index, preventing duplicate data from being created. Previously, the
whole dataset was re-added to the existing index, duplicating existing
rows in the index.

This (in combination with lancedb/lance#1707) caused #726, since the
duplicate data emitted duplicate indices for `take()` and an upstream
issue caused those queries to fail.

This solution isn't ideal, since it makes the FTS index temporarily
unavailable while the index is built. In the future, we should have
multiple FTS index directories, which would allow atomic commits of new
indexes (as well as multiple indexes for different columns).

Fixes #498.
Fixes #726.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
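
A hedged sketch of what this looks like from the Python API (`tbl` is an existing table with a `text` column; I'm assuming the `replace` keyword on `create_fts_index`, and the exact flag may differ):

```python
# Rebuilding the full-text index now replaces the FTS directory wholesale
# instead of re-adding every row to the existing Tantivy index, which used
# to duplicate rows in the index.
tbl.create_fts_index("text", replace=True)

# Note: the index is temporarily unavailable while the rebuild runs.
hits = tbl.search("duplicate rows").limit(5).to_list()
```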
This PR adds issue templates, which help with two recurring issues:

* Users forget to tell us whether they are using the Node or Python SDK
* Issues don't get appropriate tags

This doesn't force the use of the templates. Because we set
`blank_issues_enabled: true`, users can still create a custom issue.
Use pathlib for local paths so that the correct path separator is used on
Windows.

Closes #703

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
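
Illustrative of the change (not the exact code from this PR): join local paths with `pathlib.Path` so the separator is handled per-platform.

```python
from pathlib import Path

def table_uri(base_dir: str, table_name: str) -> str:
    # Path inserts the right separator ("\\" on Windows, "/" elsewhere)
    # instead of hard-coding "/" in string concatenation.
    return str(Path(base_dir) / f"{table_name}.lance")

# On Windows: table_uri(r"C:\data\db", "docs") -> r"C:\data\db\docs.lance"
```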
This is a pretty direct binding to the underlying lance capability.
This command hasn't been run for a while...
Fix some grammar, punctuation, and spelling errors.
For object detection, each row may correspond to an image and each image
can have multiple bounding boxes of x-y coordinates. This means that a
`bbox` field is potentially "list of list of float". This adds support
in our pydantic-pyarrow conversion for nested lists.
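
A sketch of the kind of model this enables (the conversion helper shown is from `lancedb.pydantic` as I understand it; field names are made up):

```python
from typing import List

import pyarrow as pa
from lancedb.pydantic import LanceModel, pydantic_to_schema

class Detection(LanceModel):
    image_uri: str
    # One image, many boxes: each box is e.g. [xmin, ymin, xmax, ymax],
    # so the field is a list of lists of floats.
    bbox: List[List[float]]

schema = pydantic_to_schema(Detection)
# The nested annotation maps to a nested pyarrow list type,
# e.g. list<item: list<item: double>>.
assert pa.types.is_list(schema.field("bbox").type)
```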
Closes #721 

FTS returns results as a pyarrow table. Pyarrow tables have a `filter`
method, but it does not take SQL filter strings (only pyarrow compute
expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: if duckdb is available, use duckdb to execute the SQL
filter string on the pyarrow table.
Backup path: otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`.

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20 ms on SSD) to write the
result set to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, which makes the post-filter evaluation much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from the SQL string. Or,
if there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA).

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
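
A sketch of the two post-filter paths on an FTS result table (column names and paths are made up; duckdb's replacement scan over local pyarrow variables and `lance.write_dataset`/`to_table(filter=...)` are standard calls, but this is not the PR's exact code):

```python
import pyarrow as pa

fts_results = pa.table({"foo": [1, 5, 5], "text": ["a", "b", "c"]})

# Default path: duckdb executes the SQL filter string directly against the
# in-memory pyarrow table (it resolves `fts_results` by variable name).
import duckdb
filtered = duckdb.query("SELECT * FROM fts_results WHERE foo = 5").to_arrow_table()

# Backup path: write the results to a lance dataset and use its filter
# support; this costs an extra write (~20 ms on an SSD).
import lance
ds = lance.write_dataset(fts_results, "/tmp/fts_results.lance")
filtered = ds.to_table(filter="foo = 5")
```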
If you add timezone information in the `Field` annotation for a datetime,
it will now be passed to the pyarrow data type.

I'm not sure how pyarrow enforces timezones; right now it silently
coerces to the timezone given in the column, regardless of whether the
input had a matching timezone. This is probably not the right behavior.
Though we could just make the user do that validation in the pydantic
model instead of at the pyarrow conversion layer.
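
The coercion described above, shown at the pyarrow layer (a small illustration of the behavior, not lancedb code):

```python
from datetime import datetime, timezone
import pyarrow as pa

# The column type carries the timezone from the Field annotation.
ts_type = pa.timestamp("us", tz="America/New_York")

# An input carrying a different timezone (UTC here) is accepted and
# silently converted into the column's timezone instead of being rejected.
val = datetime(2023, 12, 20, 12, 0, tzinfo=timezone.utc)
arr = pa.array([val], type=ts_type)
```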
The API has changed significantly; namely, `openai.Embedding.create` no
longer exists. See openai/openai-python#742.

Update the OpenAI embedding function and put a minimum on the openai SDK
version.
Issue separate requests under the hood and concatenate the results.
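
For reference, the post-1.0 call shape the embedding function has to use now (the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# openai.Embedding.create(...) is gone in openai>=1.0; the client-based
# call replaces it. Large batches can be handled by issuing separate
# requests and concatenating the resulting vectors.
resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["hello world", "incremental indexing"],
)
vectors = [d.embedding for d in resp.data]
```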