feat(python): incremental indexing for fts index #761

Draft
wants to merge 157 commits into base: main

Conversation

changhiskhan
Contributor

  • append
  • delete
  • merge
  • update
  • compaction
  • checkout
  • restore
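
A minimal sketch of how these operations look from the lancedb Python API (method names reflect my reading of the public API at the time and are meant only to illustrate which writes an incremental FTS index has to track, not this PR's implementation):

```python
import lancedb

db = lancedb.connect("./.lancedb")
tbl = db.create_table("docs", data=[{"id": 1, "text": "hello world"}])
tbl.create_fts_index("text")  # initial full-text index

# Each of these table-level writes should be reflected in the FTS index
# incrementally instead of requiring a full rebuild:
tbl.add([{"id": 2, "text": "another document"}])                 # append
tbl.delete("id = 1")                                             # delete
tbl.update(where="id = 2", values={"text": "updated document"})  # update
# merge, compaction, checkout, and restore similarly need the index to be
# brought back in sync with the table data.
```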

AyushExel and others added 30 commits October 8, 2023 23:11
Co-authored-by: Lance Release <lance-dev@lancedb.com>
Co-authored-by: Rob Meng <rob.xu.meng@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: rmeng <rob@lancedb.com>
Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
If you run the README JavaScript example in TypeScript, it complains
that the type of `limit` is a function and cannot be set to a number.
Add `to_list` to return query results as a list of Python dicts (so we're
not too pandas-centric). Closes #555

Add `to_pandas` API and add deprecation warning on `to_df`. Closes #545

Co-authored-by: Chang She <chang@lancedb.com>
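
A quick usage sketch of the two result formats (the table name, column names, and query vector here are placeholders):

```python
import lancedb

db = lancedb.connect("./.lancedb")
tbl = db.create_table("vectors", data=[{"vector": [0.1, 0.2], "item": "a"}])

# Results as a list of plain Python dicts (not pandas-centric):
rows = tbl.search([0.1, 0.2]).limit(10).to_list()

# Results as a pandas DataFrame; to_df() still works but now warns that it
# is deprecated in favor of to_pandas():
df = tbl.search([0.1, 0.2]).limit(10).to_pandas()
```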
A little verbose, but better than being non-discoverable 
![Screenshot from 2023-10-11 16-26-02](https://github.com/lancedb/lancedb/assets/15766192/9ba539a7-0cf8-4d9e-94e7-ce5d37c35df0)
This PR adds an overview of the embeddings docs:
- 2 ways to vectorize your data using LanceDB - explicit & implicit (sketched below)
- Explicit - manually vectorize your data using the `with_embeddings` function
- Implicit - automatically vectorize your data as it comes in, by ingesting
your embedding function details as table metadata
- Multi-modal example w/ disappearing embedding function
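
A rough sketch of the two paths (the embedding model, registry key, and column names here are placeholders, and the exact signatures of `with_embeddings` and the registry objects may differ from what's shown):

```python
import lancedb
import numpy as np
import pandas as pd
from lancedb.embeddings import with_embeddings, get_registry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("./.lancedb")

# Explicit: vectorize the data yourself (here with a toy embedding
# function) before creating the table.
def embed(texts):
    return np.array([[float(len(t)), 1.0] for t in texts])

data = with_embeddings(embed, pd.DataFrame({"text": ["hello", "world"]}))
db.create_table("explicit", data=data)

# Implicit: register an embedding function; its details are stored as
# table metadata and applied automatically as data is ingested.
model = get_registry().get("sentence-transformers").create()

class Doc(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

tbl = db.create_table("implicit", schema=Doc)
tbl.add([{"text": "hello"}, {"text": "world"}])
```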
Add more APIs to remote table for Node SDK
* `add` rows
* `overwrite` table with rows
* `create` table

This has been tested against the dev stack
Fix broken link to embedding functions

testing: verified after a local docs build that the broken link has been
repaired

---------

Co-authored-by: Chang She <chang@lancedb.com>
Allows creation of funnels and user journeys
Sets things up for this -> #579
- Just separates out the registry/ingestion code from the function
implementation code
- adds a `get_registry` util
- package name "open-clip" -> "open-clip-torch"
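
For example, the new util gives a one-liner for grabbing an embedding function from the registry (the registry key shown is the one whose package was renamed above; `create()` arguments are left at their defaults):

```python
from lancedb.embeddings import get_registry

# Fetch the OpenCLIP embedding function; its implementation now depends on
# the "open-clip-torch" package rather than "open-clip".
registry = get_registry()
clip = registry.get("open-clip").create()
```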
To include latest v0.8.6

Co-authored-by: Chang She <chang@lancedb.com>
closes #564

---------

Co-authored-by: Chang She <chang@lancedb.com>
wjones127 and others added 29 commits December 20, 2023 13:06
This brings in some important bugfixes related to `take` and aarch64
Linux. See changes at:
https://github.com/lancedb/lance/releases/tag/v0.9.1
This forces the user to replace the whole FTS directory when re-creating
the index, preventing duplicate data from being created. Previously, the
whole dataset was re-added to the existing index, duplicating existing
rows in the index.

This (in combination with lancedb/lance#1707) caused #726, since the
duplicate data emitted duplicate indices for `take()` and an upstream
issue caused those queries to fail.

This solution isn't ideal, since it makes the FTS index temporarily
unavailable while the index is built. In the future, we should have
multiple FTS index directories, which would allow atomic commits of new
indexes (as well as multiple indexes for different columns).

Fixes #498.
Fixes #726.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
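
A hedged sketch of what this looks like from the Python API (`tbl` is an existing table with a `text` column; I'm assuming the `replace` keyword on `create_fts_index`, and the exact flag may differ):

```python
# Rebuilding the full-text index now replaces the FTS directory wholesale
# instead of re-adding every row to the existing Tantivy index, which used
# to duplicate rows in the index.
tbl.create_fts_index("text", replace=True)

# Note: the index is temporarily unavailable while the rebuild runs.
hits = tbl.search("duplicate rows").limit(5).to_list()
```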
This PR adds issue templates, which help with two recurring issues:

* Users forget to tell us whether they are using the Node or Python SDK
* Issues don't get appropriate tags

This doesn't force the use of the templates. Because we set
`blank_issues_enabled: true`, users can still create a custom issue.
Use pathlib for local paths so that the correct path separator is used on
Windows.

Closes #703

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
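
Illustrative of the change (not the exact code from this PR): join local paths with `pathlib.Path` so the separator is handled per-platform.

```python
from pathlib import Path

def table_uri(base_dir: str, table_name: str) -> str:
    # Path inserts the right separator ("\\" on Windows, "/" elsewhere)
    # instead of hard-coding "/" in string concatenation.
    return str(Path(base_dir) / f"{table_name}.lance")

# On Windows: table_uri(r"C:\data\db", "docs") -> r"C:\data\db\docs.lance"
```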
This is a pretty direct binding to the underlying lance capability.
This command hasn't been run for a while...
Fix some grammar, punctuation, and spelling errors.
For object detection, each row may correspond to an image and each image
can have multiple bounding boxes of x-y coordinates. This means that a
`bbox` field is potentially "list of list of float". This adds support
in our pydantic-pyarrow conversion for nested lists.
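
A sketch of the kind of model this enables (the conversion helper shown is from `lancedb.pydantic` as I understand it; field names are made up):

```python
from typing import List

import pyarrow as pa
from lancedb.pydantic import LanceModel, pydantic_to_schema

class Detection(LanceModel):
    image_uri: str
    # One image, many boxes: each box is e.g. [xmin, ymin, xmax, ymax],
    # so the field is a list of lists of floats.
    bbox: List[List[float]]

schema = pydantic_to_schema(Detection)
# The nested annotation maps to a nested pyarrow list type,
# e.g. list<item: list<item: double>>.
assert pa.types.is_list(schema.field("bbox").type)
```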
Closes #721 

FTS returns results as a pyarrow table. Pyarrow tables have a `filter`
method, but it does not take SQL filter strings (only pyarrow compute
expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: if duckdb is available, use duckdb to execute the SQL
filter string on the pyarrow table.
Backup path: otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`.

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20 ms on SSD) to write the
result set to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, which makes the post-filter evaluation much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from the SQL string. Or,
if there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA).

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
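
A sketch of the two post-filter paths on an FTS result table (column names and paths are made up; duckdb's replacement scan over local pyarrow variables and `lance.write_dataset`/`to_table(filter=...)` are standard calls, but this is not the PR's exact code):

```python
import pyarrow as pa

fts_results = pa.table({"foo": [1, 5, 5], "text": ["a", "b", "c"]})

# Default path: duckdb executes the SQL filter string directly against the
# in-memory pyarrow table (it resolves `fts_results` by variable name).
import duckdb
filtered = duckdb.query("SELECT * FROM fts_results WHERE foo = 5").to_arrow_table()

# Backup path: write the results to a lance dataset and use its filter
# support; this costs an extra write (~20 ms on an SSD).
import lance
ds = lance.write_dataset(fts_results, "/tmp/fts_results.lance")
filtered = ds.to_table(filter="foo = 5")
```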
If you add timezone information in the `Field` annotation for a datetime,
it will now be passed to the pyarrow data type.

I'm not sure how pyarrow enforces timezones; right now it silently
coerces to the timezone given in the column, regardless of whether the
input had a matching timezone. This is probably not the right behavior.
Though we could just make the user do that validation in the pydantic
model instead of at the pyarrow conversion layer.
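
The coercion described above, shown at the pyarrow layer (a small illustration of the behavior, not lancedb code):

```python
from datetime import datetime, timezone
import pyarrow as pa

# The column type carries the timezone from the Field annotation.
ts_type = pa.timestamp("us", tz="America/New_York")

# An input carrying a different timezone (UTC here) is accepted and
# silently converted into the column's timezone instead of being rejected.
val = datetime(2023, 12, 20, 12, 0, tzinfo=timezone.utc)
arr = pa.array([val], type=ts_type)
```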
The API has changed significantly; namely, `openai.Embedding.create` no
longer exists. See openai/openai-python#742.

Update the OpenAI embedding function and put a minimum on the openai SDK
version.
Issue separate requests under the hood and concatenate the results.
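
For reference, the post-1.0 call shape the embedding function has to use now (the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# openai.Embedding.create(...) is gone in openai>=1.0; the client-based
# call replaces it. Large batches can be handled by issuing separate
# requests and concatenating the resulting vectors.
resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["hello world", "incremental indexing"],
)
vectors = [d.embedding for d in resp.data]
```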