feat(python): A semantically searchable pydantic model #932

changhiskhan · 2024-02-06T06:25:16Z

from typing import Optional

import lancedb
from lancedb.pydantic import SearchableModel, Vector
from lancedb.embeddings import get_registry

# Define embedding function
registry = get_registry()
openai = registry.get("openai").create(name="text-embedding-3-small", dim=256)

# Define model (using embedding function)
class Document(SearchableModel):
    id: int
    text: str = openai.SourceField()
    vector: Optional[Vector(openai.ndims())] = openai.VectorField(default=None)

# bind to database
db = lancedb.connect("~/.lancedb")
Document.bind(db)

# ingest data
Document.upsert([Document(id=1, text="hello world"),
                 Document(id=2, text="goodbye world")])

# search
Document.search("greetings").limit(1).get_instances()

The last line should return [Document(id=1, text='hello world', vector=FixedSizeList(dim=256))]

Colab ships with pydantic 1.x by default, and pydantic 1.x BaseModel objects don't support weakref creation by default, which we rely on to cache embedding models (https://github.com/lancedb/lancedb/blob/main/python/lancedb/embeddings/utils.py#L206). `__weakref__` needs to be added to `__slots__`.
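The weakref limitation can be reproduced in plain Python, without pydantic. The sketch below is only an illustration of the mechanism: the class names and the `can_weakref` helper are hypothetical, and the `WeakValueDictionary` cache is a stand-in, not the caching code in LanceDB.

```python
import weakref


class NoWeakref:
    """Declares __slots__ without '__weakref__': instances cannot be weakly referenced."""

    __slots__ = ("name",)

    def __init__(self, name):
        self.name = name


class WithWeakref:
    """Same layout, but '__weakref__' is listed in __slots__."""

    __slots__ = ("name", "__weakref__")

    def __init__(self, name):
        self.name = name


def can_weakref(obj) -> bool:
    """Return True if a weak reference to obj can be created."""
    try:
        weakref.ref(obj)
        return True
    except TypeError:
        return False


# A weak-value cache only accepts values that support weak references.
cache = weakref.WeakValueDictionary()
model = WithWeakref("text-embedding-3-small")
cache["openai"] = model
```

Here `can_weakref(NoWeakref("a"))` is false, while the `WithWeakref` instance can be stored in the weak-value cache.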


@AyushExel


Passed the following tests:

```ts
const keyId = process.env.AWS_ACCESS_KEY_ID;
const secretKey = process.env.AWS_SECRET_ACCESS_KEY;
const sessionToken = process.env.AWS_SESSION_TOKEN;
const region = process.env.AWS_REGION;

const db = await lancedb.connect({
  uri: "s3://bucket/path",
  awsCredentials: {
    accessKeyId: keyId,
    secretKey: secretKey,
    sessionToken: sessionToken,
  },
  awsRegion: region,
} as lancedb.ConnectionOptions);

console.log(await db.createTable("test", [{ vector: [1, 2, 3] }]));
console.log(await db.tableNames());
console.log(await db.dropTable("test"));
```


Adds capability to the remote python SDK to retry requests (fixes #911)

This can be configured through environment variables:

- `LANCE_CLIENT_MAX_RETRIES` = total number of retries. Set to 0 to disable retries. default = 3
- `LANCE_CLIENT_CONNECT_RETRIES` = number of times to retry a request in case of TCP connect failure. default = 3
- `LANCE_CLIENT_READ_RETRIES` = number of times to retry a request in case of HTTP request failure. default = 3
- `LANCE_CLIENT_RETRY_STATUSES` = HTTP statuses for which the request will be retried, passed as a comma-separated list of ints. default `500, 502, 503`
- `LANCE_CLIENT_RETRY_BACKOFF_FACTOR` = controls time between retry requests. see [here](https://github.com/urllib3/urllib3/blob/23f2287eb526d9384dddeedb6f6345e263bb9b86/src/urllib3/util/retry.py#L141-L146). default = 0.25

Only read requests will be retried:

- list table names
- query
- describe table
- list table indices

This does not add retry capabilities for writes, as it could cause issues when the retried write isn't idempotent. For example, if the LB times out the request but the server completes it anyway, we might not want to blindly retry an insert request.
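As a rough sketch of how the variables above could be read with their documented defaults: the `RetrySettings` dataclass and `load_retry_settings` function are hypothetical names for illustration, not the SDK's actual implementation.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class RetrySettings:
    """Hypothetical container for the retry configuration."""

    max_retries: int
    connect_retries: int
    read_retries: int
    retry_statuses: List[int]
    backoff_factor: float


def load_retry_settings(env: Mapping[str, str] = os.environ) -> RetrySettings:
    """Read the LANCE_CLIENT_* variables, falling back to the documented defaults."""
    statuses = env.get("LANCE_CLIENT_RETRY_STATUSES", "500, 502, 503")
    return RetrySettings(
        max_retries=int(env.get("LANCE_CLIENT_MAX_RETRIES", "3")),
        connect_retries=int(env.get("LANCE_CLIENT_CONNECT_RETRIES", "3")),
        read_retries=int(env.get("LANCE_CLIENT_READ_RETRIES", "3")),
        # Comma-separated list of ints; whitespace around entries is tolerated.
        retry_statuses=[int(s) for s in statuses.split(",") if s.strip()],
        backoff_factor=float(env.get("LANCE_CLIENT_RETRY_BACKOFF_FACTOR", "0.25")),
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to check in isolation, for example `load_retry_settings({"LANCE_CLIENT_MAX_RETRIES": "0"})` to model disabled retries.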

<img width="837" alt="Screenshot 2024-02-01 at 4 23 34 PM" src="https://github.com/lancedb/lancedb/assets/1305083/4f0f5c5a-2a24-4b00-aad1-ef80a593d964">

[ <img width="838" alt="Screenshot 2024-02-01 at 4 26 03 PM" src="https://github.com/lancedb/lancedb/assets/1305083/ca073bc8-b518-4be3-811d-8a7184416f07"> ](url)

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>


@PrashantDixit0


This PR refactors how we handle read consistency: does the `LanceTable` class always pick up modifications to the table made by other instances or processes? Users have three options they can set at the connection level:

1. (Default) `read_consistency_interval=None` means it will not check at all. Users can call `table.checkout_latest()` to manually check for updates.
2. `read_consistency_interval=timedelta(0)` means **always** check for updates, giving strong read consistency.
3. `read_consistency_interval=timedelta(seconds=20)` means check for updates every 20 seconds. This is eventual consistency, a compromise between the two options above.

## Table reference state

There is now an explicit difference between a `LanceTable` that tracks the current version and one that is fixed at a historical version. We now enforce that users cannot write if they have checked out an old version. They are instructed to call `checkout_latest()` before calling the write methods.

Since `conn.open_table()` doesn't have a parameter for version, users will only get fixed references if they call `table.checkout()`.

The difference between these two can be seen in the repr: tables that are fixed at a particular version will have a `version` displayed in the repr. Otherwise, the version will not be shown.

```python
>>> table
LanceTable(connection=..., name="my_table")
>>> table.checkout(1)
>>> table
LanceTable(connection=..., name="my_table", version=1)
```

I decided not to create different classes for these states, because I think we already have enough complexity with the Cloud vs OSS table references.

Based on #812
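The three `read_consistency_interval` settings can be read as one decision rule. The function below is a minimal sketch of that rule under stated assumptions: `should_check_for_updates` and its parameters are hypothetical and are not taken from the `LanceTable` implementation.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_check_for_updates(
    interval: Optional[timedelta],
    last_checked: datetime,
    now: datetime,
) -> bool:
    """Decide whether a table reference should look for a newer version.

    interval=None         -> never check automatically (manual checkout_latest()).
    interval=timedelta(0) -> always check (strong read consistency).
    interval=timedelta(x) -> check once at least x has elapsed (eventual consistency).
    """
    if interval is None:
        return False
    return now - last_checked >= interval
```

With `timedelta(0)` the elapsed time is never smaller than the interval, so every read triggers a check, which matches the "always check" option.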

AyushExel and others added 30 commits November 10, 2023 15:02

fix: node send db header for GET requests (#646)

479f471

feat: add RemoteTable.version in Python (#644)

1c872ce

Please note: this is not tested as we don't have a server here and testing against a mock object wouldn't be that interesting.

fix: node remote implement table.countRows (#648)

797514b

Bump version: 0.3.6 → 0.3.7

ebcf9bf

Updating package-lock.json

54677b8

Updating package-lock.json

5372843

SaaS create_index API (#649)

1cf8a3e

[Docs][SEO] Add sitemap and robots.txt (#645)

87e5d86

Sitemap improves SEO by ranking pages and tracking updates.

[Docs]: Add Instructor embeddings and rate limit handler docs (#651)

ccfdf48

feat(python): expose index cache size (#655)

d8e3e54

This is to enable #641. Should be merged after lancedb/lance#1587 is released.

chore: upgrade lance to v0.8.17 (#656)

a57aa4b

Readying for the next Lance release.

Bump version: 0.3.7 → 0.3.8

123a49d

Updating package-lock.json

22749c3

[python] Bump version: 0.3.3 → 0.3.4

38321fa

Updating package-lock.json

2bb2bb5

fix: python remote correct open_table error message (#659)

6eb662d

(docs):Add CLIP image embedding example (#660)

b085d9a

In this PR, I add a guide that lets you use Roboflow Inference to calculate CLIP embeddings for use in LanceDB. This post was reviewed by @AyushExel.

chore: expose prefilter in lancedb rust (#674)

a2a8f96

expose prefilter flag in vectordb rust code.

feat: enable prefilter in node js (#675)

72765d8

enable prefiltering in node js, both native and remote

fix: fix passing prefilter flag to remote client (#677)

94c8c50

was passing this at the wrong position

Bump version: 0.3.8 → 0.3.9

23d30df

Updating package-lock.json

6bec4be

Updating package-lock.json

6c83b6a

chore: set error handling to immediate (#686)

1336cce

There was a build failure for the Rust artifact, but the macOS arm64 build for npm publish still passed, so we had a silent failure for 2 releases. Setting error handling to immediate should make the build fail immediately.

chore: update package lock (#689)

bbdebf2

saas python sdk doc (#692)

aca785f

<img width="256" alt="Screenshot 2023-12-07 at 11 55 41 AM" src="https://github.com/lancedb/lancedb/assets/1305083/259bf234-9b3b-4c5d-af45-c7f3fada2cc7">

chore: Use m1 runner for npm publish (#687)

244b691

We had some build issues with npm publish when cross-compiling for arm64 macOS on an x86 macOS runner. Switching to an M1 runner for now until someone has time to deal with the feature flags. Follow-up tracked here: #688

docs: Add badges (#694)

366e522

adding some badges
added a gif to readme for the vectordb repo

---------

Co-authored-by: kaushal07wick <kaushalc6@gmail.com>

Qian/minor fix doc (#695)

f6bbe19

eddyxu and others added 28 commits January 31, 2024 12:05

Bump version: 0.4.6 → 0.4.7

0c940ed

Updating package-lock.json

1328cd4

Updating package-lock.json

12b4fb4

arrow table/f16 example (#907)

f5726e2

fix the repo link on npm, add links for homepage and bug report (#910)

34e10ca

- fix the repo link on npm
- add links for homepage and bug report

ci: bump to new version of python action to use node 20 gIthub action…

62f053a

… runtime (#909) GitHub Actions is deprecating the old node-16 runtime.

feat: upgrade to lance 0.9.11 and expose merge_insert (#906)

d77e95a

This adds the Python bindings requested in #870. The JavaScript/Rust bindings will be added in a future PR.

docs: add cleanup_old_versions and compact_files to Table for docum…

cc9473a

…entation purposes (#900) Closes #819

feat: add merge_insert to the node and rust APIs (#915)

7f8637a

chore: bump pylance version to latest in pyproject.toml (#918)

7783393

[python] Bump version: 0.5.1 → 0.5.2

ce2242e

Bump version: 0.4.7 → 0.4.8

04e1662

Updating package-lock.json

e4bb042

Updating package-lock.json

12a98de

fix: revert safe_import_pandas usage (#921)

688c57a

[python] Bump version: 0.5.2 → 0.5.3

a908822

fix hybrid search example (#922)

e412194

chore: add global cargo config to enable minimal cpu target (#925)

0b0f425

* Closes #895
* Fix cargo clippy

feat(python): add support new openai embedding functions (#912)

738511c

@PrashantDixit0

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>

add failing unit test

f13b6d4

searchable model

62aec6e

ruff

8a6ed76

ruff

3d59403

ruff

587221a

changhiskhan marked this pull request as ready for review February 15, 2024 04:47

westonpace force-pushed the main branch from 93c8786 to 9fee384 Compare April 5, 2024 23:40
