Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): A semantically searchable pydantic model #932

Open
wants to merge 321 commits into
base: main
Choose a base branch
from

Conversation

changhiskhan
Copy link
Contributor

@changhiskhan changhiskhan commented Feb 6, 2024

from typing import Optional
from lancedb.pydantic import SearchableModel, Vector
from lancedb.embeddings import get_registry

# Define embedding function
registry = get_registry()
openai = registry.get("openai").create(name="text-embedding-3-small", dim=256)

# Define model (using embedding function)
class Document(SearchableModel):
    id: int
    text: str = openai.SourceField()
    vector: Optional[Vector(openai.ndims())] = openai.VectorField(default=None)

# bind to database
db = lancedb.connect("~/.lancedb")
Document.bind(db)

# ingest data
Document.upsert([Document(id=1, text="hello world"),
                 Document(id=2, text="goodbye world")])

# search
Document.search("greetings").limit(1).get_instances()

The last line should return [Document(id=1, text='hello world', vector=FixedSizeList(dim=256))]

AyushExel and others added 30 commits November 10, 2023 15:02
Colab has pydantic 1.x by default and pydantic 1.x BaseModel objects
don't support weakref creation by default that we use to cache embedding
models
https://github.com/lancedb/lancedb/blob/main/python/lancedb/embeddings/utils.py#L206
. It needs to be added to slot.
Please note: this is not tested as we don't have a server here and
testing against a mock object wouldn't be that interesting.
Sitemap improves SEO by ranking pages and tracking updates.
This is to enable #641.
Should be merged after lancedb/lance#1587 is
released.
Readying for the next Lance release.
In this PR, I add a guide that lets you use Roboflow Inference to
calculate CLIP embeddings for use in LanceDB. This post was reviewed by
@AyushExel.
expose prefilter flag in vectordb rust code.
enable prefiltering in node js, both native and remote
was passing this at the wrong position
there's build failure for the rust artifact but the macos arm64 build
for npm publish still passed. So we had a silent failure for 2 releases.
By setting error to immediate this should cause fail immediately.
We had some build issues with npm publish for cross-compiling arm64
macos on an x86 macos runner. Switching to m1 runner for now until
someone has time to deal with the feature flags.

follow-up tracked here: #688
adding some badges
added a gif to readme for the vectordb repo

---------

Co-authored-by: kaushal07wick <kaushalc6@gmail.com>
eddyxu and others added 28 commits January 31, 2024 12:05
Passed the following tests

```ts
const keyId = process.env.AWS_ACCESS_KEY_ID;
const secretKey = process.env.AWS_SECRET_ACCESS_KEY;
const sessionToken = process.env.AWS_SESSION_TOKEN;
const region = process.env.AWS_REGION;

const db = await lancedb.connect({
  uri: "s3://bucket/path",
  awsCredentials: {
    accessKeyId: keyId,
    secretKey: secretKey,
    sessionToken: sessionToken,
  },
  awsRegion: region,
} as lancedb.ConnectionOptions);

  console.log(await db.createTable("test", [{ vector: [1, 2, 3] }]));
  console.log(await db.tableNames());
  console.log(await db.dropTable("test"))
```
- fix the repo link on npm
- add links for homepage and bug report
… runtime (#909)

Github action is deprecating old node-16 runtime.
This adds the python bindings requested in #870 The javascript/rust
bindings will be added in a future PR.
Adds capability to the remote python SDK to retry requests (fixes #911)

This can be configured through environment:
- `LANCE_CLIENT_MAX_RETRIES`= total number of retries. Set to 0 to
disable retries. default = 3
- `LANCE_CLIENT_CONNECT_RETRIES` = number of times to retry request in
case of TCP connect failure. default = 3
- `LANCE_CLIENT_READ_RETRIES` = number of times to retry request in case
of HTTP request failure. default = 3
- `LANCE_CLIENT_RETRY_STATUSES` = http statuses for which the request
will be retried. passed as comma separated list of ints. default `500,
502, 503`
- `LANCE_CLIENT_RETRY_BACKOFF_FACTOR` = controls time between retry
requests. see
[here](https://github.com/urllib3/urllib3/blob/23f2287eb526d9384dddeedb6f6345e263bb9b86/src/urllib3/util/retry.py#L141-L146).
default = 0.25

Only read requests will be retried:
- list table names
- query
- describe table
- list table indices

This does not add retry capabilities for writes as it could possibly
cause issues in the case where the retried write isn't idempotent. For
example, in the case where the LB times-out the request but the server
completes the request anyway, we might not want to blindly retry an
insert request.
<img width="837" alt="Screenshot 2024-02-01 at 4 23 34 PM"
src="https://github.com/lancedb/lancedb/assets/1305083/4f0f5c5a-2a24-4b00-aad1-ef80a593d964">
[
<img width="838" alt="Screenshot 2024-02-01 at 4 26 03 PM"
src="https://github.com/lancedb/lancedb/assets/1305083/ca073bc8-b518-4be3-811d-8a7184416f07">
](url)

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
@PrashantDixit0

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
This PR refactors how we handle read consistency: does the `LanceTable`
class always pick up modifications to the table made by other instance
or processes. Users have three options they can set at the connection
level:

1. (Default) `read_consistency_interval=None` means it will not check at
all. Users can call `table.checkout_latest()` to manually check for
updates.
2. `read_consistency_interval=timedelta(0)` means **always** check for
updates, giving strong read consistency.
3. `read_consistency_interval=timedelta(seconds=20)` means check for
updates every 20 seconds. This is eventual consistency, a compromise
between the two options above.

## Table reference state

There is now an explicit difference between a `LanceTable` that tracks
the current version and one that is fixed at a historical version. We
now enforce that users cannot write if they have checked out an old
version. They are instructed to call `checkout_latest()` before calling
the write methods.

Since `conn.open_table()` doesn't have a parameter for version, users
will only get fixed references if they call `table.checkout()`.

The difference between these two can be seen in the repr: Table that are
fixed at a particular version will have a `version` displayed in the
repr. Otherwise, the version will not be shown.

```python
>>> table
LanceTable(connection=..., name="my_table")
>>> table.checkout(1)
>>> table
LanceTable(connection=..., name="my_table", version=1)
```

I decided to not create different classes for these states, because I
think we already have enough complexity with the Cloud vs OSS table
references.

Based on #812
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet