Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to mark vectors as "userProvided" #4633

Merged
merged 16 commits into from
May 23, 2024
Merged

Conversation

dureuill
Copy link
Contributor

@dureuill dureuill commented May 14, 2024

Pull Request

Related issue

Fixes #4606

What does this PR do?

See usage in PRD

  • Extends the shape of the special _vectors field in documents.
    • previously, the _vectors field had to be an object, with each field the name of a configured embedder, and each value either null, an embedding (array of numbers), or an array of embeddings.
    • In this PR, the value of an embedder in the _vectors field can additionally be an object. The object has two fields:
      1. embeddings: null, an embedding (array of numbers), or an array of embeddings.
      2. userProvided: a boolean indicating if the vector was provided by the user.
    • The previous form embedder_or_array_of_embedders is semantically equivalent to:
    {
        "embeddings": embedder_or_array_of_embedders,
        "userProvided": true
    }
  • During the indexing step, the subfields and values of the _vectors field that have userProvided set to false are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template.
  • This allows importing embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change.
  • The dump process now uses this ability: it enriches the _vectors fields of documents with the embeddings that were autogenerated, marking them as not userProvided. This allows importing the vectors from a dump without regenerating them.

Tests

This PR adds the following tests

  • Long-needed hybrid search tests of a simple hf embedder
  • Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content.
  • Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected

@Kerollmops Kerollmops marked this pull request as draft May 15, 2024 15:19
@dureuill dureuill force-pushed the dont-regenerate-vecs-in-dump branch from 48776ca to 8a123a4 Compare May 20, 2024 08:35
@dureuill dureuill force-pushed the dont-regenerate-vecs-in-dump branch from 8a123a4 to 1c7f252 Compare May 20, 2024 08:36
@dureuill dureuill marked this pull request as ready for review May 20, 2024 08:37
@dureuill dureuill force-pushed the dont-regenerate-vecs-in-dump branch from 1c7f252 to f25d590 Compare May 20, 2024 08:58
@dureuill
Copy link
Contributor Author

dureuill commented May 20, 2024

Looks like the vectors in the snapshots are different from those computed in the CI... Investigating, but we might need a different way to check here.

@dureuill dureuill force-pushed the dont-regenerate-vecs-in-dump branch from f25d590 to 96e0de9 Compare May 20, 2024 12:22
@dureuill
Copy link
Contributor Author

I made sure the behavior when not passing embeddings in a document update (but only userProvided): it is the same as passing embeddings: null, which results in putting 0 vectors into arroy.

milli/src/error.rs Show resolved Hide resolved
milli/src/error.rs Outdated Show resolved Hide resolved
milli/src/index.rs Outdated Show resolved Hide resolved
milli/src/vector/parsed_vectors.rs Outdated Show resolved Hide resolved
milli/src/update/index_documents/typed_chunk.rs Outdated Show resolved Hide resolved
milli/src/vector/parsed_vectors.rs Outdated Show resolved Hide resolved
@dureuill dureuill requested a review from Kerollmops May 22, 2024 12:16
Kerollmops
Kerollmops previously approved these changes May 22, 2024
Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @dureuill. LGTM 🍸
bors merge

meili-bors bot added a commit that referenced this pull request May 22, 2024
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill

# Pull Request

## Related issue
Fixes #4606 

## What does this PR do?

[See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb)

- Extends the shape of the special `_vectors` field in documents.
    - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings.
    - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields:
      1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings.
      2. `userProvided`: a boolean indicating if the vector was provided by the user.
    - The previous form `embedder_or_array_of_embedders` is semantically equivalent to:
    ```json
    {
        "embeddings": embedder_or_array_of_embedders,
        "userProvided": true
    }
    ```
- During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template.
- This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change.
- The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them.

### Tests

This PR adds the following tests

- Long-needed hybrid search tests of a simple hf embedder
- Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content.
- Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
@Kerollmops
Copy link
Member

bors cancel

Copy link
Contributor

meili-bors bot commented May 22, 2024

Canceled.

@Kerollmops
Copy link
Member

bors merge

meili-bors bot added a commit that referenced this pull request May 22, 2024
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill

# Pull Request

## Related issue
Fixes #4606 

## What does this PR do?

[See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb)

- Extends the shape of the special `_vectors` field in documents.
    - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings.
    - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields:
      1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings.
      2. `userProvided`: a boolean indicating if the vector was provided by the user.
    - The previous form `embedder_or_array_of_embedders` is semantically equivalent to:
    ```json
    {
        "embeddings": embedder_or_array_of_embedders,
        "userProvided": true
    }
    ```
- During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template.
- This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change.
- The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them.

### Tests

This PR adds the following tests

- Long-needed hybrid search tests of a simple hf embedder
- Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content.
- Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Copy link
Contributor

meili-bors bot commented May 22, 2024

This PR was included in a batch that successfully built, but then failed to merge into main. It will not be retried.

Additional information:

{"message":"At least 1 approving review is required by reviewers with write access.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors merge

meili-bors bot added a commit that referenced this pull request May 22, 2024
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill

# Pull Request

## Related issue
Fixes #4606 

## What does this PR do?

[See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb)

- Extends the shape of the special `_vectors` field in documents.
    - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings.
    - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields:
      1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings.
      2. `userProvided`: a boolean indicating if the vector was provided by the user.
    - The previous form `embedder_or_array_of_embedders` is semantically equivalent to:
    ```json
    {
        "embeddings": embedder_or_array_of_embedders,
        "userProvided": true
    }
    ```
- During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template.
- This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change.
- The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them.

### Tests

This PR adds the following tests

- Long-needed hybrid search tests of a simple hf embedder
- Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content.
- Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Copy link
Contributor

meili-bors bot commented May 22, 2024

Build failed:

  • Tests on windows-2022

@Kerollmops
Copy link
Member

bors merge

meili-bors bot added a commit that referenced this pull request May 22, 2024
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill

# Pull Request

## Related issue
Fixes #4606 

## What does this PR do?

[See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb)

- Extends the shape of the special `_vectors` field in documents.
    - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings.
    - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields:
      1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings.
      2. `userProvided`: a boolean indicating if the vector was provided by the user.
    - The previous form `embedder_or_array_of_embedders` is semantically equivalent to:
    ```json
    {
        "embeddings": embedder_or_array_of_embedders,
        "userProvided": true
    }
    ```
- During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template.
- This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change.
- The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them.

### Tests

This PR adds the following tests

- Long-needed hybrid search tests of a simple hf embedder
- Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content.
- Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Copy link
Contributor

meili-bors bot commented May 22, 2024

Build failed:

@dureuill
Copy link
Contributor Author

bors merge

meili-bors bot added a commit that referenced this pull request May 22, 2024
4633: Allow to mark vectors as "userProvided" r=dureuill a=dureuill

# Pull Request

## Related issue
Fixes #4606 

## What does this PR do?

[See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb)

- Extends the shape of the special `_vectors` field in documents.
    - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings.
    - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields:
      1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings.
      2. `userProvided`: a boolean indicating if the vector was provided by the user.
    - The previous form `embedder_or_array_of_embedders` is semantically equivalent to:
    ```json
    {
        "embeddings": embedder_or_array_of_embedders,
        "userProvided": true
    }
    ```
- During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template.
- This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change.
- The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them.

### Tests

This PR adds the following tests

- Long-needed hybrid search tests of a simple hf embedder
- Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content.
- Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Copy link
Contributor

meili-bors bot commented May 22, 2024

Build failed:

  • Tests on windows-2022

@Kerollmops
Copy link
Member

bors merge

Copy link
Contributor

meili-bors bot commented May 23, 2024

@meili-bors meili-bors bot merged commit 14bc80e into main May 23, 2024
8 of 10 checks passed
@meili-bors meili-bors bot deleted the dont-regenerate-vecs-in-dump branch May 23, 2024 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid regenerating embeddings at dump import
2 participants