[WIP] Hybrid search in python #713

changhiskhan · 2023-12-17T00:42:35Z

Proposed v1 hybrid search:

API: `table.search(, type="hybrid").rerank(weight=0.5, normalize="rank").limit(10)`

Constraints:

The table must have an embedding function configured
Raise an exception if more than one vector column or more than one FTS index is detected

Behavior:

Do vector search with limit
Do fts search with limit
Combine the scores and rerank according to the `rerank` parameters.

`weight` determines the weight of the vector search score (after normalization); the FTS score gets the remaining weight.
`normalize` can be:
`None` - just raw scores
`"auto"` (default) - same as `"rank"`
`"rank"` - use the rank of the result as the reranking score
`"score"` - convert the scores to standard normal (z-scores) before combining and reranking.
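
The behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: both result sets are assumed to be `{row_id: relevance}` dicts where higher is better (a real vector search returns distances, which would need inverting first), and the helper names `rank_scores`, `z_scores`, `combine` and the `fill` parameter for rows missing from one result set are hypothetical.

```python
from statistics import mean, pstdev


def rank_scores(scores):
    """Rank-based scores in (0, 1]; the best raw score maps to 1.0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [0.0] * len(scores)
    for rank, idx in enumerate(order):
        ranked[idx] = 1.0 - rank / len(scores)
    return ranked


def z_scores(scores):
    """Standardize scores to zero mean and unit variance."""
    if not scores:
        return []
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


def combine(vector_hits, fts_hits, weight=0.5, normalize="auto", fill=0.0):
    """Weighted combination of two {row_id: relevance} dicts (assumed shape).

    `weight` applies to the vector score, `1 - weight` to the FTS score.
    `fill` (an assumption) is the score used when a row is missing from one side.
    """
    def norm(hits):
        ids = list(hits)
        raw = [hits[i] for i in ids]
        if normalize in ("auto", "rank"):
            vals = rank_scores(raw)
        elif normalize == "score":
            vals = z_scores(raw)
        elif normalize is None:
            vals = raw
        else:
            raise ValueError(f"unknown normalize option: {normalize!r}")
        return dict(zip(ids, vals))

    v, f = norm(vector_hits), norm(fts_hits)
    combined = {
        rid: weight * v.get(rid, fill) + (1 - weight) * f.get(rid, fill)
        for rid in set(v) | set(f)
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

With `vector_hits = {"a": 0.9, "b": 0.5, "c": 0.1}` and `fts_hits = {"b": 7.0, "d": 3.0}`, the default rank normalization puts `b` first, since it is the only row that appears in both result sets.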

Co-authored-by: Lance Release <lance-dev@lancedb.com> Co-authored-by: Rob Meng <rob.xu.meng@gmail.com> Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com> Co-authored-by: rmeng <rob@lancedb.com> Co-authored-by: Chang She <chang@lancedb.com> Co-authored-by: Rok Mihevc <rok@mihevc.org>

This PR adds an overview of embeddings docs:
- 2 ways to vectorize your data using lancedb: explicit & implicit
  - Explicit: manually vectorize your data using the `with_embeddings` function
  - Implicit: automatically vectorize your data as it comes in, by ingesting your embedding function details as table metadata
- Multi-modal example w/ disappearing embedding function

…rch (#693) Note that the filter/where is currently only implemented for LocalTable, so it requires an explicit cast to "enable" it (see new unit test). The alternative is to add it to the Table interface, but since it's not available on RemoteTable this may cause some user experience issues.

Most recent release failed because `release` depends on `node-macos`, but we renamed `node-macos` to `node-macos-{x86,arm64}`. This fixes that by consolidating them back to a single `node-macos` job, which also has the side effect of making the file shorter.

AyushExel · 2024-01-18T10:07:24Z

Going to work on top of this branch. I think this makes sense; the only problem is that there isn't one right answer for reranking that fits every case, so maybe we can cover all 3:

- One fast but simple reranking method: the current one
- One slower but more effective: Transformers-based cross-encoders
- Allow a custom callable as the reranking function

I guess then it would cover all bases.

And also run some comparisons on a simple dataset.
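
One way to read the three options above is as a single "reranker is just a callable" interface. The sketch below is a hypothetical illustration of that idea, not the API that was eventually merged: the `Hits` shape, `weighted_sum`, `model_based`, and `hybrid_rerank` are made-up names, and `score_fn` merely stands in for a cross-encoder model.

```python
from typing import Callable, Dict, List, Tuple

Hits = Dict[str, float]  # row id -> relevance, higher is better (assumed shape)
Reranker = Callable[[Hits, Hits], List[Tuple[str, float]]]


def weighted_sum(weight: float = 0.5) -> Reranker:
    """Fast, simple option: weighted sum of both scores (missing score = 0)."""
    def rerank(vector_hits: Hits, fts_hits: Hits) -> List[Tuple[str, float]]:
        ids = set(vector_hits) | set(fts_hits)
        scored = [
            (i, weight * vector_hits.get(i, 0.0) + (1 - weight) * fts_hits.get(i, 0.0))
            for i in ids
        ]
        # Sort by score descending, then by id so ties are deterministic.
        return sorted(scored, key=lambda p: (-p[1], p[0]))
    return rerank


def model_based(score_fn: Callable[[str], float]) -> Reranker:
    """Slower option: re-score every candidate with an external model.

    `score_fn` is a placeholder for something like a cross-encoder.
    """
    def rerank(vector_hits: Hits, fts_hits: Hits) -> List[Tuple[str, float]]:
        ids = set(vector_hits) | set(fts_hits)
        return sorted(((i, score_fn(i)) for i in ids), key=lambda p: (-p[1], p[0]))
    return rerank


def hybrid_rerank(vector_hits: Hits, fts_hits: Hits,
                  reranker: Reranker, limit: int = 10) -> List[Tuple[str, float]]:
    """Apply any callable with the Reranker signature, then truncate."""
    return reranker(vector_hits, fts_hits)[:limit]
```

A user-supplied function or lambda with the same two-argument signature can be passed to `hybrid_rerank` directly, which covers the "custom callable" case without any subclassing.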

Based on #713.

- The Reranker API can also be plugged into vector-only or FTS-only search, but this PR doesn't do that (see example: https://txt.cohere.com/rerank/)

### Default reranker -- `LinearCombinationReranker(weight=0.7, fill=1.0)`

```
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```

### Available rerankers

LinearCombinationReranker

```
from lancedb.rerankers import LinearCombinationReranker

# Same as default
table.search("hello", query_type="hybrid").rerank(
    normalize="score",
    reranker=LinearCombinationReranker()
).to_pandas()

# With custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
    normalize="score",
    reranker=reranker
).to_pandas()
```

Cohere Reranker

```
from lancedb.rerankers import CohereReranker

# Default model; English and multi-lingual supported.
# See docstring for available custom params.
table.search("hello", query_type="hybrid").rerank(
    normalize="rank",  # score or rank
    reranker=CohereReranker()
).to_pandas()
```

CrossEncoderReranker

```
from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
    normalize="rank",
    reranker=CrossEncoderReranker()
).to_pandas()
```

## Using custom Reranker

```
from lancedb.rerankers import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_results, fts_results):
        # Merge both result sets, or use custom combination logic instead
        combined_res = self.merge_results(vector_results, fts_results)
        # Custom rerank logic here
        return combined_res
```

- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (Seeing weird result from cohere reranker in the toy example)
- Support diverse rerankers by default:
  - [x] Cross encoding
  - [x] Cohere
  - [x] Reciprocal Rank Fusion

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
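
For reference, Reciprocal Rank Fusion from the checklist above can be sketched in a few lines. This is a generic illustration of the technique, not the implementation in that PR: the function name is hypothetical, each input is assumed to be a list of row ids ordered best-first, and `k=60` is the smoothing constant commonly used for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of ids into one scored ranking.

    Each id receives 1 / (k + rank) from every ranking it appears in
    (rank starts at 1); contributions are summed across rankings.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


vector_ranking = ["a", "b", "c"]  # ids from vector search, best first
fts_ranking = ["b", "d", "a"]     # ids from full-text search, best first
fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
```

Because RRF only looks at ranks, it needs no score normalization, which makes it a natural companion to the `normalize="rank"` option.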

AyushExel and others added 30 commits October 8, 2023 23:11

Use query.limit(..) in README (#543)

ababc3f

If you run the README javascript example in typescript, it complains that the type of limit is a function and cannot be set to a number.

feat: add to_list and to_pandas api's (#556)

e1ae2bc

Add `to_list` to return query results as list of python dict (so we're not too pandas-centric). Closes #555 Add `to_pandas` API and add deprecation warning on `to_df`. Closes #545 Co-authored-by: Chang She <chang@lancedb.com>

[Docs] Improve visibility of table ops (#553)

e41894b

A little verbose, but better than being non-discoverable ![Screenshot from 2023-10-11 16-26-02](https://github.com/lancedb/lancedb/assets/15766192/9ba539a7-0cf8-4d9e-94e7-ce5d37c35df0)

feat: cleanup and compaction (#518)

db7bdef

#488

Add cohere embedding function (#550)

683824f

[python] Bump version: 0.3.0 → 0.3.1

1096da0

Bump version: 0.3.0 → 0.3.1

8f6e955

Updating package-lock.json

0bdc714

Updating package-lock.json

f762a66

chore: bump lance to 0.8.5 (#561)

eff94ec

Bump lance to 0.5.8

docs: switch python examples to be row based (#554)

6d66404

feat(python,js): deletion operation on remote tables (#568)

fe64fc4

docs: show source of documented functions (#569)

043e388

implement remote api calls for table mutation (#567)

345c136

Add more APIs to remote table for Node SDK
* `add` rows
* `overwrite` table with rows
* `create` table

This has been tested against dev stack

Bump version: 0.3.1 → 0.3.2

02c35d3

Updating package-lock.json

bc85a74

Updating package-lock.json

1b8cda0

doc: fix broken link and add README (#573)

bb01ad5

Fix broken link to embedding functions testing: broken link was verified after local docs build to have been repaired --------- Co-authored-by: Chang She <chang@lancedb.com>

Add pyarrow date and timestamp type conversion from pydantic (#576)

86efb11

list table pagination draft (#574)

d46bc5d

[Docs] Add posthog telemetry to docs (#577)

7372656

Allows creation of funnels and user journeys

[Python]Embeddings API refactor (#580)

0293bbe

Sets things up for this -> #579
- Just separates out the registry/ingestion code from the function implementation code
- adds a `get_registry` util
- package name "open-clip" -> "open-clip-torch"

[Docs] Update embedding function docs (#581)

a8c7f80

chore: bump lance version in python/rust lancedb (#584)

0ed39b6

To include latest v0.8.6 Co-authored-by: Chang She <chang@lancedb.com>

Bump version: 0.3.2 → 0.3.3

0a30591

Updating package-lock.json

f36fea8

Updating package-lock.json

6bd3a83

[Docs]Versioning docs (#586)

06b5b69

closes #564 --------- Co-authored-by: Chang She <chang@lancedb.com>

albertlockett and others added 17 commits December 13, 2023 14:53

Update in Node & Rust (#696)

63ee8fa

Co-authored-by: Will Jones <willjones127@gmail.com>

feat(python): add update query support for Python (#654)

d087e78

Closes #69 Will not pass until lancedb/lance#1585 is released

[python] Bump version: 0.3.4 → 0.3.5

600bfd7

Bump version: 0.3.9 → 0.3.10

9ec526f

Updating package-lock.json

b6f0a31

feat: support nested pydantic schema (#707)

bd0034a

feat: allow custom column name in query (#709)

7c09b9b

feat: pass vector column name to remote backend (#710)

2d78bff

Pass the vector column name to remote as well. `vector_column` is already part of `Query`; this just declares it as part of `remote.VectorQuery` as well.

implement update for remote clients (#706)

57207ef

chore: fix package lock (#711)

ce58ea7

Bump version: 0.3.10 → 0.3.11

57e5695

Updating package-lock.json

2e4ea7d

[python] Bump version: 0.3.5 → 0.3.6

a060804

Updating package-lock.json

ff9872f

initial code for hybrid search

cb64630

changhiskhan force-pushed the changhiskhan/hybrid-search branch from 3c60e4d to cb64630 Compare December 17, 2023 00:53

changhiskhan added 4 commits December 16, 2023 19:24

runs

ff7ba79

fix fill and normalization and ordering

147ce11

run requests in parallel

304e78f

lint

13728bb

prrao87 mentioned this pull request Dec 18, 2023

Develop a hybrid search test bench based on a reference dataset #719

Open

AyushExel mentioned this pull request Jan 18, 2024

feat(python): Hybrid search & Reranker API #824

Merged

6 tasks

westonpace force-pushed the main branch from 93c8786 to 9fee384 Compare April 5, 2024 23:40

alexkohler pushed a commit to alexkohler/lancedb that referenced this pull request Apr 20, 2024

Train OPQ and write rotation matrix to index file (lancedb#713)

3d0ddbf
