feat(python): Embedding model tuner #1221
base: main
Conversation
pa.Table.from_pydict(relevant_docs),
save_dir / "relevant_docs.lance",
mode=mode,
)
TODO:
- Put all 3 parts of the dataset into a single lance table
Probably doable, but the current design treats docs as a single unit, with any number of query and response pairs possible per doc.
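The trade-off the comment describes can be sketched with plain dicts. All key names and sample data here are illustrative, not the PR's actual schema: merging the three parts into one lance table means denormalising, since a single doc can be the relevant result for many queries.

```python
# Illustrative three-part layout (names are assumptions, not the PR's schema).
corpus = {"doc1": "LanceDB is an embedded vector database."}
queries = {"q1": "What is LanceDB?", "q2": "Is LanceDB embedded?"}
relevant_docs = {"q1": ["doc1"], "q2": ["doc1"]}  # many queries -> one doc

# Collapsing into a single table repeats each doc once per (query, doc) pair.
single_table = [
    {"query_id": qid, "query": queries[qid], "doc_id": did, "text": corpus[did]}
    for qid, doc_ids in relevant_docs.items()
    for did in doc_ids
]
```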
f"model = get_registry().get('sentence-transformers').create(name='./{self.path}')"  # noqa
)

def _wandb_callback(self, score, epoch, steps):
Future TODO:
- This integration doesn't work. Investigate
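For context on the callback shape: sentence-transformers calls `callback(score, epoch, steps)` after each evaluation during `model.fit()`. A dependency-free sketch of that contract, with the wandb call left as a comment (a missing `wandb.init()` before `wandb.log()` is a common cause of this kind of breakage, though the actual failure here is unconfirmed):

```python
# Dependency-free sketch of the (score, epoch, steps) callback contract.
history = []

def _wandb_callback(score: float, epoch: int, steps: int) -> None:
    # A working integration would need an active run, roughly:
    #   wandb.log({"eval_score": score, "epoch": epoch}, step=steps)
    # wandb.log() errors if wandb.init() was never called.
    history.append({"score": score, "epoch": epoch, "steps": steps})

_wandb_callback(0.81, 0, 500)
_wandb_callback(0.86, 1, 1000)
```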
res = model.evaluate(ds)
assert res is not None
Future TODO:
- This is an umbrella test. Need more granular ones
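One way the umbrella check could be split, using a hypothetical hit-rate stand-in for `model.evaluate` (not the PR's implementation): test bounds, a known fixture, and determinism rather than only `assert res is not None`.

```python
def hit_rate(ds):
    # Hypothetical stand-in for model.evaluate(ds): fraction of queries
    # whose relevant doc appears in the retrieved set.
    hits = sum(1 for row in ds if row["expected"] in row["retrieved"])
    return hits / len(ds)

ds = [
    {"expected": "doc1", "retrieved": ["doc1", "doc2"]},
    {"expected": "doc3", "retrieved": ["doc2"]},
]
score = hit_rate(ds)

# Granular checks instead of a single not-None assertion:
assert 0.0 <= score <= 1.0       # bounded metric
assert score == 0.5              # known fixture
assert hit_rate(ds) == score     # deterministic
```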
model: Any,
trainset: QADataset,
valset: Optional[QADataset] = None,
path: Optional[str] = "~/.lancedb/embeddings/models",
TODO: check conflict with .lance files
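One way to address the TODO is a hypothetical helper (not in the PR) that expands the default location and fails fast if model checkpoints would land next to existing `.lance` tables:

```python
from pathlib import Path

def resolve_model_dir(path: str = "~/.lancedb/embeddings/models") -> Path:
    # Hypothetical helper: expand the default path and refuse to mix model
    # checkpoints with existing lance data files in the same directory.
    model_dir = Path(path).expanduser()
    if model_dir.exists() and any(model_dir.glob("*.lance")):
        raise FileExistsError(f"{model_dir} already contains .lance tables")
    return model_dir
```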
llm: BaseLLM,
qa_generate_prompt_tmpl: str = DEFAULT_PROMPT_TMPL,
num_questions_per_chunk: int = 2,
) -> "QADataset":
Accept generators
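The suggestion can be sketched by iterating the chunk input lazily, so callers may pass either a list or a generator. Everything below is illustrative: the function name, IDs, and the placeholder question text standing in for the LLM call are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple

def build_qa_pairs(
    chunks: Iterable[str],           # accepts a list OR a generator
    num_questions_per_chunk: int = 2,
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    # Hypothetical generator-friendly flow: consume chunks lazily and
    # fabricate placeholder questions where the LLM call would go.
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, List[str]] = {}
    for doc_id, chunk in enumerate(chunks):
        for q in range(num_questions_per_chunk):
            qid = f"{doc_id}-{q}"
            queries[qid] = f"question {q} about: {chunk[:20]}"  # LLM stand-in
            relevant_docs[qid] = [str(doc_id)]
    return queries, relevant_docs

qa_queries, qa_rel = build_qa_pairs(c for c in ["alpha text", "beta text"])
```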
nodes: List["TextChunk"],
queries: Dict[str, str],
relevant_docs: Dict[str, List[str]],
) -> "QADataset":
TODO: test api with WANDS dataset
from abc import ABC, abstractmethod


class BaseEmbeddingTuner(ABC):
TODO: get rid of this
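A minimal reconstruction of the interface in the diff; the single abstract `fit` method and the subclass name are assumptions. With one required method and one concrete implementation, a plain class or a `Protocol` would serve equally well, which is presumably why the comment suggests dropping the ABC.

```python
from abc import ABC, abstractmethod

class BaseEmbeddingTuner(ABC):
    # Assumed shape of the abstract base: one required training entry point.
    @abstractmethod
    def fit(self) -> None:
        """Run the fine-tuning loop."""

class SentenceTransformersTuner(BaseEmbeddingTuner):
    def fit(self) -> None:
        pass  # the real tuner's training loop would go here
```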
Solves #1021
Design and usage doc - https://www.notion.so/LanceDB-High-Level-Specs-From-ML-perspective-f9b7470b1e4e4c9e8371ad28b574c185?pvs=4#d6a4f29edf3d4ced9954ab8a913ef9f0
Benchmarks for 5 epochs on 65-35 train test split - https://wandb.ai/cayush/lancedb_finetune?nw=nwusercayush