Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Embedding model tuner #1221

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open

feat(python): Embedding model tuner #1221

wants to merge 18 commits into from

Conversation

AyushExel
Copy link
Contributor

@AyushExel AyushExel commented Apr 15, 2024

pa.Table.from_pydict(relevant_docs),
save_dir / "relevant_docs.lance",
mode=mode,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO:

  • Put all 3 parts of the dataset into a single lance table

Probably doable but the current design treats docs as a single unit with any number of query and response pairs possible..

f"model = get_registry().get('sentence-transformers').create(name='./{self.path}')" # noqa
)

def _wandb_callback(self, score, epoch, steps):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future TODO:

  • This integration doesn't work. Investigate

res = model.evaluate(ds)
assert res is not None


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future TODO:

  • This is an umbrella test. Need more granular ones

model: Any,
trainset: QADataset,
valset: Optional[QADataset] = None,
path: Optional[str] = "~/.lancedb/embeddings/models",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: check conflict with .lance files

llm: BaseLLM,
qa_generate_prompt_tmpl: str = DEFAULT_PROMPT_TMPL,
num_questions_per_chunk: int = 2,
) -> "QADataset":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accept generators

nodes: List["TextChunk"],
queries: Dict[str, str],
relevant_docs: Dict[str, List[str]],
) -> "QADataset":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: test api with WANDS dataset

from abc import ABC, abstractmethod


class BaseEmbeddingTuner(ABC):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: get rid of this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant