-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding functionality for multi-column ingestion into vector databases and skills #8990
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this! Just need to update the rest of the embedding handlers, and (preferably) add some tests.
|
||
# rename model's 'embedding_context' column to 'content' | ||
df = df.rename( | ||
columns={TableField.CONTEXT.value: TableField.CONTENT.value} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently only works with langchain_embedding_handler
, because it is the only handler that adds this embedding_context
column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha, adding it to the sentence transformer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@QuantumPlumber perhaps worth us creating a base Embedding class like we have for vector stores
@@ -161,7 +161,7 @@ def predict(self, df: DataFrame, args) -> DataFrame: | |||
embeddings = model.embed_documents(df_texts.tolist()) | |||
|
|||
# create a new dataframe with the embeddings | |||
df_embeddings = df.copy().assign(**{target: embeddings}) | |||
df_embeddings = df.copy().assign(**{'embedding_context': df_texts, target: embeddings}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need to update this for all embedding handlers (e.g. https://github.com/mindsdb/mindsdb/blob/staging/mindsdb/integrations/handlers/sentence_transformers_handler/sentence_transformers_handler.py#L66)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sentence transformer is more transparent, the input to the model is just the document, so we can duplicate that entry in the dataframe.
@@ -126,6 +128,15 @@ def insert(self, df: pd.DataFrame): | |||
df_emb = self._df_to_embeddings(df) | |||
df = pd.concat([df, df_emb], axis=1) | |||
|
|||
# drop original 'content' column if it exists | |||
if TableField.CONTENT.value in df.columns: | |||
df = df.drop(TableField.CONTENT.value, axis='columns') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original content
may be different from the embedding_context
and is good to have still. Is there any major downside to keeping both columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original content we can name something else.. I dropped it because we have to rename the embedding_context
as the content to preserve the rest of the functionality.
This PR should be covered by #9005 |
Description
Current vector database and knowledge base implementation does not allow for multi-column insertion, even though the
langchain_embeddings
handler does support multi-column embedding. Thelangchain_embeddings
handler concatentates the row entries as context for the embedding, but does not return this concatenation. Therefore, the actual context provided to the embedding model remains hidden. This PR explicitly returns the embedding context as theembedding_context
column.The knowledge base insert statement is modified to handle the protected
embedding_context
column.embedding_context
is registered as a protected column name for the vector database integration.Fixes #issue_number
Type of change
(Please delete options that are not relevant)
Verification Process
To ensure the changes are working as expected:
Tested locally on most up-to-date staging branch.
Additional Media:
Checklist: