Add new splitter to process QA type file(now only support JSON) and add Toggle button in knowledge_base page #3298
base: master
Conversation
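Hello, I made the changes following this code, but in the end the text was still split by ChineseRecursiveTextSplitter. I see the same thing in your screenshot.
Solved. A network problem caused a different splitter to be selected by default.
Yes, that's right. If huggingface cannot be reached, the default splitter is used.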
You can follow the logic that vectorizes uploaded files; partway through, it reaches this point: You can see that it decides based on [...] I haven't studied this in much depth either. For my own use case: 1. If a local model is used, [...] However, I just took a closer look, and you could try changing [...]

    try:
        # ...
        if text_splitter_dict[splitter_name]["source"] == "tiktoken":  # load from tiktoken
            # ...
        elif text_splitter_dict[splitter_name]["source"] == "huggingface":  # load from huggingface
            # ...
        else:
            try:
                text_splitter = TextSplitter(
                    pipeline="zh_core_web_sm",
                    chunk_size=chunk_size,
                    chunk_overlap=chunk_overlap
                )
            except:
                text_splitter = TextSplitter(
                    chunk_size=chunk_size,
                    chunk_overlap=chunk_overlap
                )
    except Exception as e:
        # Any failure above (for example, huggingface being unreachable) lands here
        # and silently falls back to RecursiveCharacterTextSplitter.
        print(e)
        text_splitter_module = importlib.import_module('langchain.text_splitter')
        TextSplitter = getattr(text_splitter_module, "RecursiveCharacterTextSplitter")
        text_splitter = TextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    # ...
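The fallback behaviour discussed above (a failed load quietly ending up on a default splitter) can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical and stands in for the project's code: `FallbackSplitter`, `unreachable_loader`, and `build_with_fallback` are illustrative names, not part of the repository or of langchain. The only point is that returning a flag makes the fallback visible to the caller instead of hiding it behind a `print`.

```python
class FallbackSplitter:
    """Hypothetical stand-in for the default splitter used on failure."""

    def __init__(self, chunk_size: int, chunk_overlap: int):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap


def unreachable_loader(**kwargs):
    """Hypothetical loader that simulates huggingface being unreachable."""
    raise ConnectionError("cannot reach huggingface")


def build_with_fallback(primary, fallback, **kwargs):
    """Try the primary factory; on any error build the fallback instead.

    Returns (splitter, used_fallback) so callers can tell which one they got.
    """
    try:
        return primary(**kwargs), False
    except Exception as exc:
        print(f"primary splitter failed ({exc!r}); using fallback")
        return fallback(**kwargs), True


splitter, used_fallback = build_with_fallback(
    unreachable_loader, FallbackSplitter, chunk_size=250, chunk_overlap=50
)
```

Checking `used_fallback` (or logging the class name of the returned splitter) is one way to confirm which splitter was actually used for a given upload.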
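Hello, a question: when initializing the database I used qa_text_splitter.py, but I only want to vectorize the question part, not the answer. How can I do that? Right now, after using qa_text_splitter.py, the whole Q-A pair gets vectorized.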
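I put together a simple implementation: in the embed_documents method of EmbeddingsFunAdapter in base.py, I use a regular expression at vectorization time to extract the question from texts, so that only the question is vectorized.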
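A minimal sketch of the regex idea, not the commenter's actual code. It assumes each chunk is laid out as `question: ...` followed by `answer: ...` on a new line; that layout, the pattern `QUESTION_RE`, and the helper `questions_only` are assumptions made for illustration, so the pattern would need adjusting to whatever qa_text_splitter.py really emits.

```python
import re
from typing import List

# Assumed chunk layout: "question: <text>\nanswer: <text>".
# Accepts both ASCII and full-width colons.
QUESTION_RE = re.compile(
    r"question\s*[::]\s*(.*?)\s*(?:\n\s*answer\s*[::]|$)",
    re.IGNORECASE | re.DOTALL,
)


def questions_only(texts: List[str]) -> List[str]:
    """Return the question part of each chunk, or the chunk itself if none is found."""
    result = []
    for text in texts:
        match = QUESTION_RE.search(text)
        if match and match.group(1):
            result.append(match.group(1))
        else:
            # Not a QA chunk (or empty question): embed the text unchanged.
            result.append(text)
    return result


sample = [
    "question: How do I reset my password?\nanswer: Use the settings page.",
    "An ordinary paragraph with no QA structure.",
]
extracted = questions_only(sample)
```

Such a helper would be applied to `texts` just before they are handed to the embedding model, while the stored document keeps the full question and answer.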
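I think converting texts directly to a dict and taking the value of question would also work, wrapped in a try. I want to do this during database initialization and incremental updates, and I'm not considering the front-end page for now. With only the questions vectorized, the retrieval threshold can be set lower and the matches are more precise.
Hi, I changed the code at the location you pointed to, but the print doesn't seem to be triggered. Are you sure QA mode goes through this method?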
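The dict-based variant could look roughly like this. It is only a sketch under the assumption that each chunk is a JSON object string with a `question` key; the helper name `question_or_text` and the key name are hypothetical.

```python
import json


def question_or_text(text: str, key: str = "question") -> str:
    """Parse the chunk as JSON and return its question; fall back to the raw text."""
    try:
        record = json.loads(text)
        return str(record[key])
    except (ValueError, KeyError, TypeError):
        # ValueError: not valid JSON; KeyError: no such key;
        # TypeError: valid JSON, but not an object (e.g. a list or a number).
        return text


to_embed = [
    question_or_text(chunk)
    for chunk in [
        '{"question": "What is the refund policy?", "answer": "30 days."}',
        "plain text that is not JSON",
    ]
]
```

Compared with a regular expression, this does not depend on the exact text layout, but it only works if the chunks stay valid JSON all the way to the embedding step.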
I wrote a new splitter to improve the processing of QA-type knowledge (it currently supports only JSON, as shown in the example). I also added a toggle button on the knowledge_base page to switch between the QA splitter and the normal splitter (ChineseRecursiveTextSplitter, defined in kb_config.py).
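To make the idea concrete, here is a rough sketch of what a QA-aware split can look like. It is not the code in this PR: the input format (a JSON array of objects with `question` and `answer` keys) and the function name `split_qa_json` are assumptions for illustration. The point is that each pair becomes exactly one chunk, so a question is never separated from its answer by a length-based splitter.

```python
import json
from typing import List


def split_qa_json(raw: str) -> List[str]:
    """Return one chunk per complete QA pair found in a JSON document."""
    records = json.loads(raw)
    if isinstance(records, dict):
        # Allow a single QA object as well as a list of them.
        records = [records]

    chunks = []
    for record in records:
        if not isinstance(record, dict):
            continue  # skip entries that are not objects
        question = str(record.get("question", "")).strip()
        answer = str(record.get("answer", "")).strip()
        if question and answer:
            chunks.append(
                json.dumps({"question": question, "answer": answer}, ensure_ascii=False)
            )
    return chunks


raw_file = json.dumps(
    [
        {"question": "What formats are supported?", "answer": "Only JSON for now."},
        {"question": "Missing answer"},
        {"question": "Can I switch splitters?", "answer": "Yes, with the toggle."},
    ]
)
qa_chunks = split_qa_json(raw_file)
```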
I created a PR because I noticed that many people are encountering the same issue (#3164, #893, and others).
Here are the updated page and test results for the QA splitter: