Contributions of high-quality Chinese dialogue datasets welcome #2

Open
CrazyBoyM opened this issue Apr 20, 2024 · 3 comments
Labels
help wanted Extra attention is needed

Comments

@CrazyBoyM
Owner

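As everyone knows, building a good Chinese dialogue model depends on good datasets, but that data is mostly held by the vendors. At most they open-source the model weights, while the important, core datasets stay out of reach. I'm opening this thread so we can pool ideas!
If you have suggestions for good datasets or for data worth translating (preferably in 1v1 dialogue form), including datasets for tool calling and agent capabilities, please leave a comment below.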

You are also welcome to join the community chat group.

@pensacola1989

If we used ChatGPT to generate some data, would the results be any good?

@qwas982

qwas982 commented Apr 24, 2024

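Just use a large model to generate the dataset,
the way Meta used llama2 to generate datasets for training llama3. They did build a data-generation pipeline so the whole thing could run automatically, though; no wonder they ended up with a corpus as huge as 15T tokens.
High-quality datasets plus an MoE model: wouldn't that double the intelligence?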

@qwas982

qwas982 commented Apr 24, 2024

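There is also another approach, real-time training, as opposed to pretraining.
With that, whenever people chat with the model, their prompts become a continuous stream of training data.
Add the datasets that the model generates automatically,
and wouldn't that give you a massive amount of high-quality data?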

I think AGI will probably be achieved in about a year, or even within a few months.

@CrazyBoyM CrazyBoyM added the help wanted Extra attention is needed label Apr 28, 2024
3 participants