Data processing module issue: train_texts has length 1 #11

Open
LightingFx opened this issue Jan 22, 2024 · 0 comments

@LightingFx

Hello, I have a question about the data processing module. My pretraining (pt) data path contains 5 txt files, and they all load correctly during the loading stage, as shown below:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:13<00:00, 2.68s/it]
2024-01-22 17:31:05.728 | INFO | component.dataset:load_dataset:120 - Total num of training text: 5
2024-01-22 17:31:05.728 | INFO | component.dataset:load_dataset:123 - Start tokenizing data ...
0%| | 0/1 [00:00<?, ?it/s]
As you can see, once loading finishes and tokenizing starts, the total count is only 1. I looked at dataset.py in component and printed the length of train_texts inside load_dataset: it is indeed 1. In other words, when all the data is added to the train_texts list, an extra layer of [] is introduced. Given that, isn't the stride wrong when tokenizing with `for i in tqdm(range(0, len(train_texts), self.tokenize_batch))`? This step also frequently runs out of memory in my tests (320 GB of RAM), which looks like the loop misbehaving. I don't understand why the extra list layer appears; at which point in reading the txt files is it introduced?
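To illustrate what I mean, here is a minimal sketch of how such an extra list layer can arise (hypothetical code with made-up names, not the actual component/dataset.py):

```python
from tqdm import tqdm

# Hypothetical reconstruction of the suspected bug: all texts end up in one
# inner list that is appended (not extended), so train_texts becomes
# [[text1, ..., textN]] with len(train_texts) == 1.
all_texts = ["text-1", "text-2", "text-3", "text-4", "text-5"]

train_texts = []
train_texts.append(all_texts)    # bug: len(train_texts) == 1
# train_texts.extend(all_texts)  # fix: len(train_texts) == 5

tokenize_batch = 2  # stand-in for self.tokenize_batch
for i in tqdm(range(0, len(train_texts), tokenize_batch)):
    # With the nested list this loop runs exactly once, and the single
    # "batch" is the entire corpus -- tokenizing it all at once is what
    # could exhaust memory.
    batch = train_texts[i:i + tokenize_batch]
```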
If I change the code in load_dataset to `train_texts = train_texts[0]`, tqdm then appears to report the correct total, but will the data still be processed correctly this way? I would appreciate an answer.
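For comparison, `train_texts[0]` keeps only the first inner list, so it only works if everything landed in exactly one inner list; an explicit flatten would also cover the case of one inner list per file. A sketch, assuming train_texts is a list of lists of strings:

```python
from itertools import chain

# Assumed shape: train_texts is a list of lists of strings (e.g. one inner
# list per txt file). chain.from_iterable flattens one nesting level no
# matter how many inner lists there are.
train_texts = [["file1-text"], ["file2-text-a", "file2-text-b"]]
flat_texts = list(chain.from_iterable(train_texts))
print(len(flat_texts))  # 3 -- the true number of training texts
```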
